Allocations

For high-performance software in Swift, it’s often important to understand where your heap allocations are coming from. The next step can then be to reduce the number of allocations your software makes.

This is very similar to other performance questions: Before you can optimise performance you need to understand where you spend your resources. And resources can be CPU time, as well as memory, or heap allocations. In this document we will solely focus on the number of heap allocations, not their size.

On macOS, you can use Instruments’s “Allocations” instrument. The Allocations instrument shows you two sets of values: The live allocations (i.e. allocated and not freed) as well as the transient allocations (all allocations made).

Your production workloads however will likely run on Linux and depending on your setup the number of allocations can differ significantly between macOS and Linux.

Preparation

To not waste your time, be sure to do any profiling in release mode. Swift’s optimiser will produce significantly faster code which will also allocate less in release mode. Usually this means you need to run

swift run -c release

Install `perf`

Follow the installation instructions in the Linux perf utility guide.

Clone the `FlameGraph` project

To see some pretty graphs, clone the FlameGraph repository on the machine/container where you need it. The rest of this guide will assume that it’s available at /FlameGraph:

git clone https://github.com/brendangregg/FlameGraph

Tip: With Docker, you may want to bind mount the FlameGraph repository into the container using

docker run -it --rm \
           --privileged \
           -v "/path/to/FlameGraphOnYourMachine:/FlameGraph:ro" \
           -v "$PWD:PWD" -w "$PWD" \
           swift:latest

or similar.

Tools

In this guide, we will be using the Linux perf tool. If you’re struggling to get perf to work, have a look at our information regarding perf. If you’re running in a Docker container, don’t forget that you’ll need a privileged container. And generally, you will need root access, so you may need to prefix the commands with sudo.

Getting a `perf` user probe

In this guide, we will be counting the number of allocations. Most allocations from a Swift program (on Linux) will be done through the malloc function.

To get information about when an allocation function is called, we will install a perf “user probes” on the allocation functions. Because Swift also uses other allocation functions such as calloc and posix_memalign, we’ll install a user probe for them all. From then on, there will be an event in perf that will fire whenever one of the allocation functions is called.

# figures out the path to libc
libc_path=$(readlink -e /lib64/libc.so.6 /lib/x86_64-linux-gnu/libc.so.6)

# delete all existing user probes on libc (instead of * you can also list them individually)
perf probe --del 'probe_libc:*'

# installs a probe on `malloc`, `calloc`, and `posix_memalign`
perf probe -x "$libc_path" --add malloc --add calloc --add posix_memalign

The result (hopefully) looks somewhat like this:

Added new events:
  probe_libc:malloc    (on malloc in /usr/lib/x86_64-linux-gnu/libc-2.31.so)
  probe_libc:calloc    (on calloc in /usr/lib/x86_64-linux-gnu/libc-2.31.so)
  probe_libc:posix_memalign (on posix_memalign in /usr/lib/x86_64-linux-gnu/libc-2.31.so)

[...]

What perf is telling you here is that it added a new events called probe_libc:malloc, probe_libc:calloc, … which will fire every time the respective function is called.

Let’s confirm that our probe_libc:malloc probe actually works by running:

perf stat -e probe_libc:malloc -- bash -c 'echo Hello World'

which should output something like

Hello World

 Performance counter stats for 'bash -c echo Hello World':

              1021      probe_libc:malloc

       0.003840500 seconds time elapsed

       0.000000000 seconds user
       0.003867000 seconds sys

Which seems to have allocated 1021 times, great. If that probe fired 0 times, something went wrong.

Running the allocation analysis

After we have confirmed that our user probe on malloc works in general, let’s dial it up a little. The first thing we’ll need is a program that we’d like to analyse the allocations of.

For example, we could analyse a program which does 10 subsequent HTTP requests using AsyncHTTPClient. If you’re interested in the full source code, please expand below.

Demo program source code

With the following dependencies

dependencies: [
    .package(url: "https://github.com/swift-server/async-http-client.git", from: "1.3.0"),
    .package(url: "https://github.com/apple/swift-nio.git", from: "2.29.0"),
    .package(url: "https://github.com/apple/swift-log.git", from: "1.4.2"),
],

We could write this program

import AsyncHTTPClient
import NIO
import Logging

let urls = Array(repeating:"http://httpbin.org/get", count: 10)
var logger = Logger(label: "ahc-alloc-demo")

logger.info("running HTTP requests", metadata: ["count": "\(urls.count)"])
MultiThreadedEventLoopGroup.withCurrentThreadAsEventLoop { eventLoop in
    let httpClient = HTTPClient(eventLoopGroupProvider: .shared(eventLoop),
                                backgroundActivityLogger: logger)

    func doRemainingRequests(_ remaining: ArraySlice<String>,
                             overallResult: EventLoopPromise<Void>,
                             eventLoop: EventLoop) {
        var remaining = remaining
        if let first = remaining.popFirst() {
            httpClient.get(url: first, logger: logger).map { [remaining] _ in
                eventLoop.execute { // for shorter stacks
                    doRemainingRequests(remaining, overallResult: overallResult, eventLoop: eventLoop)
                }
            }.whenFailure { error in
                overallResult.fail(error)
            }
        } else {
            return overallResult.succeed(())
        }
    }

    let promise = eventLoop.makePromise(of: Void.self)
    // Kick off the process
    doRemainingRequests(urls[...],
                        overallResult: promise,
                        eventLoop: eventLoop)

    promise.futureResult.whenComplete { result in
        switch result {
        case .success:
            logger.info("all HTTP requests succeeded")
        case .failure(let error):
            logger.error("HTTP request failure", metadata: ["error": "\(error)"])
        }

        httpClient.shutdown { maybeError in
            if let error = maybeError {
                logger.error("AHC shutdown failed", metadata: ["error": "\(error)"])
            }
            eventLoop.shutdownGracefully { maybeError in
                if let error = maybeError {
                    logger.error("EventLoop shutdown failed", metadata: ["error": "\(error)"])
                }
            }
        }
    }
}

logger.info("exiting")

Assuming you have a program as a Swift package, we should first of all compile it in release mode using swift build -c release. Then you should find a binary called .build/release/your-program-name which we can then analyse.

Allocation counts

Before we go into visualising the allocations as a flame graph, let’s start with the simplest analysis: Getting the total number of allocations

perf stat -e 'probe_libc:*' -- .build/release/your-program-name

The above command instructs perf to run your program and count the number of times the probe_libc:malloc probe was hit. This should be the number of allocations done by your program.

The output should look something like

 Performance counter stats for '.build/release/your-program-name':

                68      probe_libc:posix_memalign
                35      probe_libc:calloc_1
                 0      probe_libc:calloc
              2977      probe_libc:malloc

[...]

In this case, my program allocated 2,977 times through malloc and a few more times through the other allocation functions. If you just want to compare the effects of a pull request you may just want to run this perf stat command twice. If you would like to find out where your allocations come from, read on.

Please note that in this guide we’ll use -e probe_libc:* instead of individually listing every event like -e probe_libc:malloc,probe_libc:calloc,probe_libc:calloc_1,probe_libc:posix_memalign. This assumes that you have no other perf user probes installed. If you do, please specify each event you would like to use individually.

Collecting the raw data

With perf, we can’t really create live graphs whilst the program is running. For most analyses, we want to first record some raw data (usually with perf record) and later on transform the recorded data into a graph.

To get started, let’s have perf run the program for us and collect the information using the libc_probe:malloc we set up before.

perf record --call-graph dwarf,16384 \
     -m 50000 \
     -e 'probe_libc:*' -- \
     .build/release/your-program-name

Let’s break down this command a little:

perf record instructs perf to record data, makes sense.
--call-graph dwarf,16384 instructs perf to use the DWARF information to create the call graphs. It also sets the maximum stack dump size to 16k which should be enough to get you full stack traces. Unfortunately, using DWARF is rather slow (see below) but it creates the best call graphs for you.
-m 50000: The size of the ring buffer that perf uses to buffer. This is given in multiples of PAGE_SIZE (usually 4kB) and especially with DWARF this needs to be pretty huge to prevent data loss.
-e 'probe_libc:*': Record when the malloc/calloc/… probes fire

What you want to see if output like this

<your program's output>
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 401.088 MB perf.data (49640 samples) ]

If perf tells you about “lost chunks” and asks you to “check the IO/CPU overhead”, you should jump to the ‘Overcoming “lost chunks”’ section at the end of this document.

Flame graphs

After a successful perf record, you can invoke the following command line to produce an SVG file with the flame graph

perf script | \
    /FlameGraph/stackcollapse-perf.pl - | \
    swift demangle --simplified | \
    /FlameGraph/flamegraph.pl --countname allocations \
        --width 1600 > out.svg

Let’s expand a little on what the above command does:

It runs perf script which dumps the binary information that perf record recorded into a textual form.
Next, we invoke stackcollapse-perf on it which transforms the stacks that perf script outputs into the right format for Flame Graphs,
then we invoke swift demangle --simplified which will give us nice symbol names,
and lastly we create the Flame Graph itself

After this command has run (which may run for a while), you should have an SVG file that you can open in your browser.

For the above example program, please see an example flame graph below. Note how you can hover over the stack frames and get more information. To focus on a sub tree, you can click any stack frame too.

Generally, in flame graphs, the X axis just means “count”, it does not mean time. In other words, whether a stack appears on the left or the right is not determined when that stack was live (this is different in flame charts).

Note that this flame graph is not a CPU flame graph, 1 sample means 1 allocation here and not time spent on the CPU. Also be aware that stack frames that appear wide don’t necessarily allocate directly, it means that they or something they call has allocated a lot. For example, BaseSocketChannel.readable is a very wide frame, and yet, it is not a function which allocates directly. However, it calls other functions (such as other parts of SwiftNIO and AsyncHTTPClient) that do allocate a lot. It may take a little while to get familiar with flame graphs but there are great resources available online.

Allocation flame graphs on macOS

So far, this tutorial focussed on Linux and the perf tool. You can however create the same graphs on macOS. The process is fairly similar.

First, let’s collect the raw data using DTrace.

sudo dtrace -n 'pid$target::malloc:entry,pid$target::posix_memalign:entry,pid$target::calloc:entry,pid$target::malloc_zone_malloc:entry,pid$target::malloc_zone_calloc:entry,pid$target::malloc_zone_memalign:entry { @s[ustack(100)] = count(); } ::END { printa(@s); }' -c .build/release/your-program > raw.stacks

Similar to perf’s user probes, dtrace also has probes and the above command instructs DTrace to aggregate the number of calls to the allocation functions malloc, posix_memalign, calloc, and the malloc_zone_* equivalents. On Apple platforms, Swift uses a slightly larger number of allocation functions than on Linux, therefore we need to specify a few more functions.

Once we collected the data, we can also create an SVG file using

cat raw.stacks |\
    /FlameGraph/stackcollapse.pl - | \
    swift demangle --simplified | \
    /FlameGraph/flamegraph.pl --countname allocations \
        --width 1600 > out.svg

which you will notice is very similar to the perf invocation. The only differences are:

We use cat raw.stacks instead of perf script because we already have the textual data in a file with DTrace
Instead of stackcollapse-perf.pl (which parses perf script output) we use stackcollapse.pl (which parses DTrace aggregation output)

Other `perf` tricks

Prettifying Swift’s allocation pattern

Allocations in Swift usually have a very distinct shape:

Some code creates for example a class instance (which allocates).
This calls swift_allocObject,
which calls swift_slowAlloc,
which calls malloc (where we have our probe).

To make our flame graphs look nicer, we can apply a small transformation after we have demangled the collapsed stacks:

sed -e 's/specialized //g' \
    -e 's/;swift_allocObject;swift_slowAlloc;__libc_malloc/;A/g'

which will get rid of "specialized " and replaces swift_allocObject calling swift_slowAlloc, calling malloc with just an A (for allocation). The full command will then look like

perf script | \
    /FlameGraph/stackcollapse-perf.pl - | \
    swift demangle --simplified | \
    sed -e 's/specialized //g' \
        -e 's/;swift_allocObject;swift_slowAlloc;__libc_malloc/;A/g' | \
    /FlameGraph/flamegraph.pl --countname allocations --flamechart --hash \
    > out.svg

Overcoming “lost chunks”

When using perf with the DWARF call stack unwinding, it is unfortunately easy to run into the following issue

[ perf record: Woken up 189 times to write data ]
Warning:
Processed 4346 events and lost 144 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 30.868 MB perf.data (3817 samples) ]

When perf tells you that it lost a number of chunks it means that it lost data. If perf lost data, you have a few options:

Reduce the amount of work your program is doing. For every allocation, perf will need to record a stack trace.
Reduce the maximum “stack dump” that perf records by changing the --call-graph dwarf parameter to for example --call-graph dwarf,2048. The default is to record a maximum of 4096 bytes which gives you pretty deep stacks, if you don’t need that you can reduce the number. The tradeoff is that the flame graph may show you [unknown] stack frames which means that there are missing stack frames there. The unit is bytes.
You can raise the number of the -m parameter which is the size of the ring buffer that perf uses in memory (in multiples of PAGE_SIZE, usually that is 4kB)
You can give up nice call graphs and replace --call-tree dwarf with --call-tree fp (fp stands for frame pointer).