Allocations
For high-performance software in Swift, it’s often important to understand where your heap allocations are coming from. The next step can then be to reduce the number of allocations your software makes.
This is very similar to other performance questions: Before you can optimise performance you need to understand where you spend your resources. And resources can be CPU time, as well as memory, or heap allocations. In this document we will solely focus on the number of heap allocations, not their size.
On macOS, you can use Instruments’s “Allocations” instrument. The Allocations instrument shows you two sets of values: The live allocations (i.e. allocated and not freed) as well as the transient allocations (all allocations made).
Your production workloads however will likely run on Linux and depending on your setup the number of allocations can differ significantly between macOS and Linux.
Preparation
To not waste your time, be sure to do any profiling in release mode. Swift’s optimiser will produce significantly faster code which will also allocate less in release mode. Usually this means you need to run
swift run -c release
Install perf
Follow the installation instructions in the Linux perf utility guide.
Clone the FlameGraph project
To see some pretty graphs, clone the FlameGraph repository on the machine/container where you need it. The rest of this guide will assume that it’s available at /FlameGraph:
git clone https://github.com/brendangregg/FlameGraph
Tip: With Docker, you may want to bind mount the FlameGraph repository into the container using
docker run -it --rm \
--privileged \
-v "/path/to/FlameGraphOnYourMachine:/FlameGraph:ro" \
-v "$PWD:$PWD" -w "$PWD" \
swift:latest
or similar.
Tools
In this guide, we will be using the Linux perf tool. If you’re struggling to get perf to work, have a look at our information regarding perf. If you’re running in a Docker container, don’t forget that you’ll need a privileged container. And generally, you will need root access, so you may need to prefix the commands with sudo.
Getting a perf user probe
In this guide, we will be counting the number of allocations. Most allocations from a Swift program (on Linux) will be done through the malloc function.
To get information about when an allocation function is called, we will install perf “user probes” on the allocation functions. Because Swift also uses other allocation functions such as calloc and posix_memalign, we’ll install a user probe for them all. From then on, there will be an event in perf that will fire whenever one of the allocation functions is called.
# figures out the path to libc
libc_path=$(readlink -e /lib64/libc.so.6 /lib/x86_64-linux-gnu/libc.so.6)
# delete all existing user probes on libc (instead of * you can also list them individually)
perf probe --del 'probe_libc:*'
# installs a probe on `malloc`, `calloc`, and `posix_memalign`
perf probe -x "$libc_path" --add malloc --add calloc --add posix_memalign
The result (hopefully) looks somewhat like this:
Added new events:
probe_libc:malloc (on malloc in /usr/lib/x86_64-linux-gnu/libc-2.31.so)
probe_libc:calloc (on calloc in /usr/lib/x86_64-linux-gnu/libc-2.31.so)
probe_libc:posix_memalign (on posix_memalign in /usr/lib/x86_64-linux-gnu/libc-2.31.so)
[...]
What perf is telling you here is that it added new events called probe_libc:malloc, probe_libc:calloc, … which will fire every time the respective function is called.
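The libc lookup above hands two candidate paths to readlink -e and relies on exactly one of them existing. As a minimal sketch, assuming GNU readlink, the following picks the first candidate that resolves. The helper name first_existing_lib is hypothetical (it is not part of perf or of this guide), and the aarch64 path is an assumption for ARM machines; adjust the list to your distribution.

```shell
# Hypothetical helper: prints the resolved path of the first candidate that
# exists and returns 0; returns 1 if none of the candidates exists.
first_existing_lib() {
    for candidate in "$@"; do
        # `readlink -e` (GNU coreutils) fails if the path does not exist.
        if resolved=$(readlink -e "$candidate" 2>/dev/null); then
            printf '%s\n' "$resolved"
            return 0
        fi
    done
    return 1
}

# The third path is an assumption for aarch64 systems.
libc_path=$(first_existing_lib \
    /lib64/libc.so.6 \
    /lib/x86_64-linux-gnu/libc.so.6 \
    /lib/aarch64-linux-gnu/libc.so.6) || echo "could not locate libc" >&2
```

The resulting libc_path can then be passed to perf probe -x exactly as in the command above.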
Let’s confirm that our probe_libc:malloc probe actually works by running:
perf stat -e probe_libc:malloc -- bash -c 'echo Hello World'
which should output something like
Hello World
Performance counter stats for 'bash -c echo Hello World':
1021 probe_libc:malloc
0.003840500 seconds time elapsed
0.000000000 seconds user
0.003867000 seconds sys
In this run, bash called malloc 1021 times, great. If the probe fired 0 times, something went wrong.
Running the allocation analysis
After we have confirmed that our user probe on malloc works in general, let’s dial it up a little. The first thing we’ll need is a program that we’d like to analyse the allocations of.
For example, we could analyse a program which does 10 consecutive HTTP requests using AsyncHTTPClient. If you’re interested in the full source code, it is shown below.
Demo program source code
With the following dependencies
dependencies: [
.package(url: "https://github.com/swift-server/async-http-client.git", from: "1.3.0"),
.package(url: "https://github.com/apple/swift-nio.git", from: "2.29.0"),
.package(url: "https://github.com/apple/swift-log.git", from: "1.4.2"),
],
We could write this program
import AsyncHTTPClient
import NIO
import Logging

let urls = Array(repeating: "http://httpbin.org/get", count: 10)
var logger = Logger(label: "ahc-alloc-demo")

logger.info("running HTTP requests", metadata: ["count": "\(urls.count)"])
MultiThreadedEventLoopGroup.withCurrentThreadAsEventLoop { eventLoop in
    let httpClient = HTTPClient(eventLoopGroupProvider: .shared(eventLoop),
                                backgroundActivityLogger: logger)

    func doRemainingRequests(_ remaining: ArraySlice<String>,
                             overallResult: EventLoopPromise<Void>,
                             eventLoop: EventLoop) {
        var remaining = remaining
        if let first = remaining.popFirst() {
            httpClient.get(url: first, logger: logger).map { [remaining] _ in
                eventLoop.execute { // for shorter stacks
                    doRemainingRequests(remaining, overallResult: overallResult, eventLoop: eventLoop)
                }
            }.whenFailure { error in
                overallResult.fail(error)
            }
        } else {
            return overallResult.succeed(())
        }
    }

    let promise = eventLoop.makePromise(of: Void.self)
    // Kick off the process
    doRemainingRequests(urls[...],
                        overallResult: promise,
                        eventLoop: eventLoop)

    promise.futureResult.whenComplete { result in
        switch result {
        case .success:
            logger.info("all HTTP requests succeeded")
        case .failure(let error):
            logger.error("HTTP request failure", metadata: ["error": "\(error)"])
        }

        httpClient.shutdown { maybeError in
            if let error = maybeError {
                logger.error("AHC shutdown failed", metadata: ["error": "\(error)"])
            }
            eventLoop.shutdownGracefully { maybeError in
                if let error = maybeError {
                    logger.error("EventLoop shutdown failed", metadata: ["error": "\(error)"])
                }
            }
        }
    }
}
logger.info("exiting")
Assuming you have a program as a Swift package, we should first of all compile it in release mode using swift build -c release. Then you should find a binary called .build/release/your-program-name which we can then analyse.
Allocation counts
Before we go into visualising the allocations as a flame graph, let’s start with the simplest analysis: Getting the total number of allocations
perf stat -e 'probe_libc:*' -- .build/release/your-program-name
The above command instructs perf to run your program and count the number of times each of the probe_libc probes (malloc, calloc, posix_memalign, …) was hit. Added together, these counts should be the number of allocations done by your program.
The output should look something like
Performance counter stats for '.build/release/your-program-name':
68 probe_libc:posix_memalign
35 probe_libc:calloc_1
0 probe_libc:calloc
2977 probe_libc:malloc
[...]
In this case, my program allocated 2,977 times through malloc and a few more times through the other allocation functions. If you just want to compare the effects of a pull request, you may just want to run this perf stat command twice: once before and once after the change. If you would like to find out where your allocations come from, read on.
Please note that in this guide we’ll use -e probe_libc:* instead of individually listing every event like -e probe_libc:malloc,probe_libc:calloc,probe_libc:calloc_1,probe_libc:posix_memalign. This assumes that you have no other perf user probes installed. If you do, please specify each event you would like to use individually.
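To compare two runs, a single total is often easier to work with than one line per probe. The following is a minimal sketch, assuming the perf stat report has the shape shown above (a count followed by the event name); sum_probe_counts is a hypothetical helper name, not a perf feature.

```shell
# Hypothetical helper: reads a `perf stat` report on stdin and prints the sum
# of all probe_libc:* counters. Assumes lines of the form
# "<count> probe_libc:<name>".
sum_probe_counts() {
    awk '
        $2 ~ /^probe_libc:/ {
            gsub(/,/, "", $1)   # drop thousands separators, if the locale adds them
            total += $1
        }
        END { print total + 0 }
    '
}
```

Because perf stat writes its report to stderr, a possible invocation is perf stat -e 'probe_libc:*' -- .build/release/your-program-name 2>&1 >/dev/null | sum_probe_counts. For the sample output above, this would print 3080.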
Collecting the raw data
With perf, we can’t really create live graphs whilst the program is running. For most analyses, we want to first record some raw data (usually with perf record) and later on transform the recorded data into a graph.
To get started, let’s have perf run the program for us and collect the information using the probe_libc probes we set up before.
perf record --call-graph dwarf,16384 \
-m 50000 \
-e 'probe_libc:*' -- \
.build/release/your-program-name
Let’s break down this command a little:
- perf record instructs perf to record data, makes sense.
- --call-graph dwarf,16384 instructs perf to use the DWARF information to create the call graphs. It also sets the maximum stack dump size to 16k, which should be enough to get you full stack traces. Unfortunately, using DWARF is rather slow (see below) but it creates the best call graphs for you.
- -m 50000: The size of the ring buffer that perf uses to buffer. This is given in multiples of PAGE_SIZE (usually 4kB) and, especially with DWARF, this needs to be pretty huge to prevent data loss.
- -e 'probe_libc:*': Record when the malloc/calloc/… probes fire.
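To get a feeling for how much memory an -m value asks for, the arithmetic is simply the number of pages times the page size. A minimal sketch follows; ring_buffer_bytes is a hypothetical helper name, and the result is only the nominal size of the request, since perf may adjust the page count it actually uses.

```shell
# Hypothetical helper: prints the nominal ring buffer size in bytes for a
# given `-m` value. The second argument (page size in bytes) is optional and
# defaults to the page size of the current machine.
ring_buffer_bytes() {
    pages=$1
    page_size=${2:-$(getconf PAGE_SIZE)}
    echo $(( pages * page_size ))
}

# With the -m 50000 used above and 4kB pages:
ring_buffer_bytes 50000 4096   # 204800000 bytes, roughly 195 MiB
```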
What you want to see is output like this
<your program's output>
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 401.088 MB perf.data (49640 samples) ]
If perf tells you about “lost chunks” and asks you to “check IO/CPU overload”, you should jump to the ‘Overcoming “lost chunks”’ section at the end of this document.
Flame graphs
After a successful perf record, you can invoke the following command line to produce an SVG file with the flame graph:
perf script | \
/FlameGraph/stackcollapse-perf.pl - | \
swift demangle --simplified | \
/FlameGraph/flamegraph.pl --countname allocations \
--width 1600 > out.svg
Let’s expand a little on what the above command does:
- It runs perf script, which dumps the binary information that perf record recorded into a textual form.
- Next, we invoke stackcollapse-perf on it, which transforms the stacks that perf script outputs into the right format for flame graphs,
- then we invoke swift demangle --simplified, which will give us nice symbol names,
- and lastly we create the flame graph itself.
After this command has run (which may run for a while), you should have an SVG file that you can open in your browser.
For the above example program, please see an example flame graph below. Note how you can hover over the stack frames and get more information. To focus on a sub tree, you can click any stack frame too.
Generally, in flame graphs, the X axis just means “count”, it does not mean time. In other words, whether a stack appears on the left or the right is not determined by when that stack was live (this is different in flame charts).
Note that this flame graph is not a CPU flame graph: 1 sample means 1 allocation here and not time spent on the CPU. Also be aware that stack frames that appear wide don’t necessarily allocate directly; a wide frame means that it, or something it calls, has allocated a lot. For example, BaseSocketChannel.readable is a very wide frame, and yet, it is not a function which allocates directly. However, it calls other functions (such as other parts of SwiftNIO and AsyncHTTPClient) that do allocate a lot. It may take a little while to get familiar with flame graphs but there are great resources available online.
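The width of a frame can also be computed directly from the collapsed stacks, which makes the point about “wide” frames concrete. The following is a minimal sketch, assuming the collapsed format of one semicolon-separated stack per line followed by a count; inclusive_count is a hypothetical helper name. It sums the counts of every stack that contains the given frame, no matter which function further down actually called malloc.

```shell
# Hypothetical helper: reads collapsed stacks ("frameA;frameB;frameC 42") on
# stdin and prints how many allocations happened in stacks that contain the
# given frame. The frame name has to match exactly.
inclusive_count() {
    awk -v frame="$1" '
        {
            count = $NF                    # the count is the last field on the line
            stack = $0
            sub(/ [0-9]+$/, "", stack)     # strip the trailing count
            n = split(stack, frames, ";")
            for (i = 1; i <= n; i++) {
                if (frames[i] == frame) { total += count; break }
            }
        }
        END { print total + 0 }
    '
}
```

A possible use is to pipe the output of swift demangle --simplified from the command above into inclusive_count with a frame name copied from the flame graph. The number printed should correspond to the width of that frame.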
Allocation flame graphs on macOS
So far, this tutorial focussed on Linux and the perf tool. You can however create the same graphs on macOS. The process is fairly similar.
First, let’s collect the raw data using DTrace.
sudo dtrace -n 'pid$target::malloc:entry,pid$target::posix_memalign:entry,pid$target::calloc:entry,pid$target::malloc_zone_malloc:entry,pid$target::malloc_zone_calloc:entry,pid$target::malloc_zone_memalign:entry { @s[ustack(100)] = count(); } ::END { printa(@s); }' -c .build/release/your-program > raw.stacks
Similar to perf’s user probes, DTrace also has probes, and the above command instructs DTrace to aggregate the number of calls to the allocation functions malloc, posix_memalign, calloc, and the malloc_zone_* equivalents. On Apple platforms, Swift uses a slightly larger number of allocation functions than on Linux, therefore we need to specify a few more functions.
Once we have collected the data, we can also create an SVG file using
cat raw.stacks |\
/FlameGraph/stackcollapse.pl - | \
swift demangle --simplified | \
/FlameGraph/flamegraph.pl --countname allocations \
--width 1600 > out.svg
which you will notice is very similar to the perf invocation. The only differences are:
- We use cat raw.stacks instead of perf script because we already have the textual data in a file with DTrace.
- Instead of stackcollapse-perf.pl (which parses perf script output) we use stackcollapse.pl (which parses DTrace aggregation output).
Other perf tricks
Prettifying Swift’s allocation pattern
Allocations in Swift usually have a very distinct shape:
- Some code creates for example a class instance (which allocates).
- This calls swift_allocObject,
- which calls swift_slowAlloc,
- which calls malloc (where we have our probe).
To make our flame graphs look nicer, we can apply a small transformation after we have demangled the collapsed stacks:
sed -e 's/specialized //g' \
-e 's/;swift_allocObject;swift_slowAlloc;__libc_malloc/;A/g'
which will get rid of "specialized " and replace swift_allocObject calling swift_slowAlloc, calling malloc (which shows up as __libc_malloc in the collapsed stacks) with just an A (for allocation). The full command will then look like
perf script | \
/FlameGraph/stackcollapse-perf.pl - | \
swift demangle --simplified | \
sed -e 's/specialized //g' \
-e 's/;swift_allocObject;swift_slowAlloc;__libc_malloc/;A/g' | \
/FlameGraph/flamegraph.pl --countname allocations --flamechart --hash \
> out.svg
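To see what the sed step does in isolation, it can be applied to a single collapsed stack. This is a small sketch: prettify_stacks is a hypothetical wrapper around the exact sed expressions from above, and the sample stack is made up for illustration.

```shell
# Hypothetical wrapper around the sed expressions used in the command above.
prettify_stacks() {
    sed -e 's/specialized //g' \
        -e 's/;swift_allocObject;swift_slowAlloc;__libc_malloc/;A/g'
}

# A made-up collapsed stack, in the format stackcollapse-perf.pl emits:
echo 'prog;specialized Foo.bar();swift_allocObject;swift_slowAlloc;__libc_malloc 7' \
    | prettify_stacks
# prints: prog;Foo.bar();A 7
```

Stacks that reach malloc in some other way (for example through calloc or posix_memalign) do not match the pattern and pass through with only "specialized " removed.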
Overcoming “lost chunks”
When using perf with the DWARF call stack unwinding, it is unfortunately easy to run into the following issue
[ perf record: Woken up 189 times to write data ]
Warning:
Processed 4346 events and lost 144 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 30.868 MB perf.data (3817 samples) ]
When perf tells you that it lost a number of chunks, it means that it lost data. If perf lost data, you have a few options:
- Reduce the amount of work your program is doing. For every allocation, perf will need to record a stack trace.
- Reduce the maximum “stack dump” that perf records by changing the --call-graph dwarf parameter to for example --call-graph dwarf,2048. The unit is bytes. perf’s default is to record a maximum of 8192 bytes, and the command in this guide uses 16384 to get pretty deep stacks; if you don’t need that, you can reduce the number. The tradeoff is that the flame graph may show you [unknown] stack frames, which means that there are missing stack frames there.
- You can raise the number of the -m parameter, which is the size of the ring buffer that perf uses in memory (in multiples of PAGE_SIZE, usually that is 4kB).
- You can give up nice call graphs and replace --call-graph dwarf with --call-graph fp (fp stands for frame pointer).