Mar 9th, 2023 @ justine's web page
I wrote a program called
rusage.com which provides the best
possible way to report resource usage statistics when launching command
line programs. It provides a much richer set of information than the
time command built into most shells:

$ time /bin/echo hi >/dev/null
real    0m0.001s
user    0m0.001s
sys     0m0.000s
Above we see that
time isn't very good, since it only reports
CPU usage at millisecond granularity. On the other hand,
rusage.com gives you the full power of the underlying
system calls, which
provide a richer set of higher-precision information.
$ rusage.com /bin/echo hi
hi
RL: took 551µs wall time
RL: ballooned to 1,472kb in size
RL: needed 478µs cpu (0% kernel)
RL: caused 89 page faults (100% memcpy)
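Numbers like these come straight from the kernel's per-process accounting. Here's a hedged sketch (not rusage.com's actual source) of the heart of such a tool: wait4(), which reaps the child and hands back the kernel's struct rusage for it.

```cpp
// Sketch of how a tool like rusage.com can gather its numbers:
// fork, exec the target command, then read the child's struct rusage
// out of wait4(), home of fields like ru_maxrss and ru_minflt.
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int run_and_report(const char *path, char *const argv[]) {
  pid_t pid = fork();
  if (pid == 0) {   // child: become the measured command
    execv(path, argv);
    _exit(127);     // only reached if exec failed
  }
  int status = 0;
  struct rusage ru = {};
  wait4(pid, &status, 0, &ru);  // like waitpid(), but also fills in rusage
  std::printf("RL: ballooned to %ldkb in size\n", ru.ru_maxrss);
  std::printf("RL: caused %ld page faults (%ld major)\n",
              ru.ru_minflt + ru.ru_majflt, ru.ru_majflt);
  std::printf("RL: needed %ldµs user cpu\n",
              ru.ru_utime.tv_sec * 1000000L + ru.ru_utime.tv_usec);
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that wait4() is a BSD/Linux extension; the strictly portable route is waitpid() followed by getrusage(RUSAGE_CHILDREN).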
rusage.com is distributed as a single-file dependency-free
executable that you can use on any AMD64 operating system, as well as
ARM64 platforms like Apple M1 and Raspberry Pi.
It's a 332kb download: a PE+ELF+MachO+ZIP+SH polyglot executable, plus examples and debug data,
designed for Linux/macOS/Windows/FreeBSD/NetBSD/OpenBSD on AMD64/ARM64.
Here's how easy it is to get started:
curl https://justine.lol/rusage/rusage.com >rusage.com
chmod +x rusage.com
./rusage.com COMMAND
The source code for
rusage.com is part of the Cosmopolitan
Libc monolithic repository.
You may build it yourself on Linux by running:
git clone https://github.com/jart/cosmopolitan cosmo
cd cosmo
make -j8 o//examples/rusage.com
The usefulness of this tool becomes apparent when running beefier
programs, like an artificial general intelligence model with thirteen
billion parameters. Here's what happens if we use
llama.cpp to load
Facebook's 13B LLaMA model. The code is currently implemented so that it uses
std::ifstream to trawl through a 12gb file of floating
point values, so they can be deserialized off spinning disk and loaded
into C++ tensor data structures.
In order to simulate what would happen the very first time a user runs
the model after booting up, we'll be using a hack to drop page caches,
which reveals the true cost of a program (noting that sudo
is tired and
doas is wired).
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
We'll now measure the classic, portable, "just stdio" way of loading the model:
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 75,260,133µs wall time
RL: ballooned to 8,272,800kb in size
RL: needed 29,599,572µs cpu (15% kernel)
RL: caused 6,784 page faults (100% memcpy)
RL: 62,178 context switches (99% consensual)
RL: performed 15,895,192 reads and 0 write i/o operations
Wow, it took 75 seconds to generate a single token. That's what makes this problem so important. Notice, however, how few page faults occur, and that they're all minor faults, which the tool reports as "100% memcpy". Minor faults never actually hit disk; they're just transitions of virtual memory to resident memory.
The most powerful system call of them all is mmap(), which
we all know can solve this problem, even if we don't know how. So let's
say we created our own drop-in replacement for
std::ifstream, which is what the llama.cpp codebase
currently uses, and we had it implement the i/o using
mmap(MAP_SHARED) instead. Our goal here would be to avoid
the overhead of making explicit
read() system calls.
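To make the idea concrete, here's a minimal sketch, assuming POSIX and nothing about llama.cpp's internals, of the kind of whole-file mapping such an ifstream replacement would hand out:

```cpp
// Illustrative sketch (not the actual llama.cpp patch): map an entire
// file read-only with MAP_SHARED, so the ifstream-style wrapper can
// serve "reads" as plain pointer arithmetic over the mapping.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

struct MappedFile {
  const char *data = nullptr;
  size_t size = 0;
};

// Maps `path` read-only. Returns {nullptr, 0} on failure.
MappedFile map_file(const char *path) {
  MappedFile mf;
  int fd = open(path, O_RDONLY);
  if (fd == -1) return mf;
  struct stat st;
  if (fstat(fd, &st) == 0 && st.st_size > 0) {
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
      mf.data = static_cast<const char *>(p);
      mf.size = st.st_size;
    }
  }
  close(fd);  // the mapping stays valid after the fd is closed
  return mf;
}
```

Choosing MAP_SHARED over MAP_PRIVATE here means the mapping is backed directly by the page cache, so multiple processes loading the same file share one copy of those pages.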
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,060,148µs wall time
RL: ballooned to 16,219,688kb in size
RL: needed 28,200,083µs cpu (12% kernel)
RL: caused 192,671 page faults (67% memcpy)
RL: 62,188 context switches (99% consensual)
RL: performed 15,896,968 reads and 0 write i/o operations
Here we see that using
mmap() instead of read() actually goes 1% slower in the first-run
scenario. That's because hard disks are still slow, and we're not even
cutting out any system call overhead; an i/o system call that happens
magically as a page fault is still a system call.
Let's see what happens if we also hint the kernel with madvise():
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,228,287µs wall time
RL: ballooned to 16,219,800kb in size
RL: needed 28,083,874µs cpu (12% kernel)
RL: caused 161,665 page faults (61% memcpy)
RL: 62,189 context switches (99% consensual)
RL: performed 15,896,992 reads and 0 write i/o operations
Here we see that
madvise() caused there to be fewer page
faults, since the kernel was likely performing some readahead in the
background. However, it didn't do much to improve the latency of this cold-start run.
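Wiring such a hint into the code is a one-liner on top of an existing mapping. The helper and flag choices below are illustrative assumptions, not necessarily the exact call used in these experiments:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Illustrative helper: tell the kernel we'll read this mapped region
// soon and in order, so it can schedule readahead in the background.
// Returns 0 if the kernel accepted both hints.
int hint_readahead(void *addr, size_t len) {
  if (madvise(addr, len, MADV_SEQUENTIAL) != 0) return -1;
  return madvise(addr, len, MADV_WILLNEED);
}
```

Both flags are advisory: the kernel is free to ignore them, which is part of why they often change so little in practice.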
How about we try madvise(MADV_WILLNEED)?
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,137,186µs wall time
RL: ballooned to 16,219,548kb in size
RL: needed 28,084,624µs cpu (13% kernel)
RL: caused 193,166 page faults (67% memcpy)
RL: 62,179 context switches (99% consensual)
RL: performed 15,897,472 reads and 0 write i/o operations
Once again, we see
madvise() doesn't do much for our use
case (which we'll question in a later section). That's because any
advantage you could hope to gain in terms of i/o performance here is
almost certainly already implemented by default. Kernel developers usually don't
make high performance an opt-in feature, since they don't want to get
roasted in the media at benchmark games.
Where advice does make sense is in batch-processing jobs, where you might want to
ask the kernel to be more conscientious and cool its heels. For example,
imagine you're logged into a multi-user mainframe like SDF. One of the
reasons they use NetBSD is because the Linux Kernel would salt the earth
if it'd make a single program faster. One user running just one program
shouldn't dominate the page caches in such a way that everything other
people are doing slows down. Alternatively, imagine that you're running
a nightly cron job to process huge files, a single time. In that case
you should use the MADV_DONTNEED flavor
of mmap + madvise, so pages you've finished with can be reclaimed right away.
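Here's a hedged sketch of that considerate single-pass pattern (a hypothetical helper; note that MADV_DONTNEED behaves differently for file-backed versus anonymous memory):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper for a one-pass batch job: once a chunk has been
// processed, tell the kernel those pages won't be needed again, so it
// can reclaim them instead of letting one cron job crowd the caches.
int done_with_chunk(void *addr, size_t len) {
  return madvise(addr, len, MADV_DONTNEED);
}
```

For anonymous private memory on Linux, MADV_DONTNEED drops the pages outright, and any later access sees fresh zero-filled pages.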
Now let's use
rusage.com to analyze the average case, where
the command is run multiple times, in which case the hard disk overhead
goes away. Here's what happens with classic i/o with a warm disk cache:
RL: took 4,027,307µs wall time
RL: ballooned to 8,272,344kb in size
RL: needed 26,135,090µs cpu (6% kernel)
RL: caused 6,647 page faults (100% memcpy)
RL: 62 context switches (29% consensual)
Now let's try using
mmap(MAP_SHARED) with a warm cache:
RL: took 3,326,692µs wall time
RL: ballooned to 16,219,332kb in size
RL: needed 24,780,516µs cpu (2% kernel)
RL: caused 130,365 page faults (100% memcpy)
RL: 59 context switches (30% consensual)
Here we see
mmap() can make our loading code 18% faster
than using a classic i/o approach with explicit read() calls.
Let's try to make it even faster. Here's
madvise(MADV_WILLNEED) with a warm cache:
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 3,498,649µs wall time
RL: ballooned to 16,220,136kb in size
RL: needed 25,158,517µs cpu (3% kernel)
RL: caused 130,624 page faults (100% memcpy)
RL: 80 context switches (31% consensual)
Once again, we see that for this particular use case,
madvise() doesn't do much. Neither would huge 2mb pages,
although it's not worth going into further details on how the Linux
Kernel implements that monstrosity.
The rusage.com command has helped us determine exactly how
much we'd stand to gain if we switched from a read()-based i/o model to an
mmap() i/o model: an average latency
reduction of 18%.
However, it's worth noting that we asked rusage the stupid question. As
great a tool as it may be, it unfortunately can't illuminate the things
we fail to consider. In this case, the true value of mmap()
is that it lets us load data structures into memory without having to
touch them at all. There's no reason for llama.cpp to scan through,
deserialize, and construct everything at startup, when it could simply
map already-constructed data. In that case, we wouldn't need
rusage.com to tell us that we'd gain a 100% reduction
in startup latency. This is particularly nice when the data structures are huge.
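To make that concrete, here's a hedged sketch of the zero-copy idea, using an invented toy format (not ggml's actual layout): if the file's bytes already are the in-memory representation, "loading" is just a map and a pointer cast.

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

// Toy on-disk format, invented for illustration: a fixed plain-old-data
// header followed immediately by raw float weights.
struct Header {
  uint32_t magic;     // identifies the file format
  uint32_t n_params;  // number of floats that follow
};

// "Loading" is just mapping the file and casting the pointer; there's
// no scanning, parsing, or copying pass at startup at all.
const Header *load_model(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd == -1) return nullptr;
  off_t size = lseek(fd, 0, SEEK_END);
  void *p = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  return p == MAP_FAILED ? nullptr : static_cast<const Header *>(p);
}
```

The caveat is that this only works for fixed-layout, pointer-free data with stable endianness and alignment; anything else still needs a deserialization pass.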
Please note that
madvise() isn't totally without merit.
Even if we reduce startup latency to zero, that just means we'll get hit
with page faults on an as-needed basis. This is a good thing, since it
means our app can focus on the matrix multiplication ops needed to
generate tokens, which would leave the hard disk under-utilized. In that case,
madvise() is how we'd ask the kernel to exploit those spare
resources by optimistically reading pages in the background, thereby
ensuring that fewer major page faults occur. I've been invited to
collaborate on the llama.cpp project so I can help ensure these changes
happen. Very soon, your artificial general intelligence will be pleasant
enough to use in shell scripts, at which point I'll write a follow-up
blog post with benchmark numbers.
Funding for Justine's projects comes from Mozilla's MIECO program (which funds innovators who are advancing the future of the Internet) as well as her GitHub sponsors and Patreon subscribers. Your support is what makes this work possible. Thank you so much!