Mar 9th, 2023 @ justine's web page
I wrote a program called rusage.com which provides the best possible way to report resource usage statistics when launching command line programs. It provides a much richer set of information than the traditional UNIX time command:
$ time /bin/echo hi >/dev/null
real    0m0.001s
user    0m0.001s
sys     0m0.000s
Above we see that time isn't very good, since it only reports CPU usage at millisecond granularity. On the other hand, rusage.com gives you the full power of the underlying system calls (wait4(2) and getrusage(2)), which provide a richer set of higher-precision information.
$ rusage.com /bin/echo hi
hi
RL: took 551µs wall time
RL: ballooned to 1,472kb in size
RL: needed 478µs cpu (0% kernel)
RL: caused 89 page faults (100% memcpy)
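For context, here's roughly what collecting those numbers looks like at the system-call level. This is a hedged sketch of the general technique, not the actual rusage.com source:

#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char *argv[]) {
  if (argc < 2) return 1;
  pid_t pid = fork();
  if (pid == 0) {
    execvp(argv[1], &argv[1]);    // run the command being measured
    _exit(127);                   // only reached if exec fails
  }
  int status;
  struct rusage ru;
  wait4(pid, &status, 0, &ru);    // like waitpid(), but also fills in rusage
  long cpu_us = (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000L +
                ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
  // ru_maxrss is in kilobytes on Linux; minor/major faults map onto the
  // "memcpy" vs. disk-backed fault split that the RL: lines report.
  printf("cpu: %ldµs  peak rss: %ldkb  minor faults: %ld  major faults: %ld\n",
         cpu_us, ru.ru_maxrss, ru.ru_minflt, ru.ru_majflt);
  return 0;
}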
rusage.com is distributed as a single-file, dependency-free executable that you can use on any AMD64 operating system, as well as on ARM64 platforms like Apple M1 and Raspberry Pi.
rusage.com (332kb) - PE+ELF+MachO+ZIP+SH + examples (debug data)
designed for linux/macos/windows/freebsd/netbsd/openbsd on amd64/arm64
sha256: 9d6c8b52d352d071ed78c9c79571eb62a372873298011e5f7e9693ba964bf184
Here's how easy it is to get started:
curl https://justine.lol/rusage/rusage.com >rusage.com
chmod +x rusage.com
./rusage.com COMMAND
The source code for rusage.com is part of the Cosmopolitan Libc monolithic repository. You may build it yourself on Linux by running:
git clone https://github.com/jart/cosmopolitan cosmo
cd cosmo
make -j8 o//examples/rusage.com
The usefulness of this tool becomes apparent when running beefier programs, like an artificial general intelligence model with thirteen billion parameters. Here's what happens if we use llama.cpp to load Facebook's 13B LLaMA model. The code is currently implemented so that it uses std::ifstream to trawl through a 12gb file of floating point values, so they can be deserialized off spinning disk and loaded into C++ tensor data structures.
In order to simulate what would happen the very first time a user runs the model after booting up, we'll be using a hack to drop the page caches, which reveals the true cost of a program (noting that sudo is tired and doas is wired).
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
We'll now measure the classic, portable, "just stdio" way of loading the model:
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 75,260,133µs wall time
RL: ballooned to 8,272,800kb in size
RL: needed 29,599,572µs cpu (15% kernel)
RL: caused 6,784 page faults (100% memcpy)
RL: 62,178 context switches (99% consensual)
RL: performed 15,895,192 reads and 0 write i/o operations
Wow, it took 75 seconds to generate a single token. That's what makes this problem so important. Notice, however, how few page faults occur, and that they're all minor faults, which the tool reports as "100% memcpy". Minor faults don't actually hit disk; they're just transitions of virtual memory to resident memory.
The most powerful system call of them all is mmap(), which we all know can solve this problem, even if we don't know how. So let's say we created our own drop-in replacement for std::ifstream, which is what the llama.cpp codebase currently uses, and had it implement the i/o using mmap(MAP_SHARED) instead. Our goal here would be to avoid the overhead of making explicit read() system calls.
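To make the comparison concrete, here's a minimal sketch of what such a drop-in loader might do under the hood. This is an assumption about the approach, not the actual llama.cpp change; the class name and error handling are invented:

// Sketch: map the model file read-only with MAP_SHARED, so the kernel's page
// cache backs the memory directly instead of copying bytes into userspace
// buffers via read().
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>

struct MappedFile {
  void *data = nullptr;
  size_t size = 0;

  explicit MappedFile(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) throw std::runtime_error("open failed");
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); throw std::runtime_error("fstat failed"); }
    size = st.st_size;
    data = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (data == MAP_FAILED) throw std::runtime_error("mmap failed");
  }

  ~MappedFile() {
    if (data && data != MAP_FAILED) munmap(data, size);
  }
};

Instead of looping over ifstream::read() calls, the loader can then read straight out of that mapping. With i/o implemented along those lines, the same cold-cache experiment looks like this: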
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,060,148µs wall time
RL: ballooned to 16,219,688kb in size
RL: needed 28,200,083µs cpu (12% kernel)
RL: caused 192,671 page faults (67% memcpy)
RL: 62,188 context switches (99% consensual)
RL: performed 15,896,968 reads and 0 write i/o operations
Here we see that using mmap() instead of read() actually goes 1% slower in the first-run scenario. That's because hard disks are still slow, and we're not even cutting out any system call overhead; an i/o system call that happens magically as a page fault is still a system call.
Let's see what happens if we use mmap(MAP_SHARED) with madvise(MADV_SEQUENTIAL):
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,228,287µs wall time
RL: ballooned to 16,219,800kb in size
RL: needed 28,083,874µs cpu (12% kernel)
RL: caused 161,665 page faults (61% memcpy)
RL: 62,189 context switches (99% consensual)
RL: performed 15,896,992 reads and 0 write i/o operations
Here we see that madvise() resulted in fewer page faults, since the kernel was likely performing some readahead in the background. However, it didn't do much to improve the latency of this operation.
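For reference, the advice itself is a single extra call on the mapping; a hedged sketch (the function and parameter names here are mine, not llama.cpp's):

#include <sys/mman.h>
#include <cstddef>

// Hint that the mapped model file will be read from start to finish, so the
// kernel may read ahead aggressively and may drop pages soon after they're
// used. Swapping in MADV_WILLNEED instead hints that the whole range will be
// wanted soon. Either way, the advice is best-effort, never mandatory.
static void advise_sequential(void *data, std::size_t size) {
  madvise(data, size, MADV_SEQUENTIAL);
}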
How about we try mmap(MAP_SHARED) with madvise(MADV_WILLNEED):
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,137,186µs wall time
RL: ballooned to 16,219,548kb in size
RL: needed 28,084,624µs cpu (13% kernel)
RL: caused 193,166 page faults (67% memcpy)
RL: 62,179 context switches (99% consensual)
RL: performed 15,897,472 reads and 0 write i/o operations
Once again, we see madvise() doesn't do much for our use case (which we'll question in a later section). That's because any advantage you could hope to gain in terms of i/o performance here is almost certainly implemented by default. Kernel developers usually don't make high performance an opt-in feature, since they don't want to get roasted in the media at benchmark games.
Where advice does make sense is in batch processing jobs, when you want to ask the kernel to be more conscientious and cool its heels. For example, imagine you're logged into a multi-user mainframe like SDF. One of the reasons they use NetBSD is because the Linux Kernel would salt the earth if it'd make a single program faster. One user running just one program shouldn't dominate the page caches in such a way that everything other people are doing slows down. Alternatively, imagine that you're running a nightly cron job to process huge files, a single time. In that case you should use posix_fadvise(POSIX_FADV_SEQUENTIAL) instead of mmap + madvise.
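A sketch of that batch-job case (the path here is made up for illustration, and pairing it with POSIX_FADV_DONTNEED afterwards is my own addition, not something the text prescribes):

#include <fcntl.h>
#include <unistd.h>

int main() {
  // Hypothetical one-shot nightly job reading a huge file exactly once.
  int fd = open("/var/data/huge-nightly-dump.bin", O_RDONLY);
  if (fd == -1) return 1;
  // Tell the kernel we'll read sequentially, so it needn't keep our pages
  // cached at other users' expense. A len of 0 covers through end of file.
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
  // ... stream through the file with ordinary read() calls ...
  posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  // then drop our cached pages
  close(fd);
  return 0;
}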
Now let's use rusage.com to analyze the average case, where the command is run multiple times and the hard disk overhead goes away. Here's what happens with classic i/o and a warm disk cache:
RL: took 4,027,307µs wall time
RL: ballooned to 8,272,344kb in size
RL: needed 26,135,090µs cpu (6% kernel)
RL: caused 6,647 page faults (100% memcpy)
RL: 62 context switches (29% consensual)
Now let's try using mmap(MAP_SHARED) with a warm cache:
RL: took 3,326,692µs wall time
RL: ballooned to 16,219,332kb in size
RL: needed 24,780,516µs cpu (2% kernel)
RL: caused 130,365 page faults (100% memcpy)
RL: 59 context switches (30% consensual)
Here we see mmap() can make our loading code 18% faster than using a classic i/o approach of explicit read().
Let's try to make it even faster. Here's mmap(MAP_SHARED) and madvise(MADV_WILLNEED) with a warm cache:
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 3,498,649µs wall time
RL: ballooned to 16,220,136kb in size
RL: needed 25,158,517µs cpu (3% kernel)
RL: caused 130,624 page faults (100% memcpy)
RL: 80 context switches (31% consensual)
Once again, we see that for this particular use case, madvise() doesn't do much. Neither would huge 2mb pages, although it's not worth going into further details on how the Linux Kernel implements that monstrosity.
The rusage.com command has helped us determine exactly how much we'd stand to gain if we switched from a read() to an mmap() i/o model, which offers an average latency reduction of 18%.
However, it's worth noting that we asked rusage the wrong question. As great a tool as it may be, it unfortunately can't illuminate the things we fail to consider. In this case, the true value of mmap() is that it lets us load data structures into memory without having to touch them at all. There's no reason for llama.cpp to scan through, deserialize, and construct everything at startup, when it could simply map already-constructed data. In that case, we wouldn't need rusage.com to tell us that we'd gain a 100% reduction in startup latency. This is particularly nice if the data structures are sparsely used.
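To illustrate the idea (everything here is hypothetical: the on-disk layout, struct names, and alignment assumptions are invented, not the real llama.cpp/GGML format):

#include <cstdint>

// Invented layout: the file is written in exactly the shape the program wants
// at runtime, so a header just records where each tensor's raw floats live.
struct TensorHeader {
  uint64_t rows, cols;
  uint64_t data_offset;  // byte offset of the float data within the file
};

struct TensorView {
  uint64_t rows, cols;
  const float *data;     // points straight into the mmap()'d file
};

// No parse loop, no deserialization, no copying: the page cache becomes the
// tensor storage, and pages fault in only when the math actually touches them.
static TensorView view_tensor(const char *mapped_base, const TensorHeader &h) {
  return TensorView{h.rows, h.cols,
                    reinterpret_cast<const float *>(mapped_base + h.data_offset)};
}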
Please note that madvise() isn't totally without merit. Even if we reduce startup latency to zero, that just means we'll get hit with page faults on an as-needed basis. This is a good thing, since it means our app can focus on the matrix multiplication ops needed to generate tokens, but it would leave the hard disk under-utilized. In that case, madvise() is how we'd ask the kernel to exploit those spare resources by optimistically reading pages in the background, thereby ensuring that fewer major page faults occur. I've been invited to collaborate on the llama.cpp project so I can help ensure these changes happen. Very soon, your artificial general intelligence will be pleasant enough to use in shell scripts, at which point I'll write a follow-up blog post with benchmark numbers.
Funding for Justine's projects comes from Mozilla's MIECO program (which funds innovators who are advancing the future of the Internet) as well as her GitHub sponsors and Patreon subscribers. Your support is what makes this work possible. Thank you so much!