Mar 9th, 2023 @ justine's web page
I wrote a program called
rusage.com which provides the best
possible way to report resource usage statistics when launching command
line programs. It provides a much richer set of information than the
time command built into most shells:

$ time /bin/echo hi >/dev/null
real    0m0.001s
user    0m0.001s
sys     0m0.000s
Above we see that
time isn't very good, since it only reports
CPU usage at millisecond granularity. On the other hand,
rusage.com gives you the full power of the underlying
system calls, which
provide a richer set of higher-precision information.
$ rusage.com /bin/echo hi
hi
RL: took 551µs wall time
RL: ballooned to 1,472kb in size
RL: needed 478µs cpu (0% kernel)
RL: caused 89 page faults (100% memcpy)
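Numbers like these come straight from the kernel's per-process accounting. Here's a hedged sketch (not rusage.com's actual source) of the heart of such a tool: wait4(), which reaps the child and hands back the kernel's struct rusage for it.

```cpp
// Sketch of how a tool like rusage.com can gather its numbers:
// fork, exec the target command, then read the child's struct rusage
// out of wait4(), home of fields like ru_maxrss and ru_minflt.
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int run_and_report(const char *path, char *const argv[]) {
  pid_t pid = fork();
  if (pid == 0) {   // child: become the measured command
    execv(path, argv);
    _exit(127);     // only reached if exec failed
  }
  int status = 0;
  struct rusage ru = {};
  wait4(pid, &status, 0, &ru);  // like waitpid(), but also fills in rusage
  std::printf("RL: ballooned to %ldkb in size\n", ru.ru_maxrss);
  std::printf("RL: caused %ld page faults (%ld major)\n",
              ru.ru_minflt + ru.ru_majflt, ru.ru_majflt);
  std::printf("RL: needed %ldµs user cpu\n",
              ru.ru_utime.tv_sec * 1000000L + ru.ru_utime.tv_usec);
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that wait4() is a BSD/Linux extension; the strictly portable route is waitpid() followed by getrusage(RUSAGE_CHILDREN).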
rusage.com is distributed as a single-file dependency-free
executable that you can use on any AMD64 operating system, as well as
ARM64 platforms like Apple M1 and Raspberry Pi.
It's a 332kb download: a PE+ELF+MachO+ZIP+SH polyglot executable, plus examples and debug data,
designed for Linux/macOS/Windows/FreeBSD/NetBSD/OpenBSD on AMD64/ARM64.
Here's how easy it is to get started:
curl https://justine.lol/rusage/rusage.com >rusage.com
chmod +x rusage.com
./rusage.com COMMAND
The source code for
rusage.com is part of the Cosmopolitan
Libc monolithic repository.
You may build it yourself on Linux by running:
git clone https://github.com/jart/cosmopolitan cosmo
cd cosmo
make -j8 o//examples/rusage.com
The usefulness of this tool becomes apparent when running beefier
programs, like an artificial general intelligence model with thirteen
billion parameters. Here's what happens if we use
llama.cpp to load
Facebook's 13B LLaMA model. The code is currently implemented so that it uses
std::ifstream to trawl through a 12gb file of floating
point values, so they can be deserialized off spinning disk and loaded
into C++ tensor data structures.
In order to simulate what would happen the very first time a user runs
the model after booting up, we'll be using a hack to drop page caches,
which reveals the true cost of a program (noting that sudo
is tired and
doas is wired).
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
We'll now measure the classic, portable, "just stdio" way of loading the model:
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 75,260,133µs wall time
RL: ballooned to 8,272,800kb in size
RL: needed 29,599,572µs cpu (15% kernel)
RL: caused 6,784 page faults (100% memcpy)
RL: 62,178 context switches (99% consensual)
RL: performed 15,895,192 reads and 0 write i/o operations
Wow, it took 75 seconds to generate a single token. That's what makes this problem so important. Notice, however, how few page faults occur, and that they're all minor faults, which the tool reports as "100% memcpy". Minor faults never actually hit disk; they're just transitions of virtual memory to resident memory.
The most powerful system call of them all is mmap(), which
we all know can solve this problem, even if we don't know how. So let's
say we created our own drop-in replacement for
std::ifstream, which is what the llama.cpp codebase
currently uses, and we had it implement the i/o using
mmap(MAP_SHARED) instead. Our goal here would be to avoid
the overhead of making explicit
read() system calls.
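To make the idea concrete, here's a minimal sketch, assuming POSIX and nothing about llama.cpp's internals, of the kind of whole-file mapping such an ifstream replacement would hand out:

```cpp
// Illustrative sketch (not the actual llama.cpp patch): map an entire
// file read-only with MAP_SHARED, so the ifstream-style wrapper can
// serve "reads" as plain pointer arithmetic over the mapping.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

struct MappedFile {
  const char *data = nullptr;
  size_t size = 0;
};

// Maps `path` read-only. Returns {nullptr, 0} on failure.
MappedFile map_file(const char *path) {
  MappedFile mf;
  int fd = open(path, O_RDONLY);
  if (fd == -1) return mf;
  struct stat st;
  if (fstat(fd, &st) == 0 && st.st_size > 0) {
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
      mf.data = static_cast<const char *>(p);
      mf.size = st.st_size;
    }
  }
  close(fd);  // the mapping stays valid after the fd is closed
  return mf;
}
```

Choosing MAP_SHARED over MAP_PRIVATE here means the mapping is backed directly by the page cache, so multiple processes loading the same file share one copy of those pages.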
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,060,148µs wall time
RL: ballooned to 16,219,688kb in size
RL: needed 28,200,083µs cpu (12% kernel)
RL: caused 192,671 page faults (67% memcpy)
RL: 62,188 context switches (99% consensual)
RL: performed 15,896,968 reads and 0 write i/o operations
Here we see that using
mmap() instead of read() actually goes 1% slower in the first-run
scenario. That's because hard disks are still slow, and we're not even
cutting out any system call overhead; an i/o system call that happens
magically as a page fault is still a system call.
Let's see what happens if we also hint the kernel with madvise():
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,228,287µs wall time
RL: ballooned to 16,219,800kb in size
RL: needed 28,083,874µs cpu (12% kernel)
RL: caused 161,665 page faults (61% memcpy)
RL: 62,189 context switches (99% consensual)
RL: performed 15,896,992 reads and 0 write i/o operations
Here we see that
madvise() caused there to be fewer page
faults, since the kernel was likely performing some readahead in the
background. However, it didn't do much to improve the latency of this cold-start run.
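Wiring such a hint into the code is a one-liner on top of an existing mapping. The helper and flag choices below are illustrative assumptions, not necessarily the exact call used in these experiments:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Illustrative helper: tell the kernel we'll read this mapped region
// soon and in order, so it can schedule readahead in the background.
// Returns 0 if the kernel accepted both hints.
int hint_readahead(void *addr, size_t len) {
  if (madvise(addr, len, MADV_SEQUENTIAL) != 0) return -1;
  return madvise(addr, len, MADV_WILLNEED);
}
```

Both flags are advisory: the kernel is free to ignore them, which is part of why they often change so little in practice.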
How about we try madvise(MADV_WILLNEED)?
$ doas sh -c 'echo -n 3 >/proc/sys/vm/drop_caches'
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 76,137,186µs wall time
RL: ballooned to 16,219,548kb in size
RL: needed 28,084,624µs cpu (13% kernel)
RL: caused 193,166 page faults (67% memcpy)
RL: 62,179 context switches (99% consensual)
RL: performed 15,897,472 reads and 0 write i/o operations
Once again, we see
madvise() doesn't do much for our use
case (which we'll question in a later section). That's because any
advantage you could hope to gain in terms of i/o performance here is
almost certainly already implemented by default. Kernel developers usually don't
make high performance an opt-in feature, since they don't want to get
roasted in the media at benchmark games.
Where advice does make sense is in batch-processing jobs, where you might want to
ask the kernel to be more conscientious and cool its heels. For example,
imagine you're logged into a multi-user mainframe like SDF. One of the
reasons they use NetBSD is because the Linux Kernel would salt the earth
if it'd make a single program faster. One user running just one program
shouldn't dominate the page caches in such a way that everything other
people are doing slows down. Alternatively, imagine that you're running
a nightly cron job to process huge files, a single time. In that case
you should use the MADV_DONTNEED flavor
of mmap + madvise, so pages you've finished with can be reclaimed right away.
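Here's a hedged sketch of that considerate single-pass pattern (a hypothetical helper; note that MADV_DONTNEED behaves differently for file-backed versus anonymous memory):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper for a one-pass batch job: once a chunk has been
// processed, tell the kernel those pages won't be needed again, so it
// can reclaim them instead of letting one cron job crowd the caches.
int done_with_chunk(void *addr, size_t len) {
  return madvise(addr, len, MADV_DONTNEED);
}
```

For anonymous private memory on Linux, MADV_DONTNEED drops the pages outright, and any later access sees fresh zero-filled pages.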
Now let's use
rusage.com to analyze the average case, where
the command is run multiple times, in which case the hard disk overhead
goes away. Here's what happens with classic i/o with a warm disk cache:
RL: took 4,027,307µs wall time
RL: ballooned to 8,272,344kb in size
RL: needed 26,135,090µs cpu (6% kernel)
RL: caused 6,647 page faults (100% memcpy)
RL: 62 context switches (29% consensual)
Now let's try using
mmap(MAP_SHARED) with a warm cache:
RL: took 3,326,692µs wall time
RL: ballooned to 16,219,332kb in size
RL: needed 24,780,516µs cpu (2% kernel)
RL: caused 130,365 page faults (100% memcpy)
RL: 59 context switches (30% consensual)
Here we see
mmap() can make our loading code 18% faster
than using a classic i/o approach with explicit read() calls.
Let's try to make it even faster. Here's
madvise(MADV_WILLNEED) with a warm cache:
$ ./rusage.com ./llama.cpp -m ./models/13B/ggml-model-q4_0.bin -t 12 -n 1 -f ~/prompt.txt
RL: took 3,498,649µs wall time
RL: ballooned to 16,220,136kb in size
RL: needed 25,158,517µs cpu (3% kernel)
RL: caused 130,624 page faults (100% memcpy)
RL: 80 context switches (31% consensual)
Once again, we see that for this particular use case,
madvise() doesn't do much. Neither would huge 2mb pages,
although it's not worth going into further details on how the Linux
Kernel implements that monstrosity.
The rusage.com command has helped us determine exactly how
much we'd stand to gain if we switched from a read()-based i/o model to an
mmap() i/o model: an average latency
reduction of 18%.
However, it's worth noting that we asked rusage the stupid question. As
great a tool as it may be, it unfortunately can't illuminate the things
we fail to consider. In this case, the true value of mmap()
is that it lets us load data structures into memory without having to
touch them at all. There's no reason for llama.cpp to scan through,
deserialize, and construct everything at startup, when it could simply
map already-constructed data. In that case, we wouldn't need
rusage.com to tell us that we'd gain a 100% reduction
in startup latency. This is particularly nice when the data structures are huge.
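To make that concrete, here's a hedged sketch of the zero-copy idea, using an invented toy format (not ggml's actual layout): if the file's bytes already are the in-memory representation, "loading" is just a map and a pointer cast.

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

// Toy on-disk format, invented for illustration: a fixed plain-old-data
// header followed immediately by raw float weights.
struct Header {
  uint32_t magic;     // identifies the file format
  uint32_t n_params;  // number of floats that follow
};

// "Loading" is just mapping the file and casting the pointer; there's
// no scanning, parsing, or copying pass at startup at all.
const Header *load_model(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd == -1) return nullptr;
  off_t size = lseek(fd, 0, SEEK_END);
  void *p = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  return p == MAP_FAILED ? nullptr : static_cast<const Header *>(p);
}
```

The caveat is that this only works for fixed-layout, pointer-free data with stable endianness and alignment; anything else still needs a deserialization pass.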
Please note that
madvise() isn't totally without merit.
Even if we reduce startup latency to zero, that just means we'll get hit
with page faults on an as-needed basis. This is a good thing, since it
means our app can focus on the matrix multiplication ops needed to
generate tokens, which would leave the hard disk under-utilized. In that case,
madvise() is how we'd ask the kernel to exploit those spare
resources by optimistically reading pages in the background, thereby
ensuring that fewer major page faults occur. I've been invited to
collaborate on the llama.cpp project so I can help ensure these changes
happen. Very soon, your artificial general intelligence will be pleasant
enough to use in shell scripts, at which point I'll write a follow-up
blog post with benchmark numbers.
Funding for Justine's projects comes from Mozilla's MIECO program (which funds innovators who are advancing the future of the Internet) as well as her GitHub sponsors and Patreon subscribers. Your support is what makes this work possible. Thank you so much!