April 5th, 2023 @ justine's web page
When Meta released LLaMA back in February, many of us were excited to see a high-quality Large Language Model (LLM) become available for public access. Many of us who signed up, however, had difficulties getting LLaMA to run on our edge and personal computer devices. One month ago, Georgi Gerganov started the llama.cpp project to provide a solution to this, and since then his project has been one of the hottest things on GitHub, having earned itself 19k stars. I spent the last few weeks volunteering for this project, and I've got some great news to share about its recent progress.
We modified llama.cpp to load weights using mmap() instead of C++ standard I/O. That enabled us to load LLaMA 100x faster using half as much memory. Our changes have just been made available in the latest release. The benefits are as follows:
mmap() lets us map the read-only weights using MAP_SHARED, which is the same technique that's traditionally been used for loading executable software. So we figured, why aren't we using it to load neural network software too? Now we can.
mmap() also avoids the need to copy pages. Copying pages is bad, because you don't want copied memory to compete with the kernel file cache. When too much copied memory gets created, the kernel reacts by evicting cache entries, which means LLaMA will load slowly from disk each time. Since we reduced memory requirements, users have been sharing wonderful stories, like running LLaMA-13B on an old Android phone. For PCs with 32GB of RAM, you should be able to comfortably run LLaMA-30B, since it's 20GB with 4-bit quantized weights.
Finally, a cold first load still has to read the weights off disk, which takes about as long as the cat command would. However, if your use case requires frequently restarting inference for reasons of context or quality, then you'll now have a quicker road to recovery. There is however a catch: after your weights file instantly loads, you still need to wait for your prompt to load. That's something you can expect to see addressed soon.
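For readers who haven't used the system call before, here's a minimal sketch of what mapping read-only weights with MAP_SHARED looks like. The file name and the pared-down error handling are my own illustrative assumptions; this is not the actual llama.cpp loader:

// Minimal sketch: map a weights file read-only with MAP_SHARED, the same
// way a dynamic linker maps an executable. Illustrative only.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);   // assumed file name
    if (fd == -1) { std::perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    // Pages are backed by the kernel file cache rather than copied, so a
    // second run (or a second process) reuses the same physical memory.
    void *weights = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { std::perror("mmap"); return 1; }
    std::printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);
    munmap(weights, st.st_size);
    close(fd);
}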
One of the reasons llama.cpp attracted so much attention is that it lowers the barrier to entry for running large language models. That's great for making the benefits of these models more widely accessible to the public. It's also helping businesses save on costs. Thanks to mmap(), we're much closer to both these goals than we were before. Furthermore, the reduction of user-visible latency has made the tool more pleasant to use.
The new mmap()-based loader is now available in the llama.cpp project, which is released under the MIT license on GitHub in both source and binary forms:
https://github.com/ggerganov/llama.cpp
Existing users will need to convert their GGML weights to the new file format:
less migrate-ggml-2023-03-30-pr613.py             # view manual
python migrate-ggml-2023-03-30-pr613.py SRC DST   # run tool
New users should request access from Meta and read Simon Willison's blog post for an explanation of how to get started. Please note that, with our recent changes, some of the steps in his 13B tutorial relating to multiple .1, etc. files can now be skipped. That's because our conversion tools now turn multi-part weights into a single file.
When the llama.cpp project received feedback that we should be using mmap(), the first idea that came to mind was to find a way to make it work within the confines of our C++ library abstractions. @apaz-cli was the person who got the ball rolling on this. The basic idea we tried was to see how much better mmap() could make the loading of weights, if we wrote a new implementation of std::ifstream. This meant that, rather than having the underlying I/O implementation call read(), it would instead use mmap() from the constructor, and then the our_ifstream::read() function would just do a memcpy() under the hood.
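To make the idea concrete, here's a hedged sketch of such a wrapper. Only the name our_ifstream comes from the text above; the interface details and the missing error handling are my own assumptions, not the actual patch:

// Sketch of an ifstream-like wrapper: the constructor maps the file, and
// read() degenerates into a memcpy() out of the mapping.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

class our_ifstream {
    char *map_ = nullptr;
    size_t size_ = 0;
    size_t pos_ = 0;
public:
    explicit our_ifstream(const char *path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        size_ = st.st_size;
        map_ = (char *)mmap(nullptr, size_, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping remains valid after close()
    }
    // Same shape as std::istream::read(), but no read() system call happens;
    // the bytes are copied straight out of the mapping.
    our_ifstream &read(char *dst, size_t n) {
        std::memcpy(dst, map_ + pos_, n);
        pos_ += n;
        return *this;
    }
    ~our_ifstream() { if (map_) munmap(map_, size_); }
};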
We determined that this would improve load latency by 18%. This was a big deal, since it's user-visible latency. However, it turned out we were measuring the wrong thing. Please note that I say "wrong" in the best possible way; being wrong makes an important contribution to knowing what's right. I don't think I've ever seen a high-level library that's able to do what mmap() does, because it defies attempts at abstraction. After comparing our solution to dynamic linker implementations, it became obvious that the true value of mmap() was in not needing to copy the memory at all. The weights are just a bunch of floating point numbers on disk. At runtime, they're just a bunch of floats in memory. So what mmap() does is simply make the weights on disk available at whatever memory address we want. We just have to ensure that the layout on disk is the same as the layout in memory.
After going back to the drawing board, the tricky thing was that the C++ loading process appeared to reshape the tensors after reading them. If we added printf statements to the old loading code, we'd get results like:
moving 0x640 bytes from offset 0x4a607 to offset 0 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4ac47 to offset 0xc80 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4b287 to offset 0x1900 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4b8c7 to offset 0x2580 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4bf07 to offset 0x3200 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4c547 to offset 0x3e80 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4cb87 to offset 0x4b00 (n_dims=2 n_parts=2)
moving 0x640 bytes from offset 0x4d1c7 to offset 0x5780 (n_dims=2 n_parts=2)
... and so forth, for another 200k+ lines
There were also a number of C++ STL containers that got populated with information during the loading process. It became clear that, in order to have a mappable file whose memory layout was the same as what evaluation wanted at runtime, we'd need to not only create a new file, but also serialize those STL data structures too. The only way around it would have been to redesign the file format, rewrite all our conversion tools, and ask our users to migrate their model files. We'd already earned an 18% gain, so why give that up to go so much further, when we didn't even know for certain the new file format would work?
I ended up writing a quick and dirty hack to show that it would work. I used a C library override trick where I started with code like this:
int main(int argc, char **argv) {
    gpt_vocab vocab;
    llama_model model;
    llama_model_load(model, vocab);
    for (;;) {
        llama_eval(model, vocab);
    }
}
Then I modified the code above to avoid using the stack or static memory, and instead rely on the heap. On platforms like Linux, I was able to easily override the libc allocators by doing something like this:
struct magic *mag;

int main(int argc, char **argv) {
    gpt_vocab *vocab;
    llama_model *model;
    long len = 100l * 1024 * 1024 * 1024;
    int fd = open("magic.dat", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, len);
    mag = (struct magic *)mmap((void *)0x330000000000, len,
                               PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_FIXED, fd, 0);
    if (!mag->vocab) {
        // first run: load the model, then persist the object graph
        vocab = new gpt_vocab;
        model = new llama_model;
        llama_model_load(*model, *vocab);
        msync((void *)0x330000000000, len, MS_ASYNC);
        mag->model = model;
        mag->vocab = vocab;
    } else {
        // subsequent runs: everything is already in the mapped file
        vocab = mag->vocab;
        model = mag->model;
    }
    for (;;) {
        llama_eval(*model, *vocab);
    }
}
void *memalign(size_t a, size_t n) {
    void *p;
    if (n < 1) n = 1;
    if (a < 16) a = 16;
    while (a & (a - 1)) ++a;           // bump alignment up to a power of two
    // p = next aligned chunk carved out of the *mag region (pseudocode)
    ((size_t *)p)[-1] = n;             // stash the size just below the pointer
    return p;
}

void *malloc(size_t n) {
    return memalign(16, n);
}

void *calloc(size_t n, size_t z) {
    void *p;
    if ((p = malloc((n *= z)))) memset(p, 0, n);
    return p;
}

void *realloc(void *p, size_t n) {
    void *q;
    if (!p) return malloc(n);
    if (!n) { free(p); return 0; }
    if ((q = malloc(n))) memcpy(q, p, ((size_t *)p)[-1]);
    return q;
}

void free(void *p) {}
Pseudo-C++ adapted from 5b8023d935401072b73b63ea995aaae040d57b87
The cool thing about the C library is that just about everything depends on it. If you override functions like malloc() on platforms like Linux, then all the languages and tools downstream of C (e.g. C++) will use it too. So the code above not only captures the GGML library's use of malloc(), but also the STL vectors and maps that were being created too. The only thing I had to do was make sure the stack-allocated memory got placed on the heap, which was basically just the model and vocab objects. The pointers to those of course needed to be stored in the magically mapped region, so that upon the process loading a second time, it'd have access to the root of the object graph.
This hack is how I made the case that loading could in fact be instantaneous. I didn't need to know much about the implementation details of the loader. I just redefined the heap so that it was a memory mapped file rather than the anonymous mapping it would use normally. Please note the above code does not follow any best practices. I think my code even deserves the honor of being called an abomination, which makes it the very best kind of experimental code. The correct and proper way of doing things is obviously to change the file format. But that would take 10x more effort. Now we knew for sure that it was worth doing. So the code you see above was eventually tossed away, so we could focus on the file format.
About a week later, the first code we ended up putting in the main branch that calls the mmap() function was Slaren's change. This might surprise some of the people who've been following my work. Managers and celebrities are usually the ones who get all the kudos. The tech industry isn't used to having its key collaborators on landmark technical achievements be anonymous people from 4chan, but that's exactly what happened here. While bringing the benefits of mmap() was a team effort, you could say that @Slaren was the person who added mmap() support. He did that by pointing out something very smart, which is that the 7B model only had 1-dimensional tensors, and as a result, didn't need to be unsharded, and therefore required no file format change. So he wrote the code and updated the project to map the file. Then he changed the loader so that it simply assigns a pointer to tensor->data instead of calling read() whenever the tensor is 1-d. In doing this, Slaren showed us that it was possible to bring the benefits of instant load times to LLaMA 7B users immediately.
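Here's a hedged sketch of what that pointer assignment amounts to. The names mmap_base, file_offset, fake_tensor, and attach_tensor are illustrative assumptions rather than llama.cpp's actual identifiers:

// Instead of read()ing bytes into a freshly allocated buffer, point the
// tensor straight into the mapping, assuming the on-disk layout already
// matches what evaluation expects. Illustrative only.
#include <cstddef>
#include <cstdint>

struct fake_tensor {     // stand-in for the relevant field of a ggml tensor
    void *data;
};

static void attach_tensor(fake_tensor *t, void *mmap_base, size_t file_offset) {
    // zero copy: the tensor's data lives inside the mapped file
    t->data = (uint8_t *)mmap_base + file_offset;
}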
The hardest thing about introducing support for a function like mmap(), though, is figuring out how to get it to work on Windows. I wouldn't be surprised if many of the people who had the same idea in the past, about using mmap() to load machine learning models, ended up not doing it because they were discouraged by Windows not having it. It turns out that Windows has a set of nearly, but not quite, identical functions called CreateFileMapping() and MapViewOfFile(). @oKatanaaa is the person most responsible for helping us figure out how to use them to create a wrapper function. Thanks to him, we were able to delete all of the old standard I/O loader code at the end of the project, because every platform in our support vector was able to be supported by mmap(). That meant we actually had a net negative impact on the number of lines of C++ code! I think coordinated efforts like this are rare, yet really important for maintaining the attractiveness of a project like llama.cpp, which is surprisingly able to do LLM inference using only a few thousand lines of code and zero dependencies. We also had some help from @CoderRC, who had previously designed his own set of POSIX functions for Mingw32 and knew the best technique for mmap feature detection.
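For illustration, here's a hedged sketch of what such a portability wrapper can look like. It is not the project's actual implementation (the real one also handles errors, prefetching, and feature detection), and the function name map_file_readonly is my own assumption:

// Cross-platform read-only file mapping sketch. Illustrative only.
#ifdef _WIN32
#include <windows.h>
static void *map_file_readonly(const char *path, size_t *size) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;
    LARGE_INTEGER len;
    GetFileSizeEx(file, &len);
    *size = (size_t)len.QuadPart;
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    void *addr = mapping ? MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0) : NULL;
    // the view stays valid after these handles are closed
    if (mapping) CloseHandle(mapping);
    CloseHandle(file);
    return addr;
}
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
static void *map_file_readonly(const char *path, size_t *size) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return NULL;
    struct stat st;
    fstat(fd, &st);
    *size = st.st_size;
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after close()
    return addr == MAP_FAILED ? NULL : addr;
}
#endif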
At this point, we'd nailed down mmap() support for 7B, but we were still using the old C++ standard I/O code for the larger models. So the only thing left to do was to change the file format, so that mmap() generalized to all the models we were using. That was the part I was responsible for doing.
In order to do inference, we need to load a few hundred tensors out of .pth files using torch, inside our conversion script. With the 7B model this was relatively simple. We only needed to iterate over the tensors in a single file, and produce a single file of output. The tensors in 7B were perfect already, and fully contiguous.
$ ls -hal models/7B/
-rw-r--r--  1 jart  staff   3.9G Mar 29 17:45 ggml-model-q4_0.bin
The issue was that, for models larger than 7B, the tensors were sharded into multiple files. Under our old way of doing things, we were simply doing a 1:1 copy when converting from .pth to GGML. As a result, the ugliness of loading from multiple files was preserved. Here's what it looked like on disk, for instance, with the LLaMA-65B model:
$ ls -hal models/65B/
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:42 ggml-model-q4_0.bin
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:43 ggml-model-q4_0.bin.1
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:43 ggml-model-q4_0.bin.2
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:44 ggml-model-q4_0.bin.3
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:45 ggml-model-q4_0.bin.4
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:45 ggml-model-q4_0.bin.5
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:46 ggml-model-q4_0.bin.6
-rw-r--r--  1 jart  staff   4.8G Mar 16 13:46 ggml-model-q4_0.bin.7
Each file had the same structure, except the tensor data itself was like interlaced movie frames.
To make matters more challenging, different tensors were split apart in different ways, depending on the name. Some were split across columns, and some were split across rows. mmap() is a powerful system call, but it doesn't let you create overlapping mappings that interleave tensors appropriately. Even if we were willing to use hundreds of thousands of mmap() calls to reassemble the read/write operations in a copyless manner, mmap() has a 4096-byte alignment requirement that is too coarse for the tensors in this format. We had to rewrite the converter tool to put them back together by hand, into a much larger unified file, as an upfront one-time cost.
$ ls -hal models/65B/
-rw-r--r--  1 jart  staff    38G Mar 16 13:42 ggml-model-q4_0.bin2
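To make the row/column distinction concrete, here's a hedged sketch of merging one 2-D float32 tensor that was sharded across several part files. The real converter works on quantized GGML data in Python; the function below only illustrates the two interleaving patterns, with assumed names and no error handling:

// Merge one 2-D float32 tensor sharded across n part files, assuming
// row-major storage, that each part is already positioned at this tensor's
// data, and that rows and cols divide evenly by the number of parts.
#include <cstdio>
#include <vector>

static void merge_tensor(std::FILE *out, std::vector<std::FILE *> &parts,
                         long rows, long cols, bool split_by_rows) {
    long n_parts = (long)parts.size();
    if (split_by_rows) {
        // Parts hold consecutive blocks of rows: plain concatenation.
        std::vector<float> row(cols);
        for (long p = 0; p < n_parts; p++)
            for (long r = 0; r < rows / n_parts; r++) {
                std::fread(row.data(), sizeof(float), cols, parts[p]);
                std::fwrite(row.data(), sizeof(float), cols, out);
            }
    } else {
        // Parts hold cols/n_parts columns of every row: interleave per row.
        std::vector<float> chunk(cols / n_parts);
        for (long r = 0; r < rows; r++)
            for (long p = 0; p < n_parts; p++) {
                std::fread(chunk.data(), sizeof(float), cols / n_parts, parts[p]);
                std::fwrite(chunk.data(), sizeof(float), cols / n_parts, out);
            }
    }
}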
The C++ loader was already doing the necessary conversion, so all I had to do was move that code into the Python conversion script instead. That ensured the same commands people used before would automatically use the new format. Once I patched that, all that remained was writing a migration script. That was important since many people deleted Meta's original .pth files to save hard disk space, and they needed a tool to convert from the old format to the new format. This tool is the script recommended above, called migrate-ggml-2023-03-30-pr613.py. It was relatively straightforward to make, since it follows similar logic to the conversion tool, except in this case I didn't need Torch, pickle, or anything like that. All that was needed was plain old NumPy combined with seek, read, and write system calls. That's nice, since my favorite distro, Alpine, can't even run Torch!
The interesting thing about the seek() function is that operating systems let us seek past the end of a file. That creates a convenient framework for unsharding tensors from multi-part files, since the I/O can be performed by writing tensors to disk in such a way that the tensors have holes, which we then fill in over multiple passes as the remaining shards are processed. Doing that raises interesting questions, of course, about how the file system might allocate blocks in the underlying physical medium. It's something that's not necessarily within our control, but I'd still love to learn more about it. For example, on some file systems I've noticed that, after converting a file, it might load from disk faster if cp is used afterwards to produce a copy.
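Here's a hedged sketch of the seek-past-the-end idea, with an assumed output name and offset. The real migration script does this from Python, but the underlying system-call behavior is the same:

// Write a shard at its final position, even past the current end of file.
// The skipped ranges become holes until a later pass fills them in.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = open("ggml-model-unified.bin", O_RDWR | O_CREAT, 0644);  // assumed name
    if (fd == -1) return 1;
    std::vector<float> shard(1024, 1.0f);          // stand-in shard data

    // Place this shard 1 GiB into the file, even though nothing before it
    // has been written yet; unwritten ranges read back as zeros and needn't
    // occupy disk blocks until another part file fills them.
    off_t final_offset = 1024L * 1024 * 1024;
    pwrite(fd, shard.data(), shard.size() * sizeof(float), final_offset);

    close(fd);
}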
There's one last important benefit to the new file format: it ensures tensors are aligned on a 32-byte boundary. The old file format didn't perform a roundup after writing the model vocabulary to disk. As a result, floats were being mmap()'d to odd addresses half the time, which would trigger UBSAN errors. It also potentially left some meat on the table when it comes to SIMD instructions. Alignment generally isn't a problem on modern microarchitectures of the two major instruction sets. In practice, the only time misalignment is completely forbidden is with semaphores on ARM. However, just because misalignment seems to work doesn't mean it won't consume additional resources under the hood, or cause other problems in sneaky ways. One example would be x86, where misaligned semaphores will seem to work until you have the unlucky chance of your unsigned int overlapping a 64-byte cacheline boundary. For that reason, the new file format takes a more conservative approach, and it may potentially open some doors in the future for certain kinds of optimizations.
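The roundup itself is tiny. Here's a hedged one-liner (the name align_offset is my own, not the converter's actual code) for padding a write offset to a 32-byte boundary before emitting tensor data:

// Round a file offset up to the next 32-byte boundary, so each tensor's
// data lands on an address suitable for aligned SIMD loads once the file
// is mmap()'d (page-aligned mappings preserve alignment within the page).
#include <cstddef>

static size_t align_offset(size_t offset, size_t alignment = 32) {
    return (offset + alignment - 1) & ~(alignment - 1);
}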
For further details, please see 78ca9838ee36660a776e97e3391b6fb5dcaacf7f and ee0c40dd6de8c3c658ae43199939ef40bb1cf408.
Many sources of information on the world wide web that explain how to use mmap() will also insist upon the use of madvise() as though its benefits were established fact. I couldn't measure any evidence that it'd be helpful in our case, since transformer models like LLaMA need to immediately fault every single memory page as soon as the weights are loaded. The madvise() system call is probably only helpful in situations where only a subset of pages is needed for a nontrivial amount of time, during which the disk would otherwise become underutilized.
posix_fadvise(POSIX_FADV_SEQUENTIAL) would be an example of a kind of advice that'd be potentially more helpful to users of LLMs.
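Here's a hedged sketch (assumed file name, not llama.cpp code) of issuing that advice before faulting the weights in:

// Hint to the kernel that the whole file will be read sequentially, so it
// can schedule more aggressive readahead while pages fault in.
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);   // assumed file name
    if (fd != -1) {
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);   // offset 0, len 0 = whole file
        // ... mmap() the file and fault the weights in here ...
        close(fd);
    }
}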
One of the downsides of the Linux cp command is that copying a file larger than RAM will destroy every existing entry in the file cache. Under normal circumstances this is a good thing, since a least-recently-used strategy usually works. However, it can be problematic if you're just organizing your files on a production system where you don't want to disrupt performance. As far as I know, no standard command line utility offers a way to exploit this functionality, so we may provide an example of how to replace the cp command that could address this use case too. Another feature such a command could offer would be the use of copy_file_range(), which enables files to be copied within the same partition 2x faster than the sendfile() technique that's used by standard utilities.