June 12th, 2023 @ justine's web page
A few days ago, DeepMind published a blog post talking about a paper they wrote, where they discovered tinier kernels for sorting algorithms. They did this by taking the deep learning wisdom they gained building AlphaGo and applying it to the discipline of superoptimization. That piqued my interest, since as a C library author, I'm always looking for opportunities to curate the best stuff. In some ways that's really the whole purpose of the C library. There are so many functions that we as programmers take for granted, which are the finished product of decades of research, distilled into plain and portable code.
DeepMind earned a fair amount of well-deserved attention for this discovery, but unfortunately they could have done a much better job explaining it. Let's start with the assembly code they published for sorting an array with three items, translated from pseudo-assembly into assembly:
// move37.S
.equ    P,%rax
.equ    Q,%rcx
.equ    R,%rdx
.equ    S,%rsi
move37: mov     (%rdi),P
        mov     8(%rdi),Q
        mov     16(%rdi),R
        mov     R,S
        cmp     P,R
        cmovg   P,R
        cmovl   P,S
        cmp     S,Q
        cmovg   Q,P
        cmovg   S,Q
        mov     R,(%rdi)
        mov     Q,8(%rdi)
        mov     P,16(%rdi)
        ret
        .type   move37,@function
        .size   move37,.-move37
        .globl  move37
// deepsort1.c
#include <stdio.h>
void move37(long *);
int main() {
  long A[3] = {3, 1, 2};
  move37(A);
  printf("%ld %ld %ld\n", A[0], A[1], A[2]);
}
I named this function move37()
because the DeepMind blog
post compares it to the 37th move AlphaGo made during its second
match
with Lee Sedol back in 2016. That's the move which stunned the
experts who thought AlphaGo had made a mistake, but they were wrong,
because the machine ultimately did achieve
victory
against its opponent, who saw himself as the Hector of humanity. So if
we run the DeepMind code:
# run this in the shell
cc -o deepsort1 deepsort1.c move37.S
./deepsort1
2 1 3
That looks like a mistake to me. The array we gave it was {3, 1, 2} but
move37()
sorted it into {2, 1, 3}. DeepMind must be
trolling us, because I don't believe that 2 comes before 1. Let's take a
look at the
open source contribution they
made to LLVM libcxx, which should hopefully clarify things:
// Ensures that *__x, *__y and *__z are ordered according to the comparator __c,
// under the assumption that *__y and *__z are already ordered.
template <class _Compare, class _RandomAccessIterator>
inline _LIBCPP_HIDE_FROM_ABI void __partially_sorted_swap(
    _RandomAccessIterator __x, _RandomAccessIterator __y,
    _RandomAccessIterator __z, _Compare __c) {
  using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;
  bool __r = __c(*__z, *__x);
  value_type __tmp = __r ? *__z : *__x;
  *__z = __r ? *__x : *__z;
  __r = __c(__tmp, *__y);
  *__x = __r ? *__x : *__y;
  *__y = __r ? *__y : __tmp;
}
Now it makes sense. So move37()
isn't actually a sorting
function. It's a sorting kernel that's intended to be used as
the building block of the sort3()
function. It would have
been nice if the paper and blog post had mentioned that, since it made
me feel quite confused for the briefest of moments. Here's a better
version of the code, which includes the missing swap operation.
sort3:  mov     (%rdi),%rcx
        mov     8(%rdi),%rdx
        mov     16(%rdi),%rsi
        mov     %rdx,%rax
        cmp     %rdx,%rsi
        cmovl   %rsi,%rax
        cmovl   %rdx,%rsi
        mov     %rcx,%rdx
        cmp     %rcx,%rsi
        cmovl   %rsi,%rdx
        cmovl   %rcx,%rsi
        cmp     %rax,%rdx
        cmovge  %rax,%rcx
        cmovl   %rax,%rdx
        mov     %rcx,(%rdi)
        mov     %rdx,8(%rdi)
        mov     %rsi,16(%rdi)
        ret
        .globl  sort3
        .size   sort3,.-sort3
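If you want to sanity check it, here's a tiny driver in the same spirit as deepsort1.c (my own test file, with the assembly above assumed to be saved as sort3.S):

// deepsort2.c
#include <stdio.h>
void sort3(long *);
int main() {
  long A[3] = {3, 1, 2};
  sort3(A);
  printf("%ld %ld %ld\n", A[0], A[1], A[2]);  // should print: 1 2 3
}

cc -o deepsort2 deepsort2.c sort3.S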
To explain why their code is important, let's consider how this
algorithm works at a high level. When I first tried to solve
the sort3()
problem on my own, I came up with this:
// sorts [a,b,c]
if (a > b) SWAP(a, b);
if (a > c) SWAP(a, c);
if (b > c) SWAP(b, c);
Then I looked at libcxx and found out they were doing the same thing. The issue with the code above is that compilers aren't very good at optimizing it. If you try to compile it, you'll notice that your compiler inserts a lot of branch instructions. This is what DeepMind sought to improve upon with their LLVM contribution, because cleverer ways exist to code this sort of thing, even though those techniques tend to be less understandable. I actually like my naive code, since if we squint our eyes a bit, we can see that it shares the same basic idea as DeepMind's state-of-the-art assembly code. That idea is that the problem essentially boils down to just three compare-and-swap operations:
        mov     %rdx,%rax       // create temporary
        cmp     %rdx,%rsi       // compare
        cmovl   %rsi,%rax       // conditional move
        cmovl   %rdx,%rsi       // conditional move
        // repeat thrice
The above code was the previous state of the art for sorting networks.
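In C, that compare-and-swap step is usually written with ternaries rather than if statements, which gives the compiler license to lower it to cmov instead of a branch. Here's a minimal sketch of the pattern (my own helper, and essentially the Sort2 kernel we'll see translated from the libcxx diff later on):

// compare-exchange: sorts the pair [*lo,*hi]
static inline void CompareExchange(long *lo, long *hi) {
  long a = *lo, b = *hi;
  *lo = a < b ? a : b;  // ternaries like these typically become cmov
  *hi = a < b ? b : a;
}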
Now here's where DeepMind's novel discovery came into play. They found
out that sometimes the mov
instruction above is
unnecessary.
sort3:  mov     (%rdi),%rcx
        mov     8(%rdi),%rdx
        mov     16(%rdi),%rsi
        mov     %rdx,%rax
        cmp     %rdx,%rsi
        cmovl   %rsi,%rax
        cmovl   %rdx,%rsi
        mov     %rcx,%rdx
        cmp     %rcx,%rsi
        cmovl   %rsi,%rdx
        cmovl   %rcx,%rsi
        mov     %rdx,%rcx       // <-- wrekt by AlphaDev
        cmp     %rax,%rdx
        cmovge  %rax,%rcx
        cmovl   %rax,%rdx
        mov     %rcx,(%rdi)
        mov     %rdx,8(%rdi)
        mov     %rsi,16(%rdi)
        ret
If you try running the above code, then you'll see it's 100% correct with or without the line AlphaDev deleted. The line looks like it's doing something, but it's actually doing nothing. So it doesn't surprise me that something like this went unnoticed by computer science for decades. It should also be clearer now how AlphaDev works. DeepMind basically built an artificial intelligence that fiddles around with assembly code and deletes stuff at random to see if it breaks. I'm not saying this to dismiss AlphaDev's intelligence, since I'd be lying if I said I wasn't doing the same thing.
sort3:  mov     (%rdi),%rcx
        mov     8(%rdi),%rdx
        mov     16(%rdi),%rsi
        mov     %rdx,%rax       // can it go?
        cmp     %rdx,%rsi
        cmovl   %rsi,%rax
        cmovl   %rdx,%rsi
        mov     %rcx,%rdx       // can it go?
        cmp     %rcx,%rsi
        cmovl   %rsi,%rdx
        cmovl   %rcx,%rsi
        mov     %rdx,%rcx       // <-- wrekt by AlphaDev
        cmp     %rax,%rdx
        cmovge  %rax,%rcx
        cmovl   %rax,%rdx
        mov     %rcx,(%rdi)
        mov     %rdx,8(%rdi)
        mov     %rsi,16(%rdi)
        ret
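In fact, you can answer those questions without even touching an assembler. Below is a toy I wrote (my own sketch; the real AlphaDev plays a reinforcement learning game over actual x86, not this) that encodes the kernel as data for a tiny interpreter, then tries deleting each instruction and re-testing every permutation of three distinct values. If I've transcribed the kernel faithfully, it should report that only instruction 8, the mov %rdx,%rcx, can go:

// toyalpha.c
#include <stdio.h>

enum { MOV, CMP, CMOVL, CMOVGE };
struct Ins { int op, x, y; };  // operands in AT&T order: op %x,%y

// registers: 0=%rcx 1=%rdx 2=%rsi 3=%rax
static const struct Ins kSort3[] = {
    {MOV, 1, 3},    // mov    %rdx,%rax
    {CMP, 1, 2},    // cmp    %rdx,%rsi
    {CMOVL, 2, 3},  // cmovl  %rsi,%rax
    {CMOVL, 1, 2},  // cmovl  %rdx,%rsi
    {MOV, 0, 1},    // mov    %rcx,%rdx
    {CMP, 0, 2},    // cmp    %rcx,%rsi
    {CMOVL, 2, 1},  // cmovl  %rsi,%rdx
    {CMOVL, 0, 2},  // cmovl  %rcx,%rsi
    {MOV, 1, 0},    // mov    %rdx,%rcx   <-- the one AlphaDev wrekt
    {CMP, 3, 1},    // cmp    %rax,%rdx
    {CMOVGE, 3, 0}, // cmovge %rax,%rcx
    {CMOVL, 3, 1},  // cmovl  %rax,%rdx
};
#define COUNT (int)(sizeof(kSort3) / sizeof(kSort3[0]))

// Interprets the kernel with instruction `skip` deleted (-1 keeps all).
// After cmp %x,%y the "l" condition means r[y] < r[x], like the hardware.
static void Run(int skip, long r[4]) {
  long a = 0, b = 0;  // operands of the most recent cmp
  int i;
  for (i = 0; i < COUNT; i++) {
    if (i == skip) continue;
    switch (kSort3[i].op) {
      case MOV:    r[kSort3[i].y] = r[kSort3[i].x]; break;
      case CMP:    a = r[kSort3[i].x], b = r[kSort3[i].y]; break;
      case CMOVL:  if (b < a)  r[kSort3[i].y] = r[kSort3[i].x]; break;
      case CMOVGE: if (b >= a) r[kSort3[i].y] = r[kSort3[i].x]; break;
    }
  }
}

// Runs a variant against every permutation of three distinct values.
static int Works(int skip) {
  static const long kPerms[6][3] = {{1,2,3},{1,3,2},{2,1,3},
                                    {2,3,1},{3,1,2},{3,2,1}};
  int i;
  for (i = 0; i < 6; i++) {
    long r[4] = {kPerms[i][0], kPerms[i][1], kPerms[i][2], 0};
    Run(skip, r);
    if (r[0] != 1 || r[1] != 2 || r[2] != 3) return 0;
  }
  return 1;
}

int main(void) {
  int i;
  for (i = 0; i < COUNT; i++)
    if (Works(i)) printf("instruction %d can go\n", i);
  return 0;
}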
DeepMind also left some meat on the table. There are still two more
mov instructions in this code that we could potentially
shave away. One way to do that is by using the ARM64
instruction set, which yields tinier code for problems like these. Here
we see that we don't need any instructions for creating temporary
variables:
sort3:  ldp     x1,x2,[x0]
        ldr     x3,[x0,16]
        cmp     x2,x3
        csel    x4,x2,x3,le
        csel    x2,x2,x3,ge
        cmp     x2,x1
        csel    x3,x2,x1,le
        csel    x2,x2,x1,ge
        cmp     x4,x3
        csel    x5,x1,x4,gt
        csel    x4,x4,x3,ge
        stp     x5,x4,[x0]
        str     x2,[x0,16]
        ret
Arm is all the rage these days, and I imagine the example above serves
as evidence of why they've earned their fame. Arm Limited is also one of
the most benevolent companies in open source right now. For example,
their MbedTLS library
is one of the most underrated gems I've seen so far. When I started
using it, I originally had this scheme of modifying Arm's code to work
better on x86 hardware instead. I wrote all these tricked out assembly
optimizations bringing it to the same realm of performance as OpenSSL on
x86. MbedTLS is plain, portable, hackable C code, so this is good news
for anyone wanting a crypto library that isn't assembly generated by
Perl. I told the folks at Arm what I was doing, and instead of finding
it subversive they were very kind and encouraging, since that's the kind
of people they are. One day I hope to find time to do what DeepMind did,
and upstream my modifications. Arm is also prolific for
their Optimized
Routines library, which is up there with
double-conversion
in terms of impeccable quality. It's of particular interest to C
libraries, since for decades the open source community has subsisted off
math functions written by Sun Microsystems back in the early 90's. Arm
found a way to improve upon several of them, such
as pow(x,y)
. That's a very high impact thing to do,
considering it's one of the most fundamental operations in math. For
example, if you use Arm's solution in pure software to
implement pow(x,y)
on an x86 machine, then it'll go 5x
faster than Intel's native x87 instructions for doing the same thing.
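I haven't verified that 5x figure myself, but claims like it are cheap to test. Here's a crude harness (mine; the workload and iteration count are arbitrary) for timing whatever pow() your libm provides:

// powbench.c: crude throughput check for your libm's pow()
// build with: cc -O2 -o powbench powbench.c -lm
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void) {
  int i, n = 10000000;
  double sum = 0;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (i = 1; i <= n; i++)
    sum += pow(1.0 + 1e-9 * i, 1.5);  // accumulate so it can't be elided
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double secs = t1.tv_sec - t0.tv_sec + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  printf("%.1f ns per pow() call (sum=%g)\n", secs / n * 1e9, sum);
  return 0;
}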
Since we're so fortunate to have DeepMind entering this game too, I've
taken the liberty of translating their libcxx diff into plain readable C
code, so that everyone can appreciate its beauty. It's another thing I
would have liked to see included in the paper and blog post, because in
this code you'll find the canonical trick that experts use for getting
compilers to generate branchless MOVcc
instructions.
// sorts [a,b]
static inline void Sort2(long *a, long *b) {
  int r = *a < *b;
  long t = r ? *a : *b;
  *b = r ? *b : *a;
  *a = t;
}

// sorts [a,b,c] assuming [b,c] is already sorted
static inline void PartialSort3(long *a, long *b, long *c) {
  int r = *c < *a;
  long t = r ? *c : *a;
  *c = r ? *a : *c;
  r = t < *b;
  *a = r ? *a : *b;
  *b = r ? *b : t;
}

// sorts [a,b,c]
static inline void Sort3(long *a, long *b, long *c) {
  Sort2(b, c);
  PartialSort3(a, b, c);
}
// sorts [a,b,c,d]
static inline void Sort4(long *a, long *b, long *c, long *d) {
  Sort2(a, c);
  Sort2(b, d);
  Sort2(a, b);
  Sort2(c, d);
  Sort2(b, c);
}
// sorts [a,b,c,d,e]
static inline void Sort5(long *a, long *b, long *c, long *d, long *e) {
  Sort2(a, b);
  Sort2(d, e);
  PartialSort3(c, d, e);
  Sort2(b, e);
  PartialSort3(a, c, d);
  PartialSort3(b, c, d);
}
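Sorting networks are notoriously easy to get subtly wrong, so here's a quick exhaustive check I wrote (mine, not part of the libcxx patch) that you can append to the file containing the kernels above. It feeds every permutation of n distinct values through each kernel and verifies the output comes back sorted:

#include <stdio.h>

// Tries every permutation of {1,...,n} and returns 0 if any comes back
// unsorted. Tuples are enumerated with a base-n odometer, skipping the
// ones that repeat a value.
static int Ok(void (*sorter)(long *), int n) {
  long A[5];
  int idx[5] = {0};
  for (;;) {
    int seen = 0, perm = 1, i;
    for (i = 0; i < n; i++) {
      if (seen & (1 << idx[i])) perm = 0;
      seen |= 1 << idx[i];
      A[i] = idx[i] + 1;
    }
    if (perm) {
      sorter(A);
      for (i = 0; i < n; i++)
        if (A[i] != i + 1) return 0;
    }
    for (i = 0; i < n && ++idx[i] == n; i++) idx[i] = 0;
    if (i == n) return 1;
  }
}

static void S2(long *A) { Sort2(A, A + 1); }
static void S3(long *A) { Sort3(A, A + 1, A + 2); }
static void S4(long *A) { Sort4(A, A + 1, A + 2, A + 3); }
static void S5(long *A) { Sort5(A, A + 1, A + 2, A + 3, A + 4); }

int main(void) {
  void (*k[])(long *) = {S2, S3, S4, S5};
  int n;
  for (n = 2; n <= 5; n++)
    printf("Sort%d %s\n", n, Ok(k[n - 2], n) ? "ok" : "BROKEN");
  return 0;
}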
Once I saw the Sort5()
function, I felt like I had gained a
better understanding of what had motivated DeepMind's research. If you
compile the Sort5()
function on ARM64, then your compiler
will produce a function juggling 11 registers. If you were reasoning
about a mathematical equation, then would you be able to hold eleven
variables in your working memory at once? Probably not, which is why
having a nice kernel function like PartialSort3
is so
useful. As sentient creatures, human beings aren't that much different
from the monkeys we once were. The main thing that makes us intelligent
is our ability to take hard problems and break them down into smaller
ones. So it's nice to see deep learning being applied to turbocharging
our abstractions.
It's also worth mentioning that Sort3()
and Sort5()
are kernels themselves, since they're intended
to be building blocks for a conventional sorting function. The blog post
covers this topic, but I thought it'd be useful to share something
that's actually portable and executable.
static inline void InsertionSort(long *A, long n) {
  long i, j, t;
  for (i = 1; i < n; i++) {
    t = A[i];
    j = i - 1;
    while (j >= 0 && A[j] > t) {
      A[j + 1] = A[j];
      j = j - 1;
    }
    A[j + 1] = t;
  }
}

void longsort(long *A, long n) {
  long t, p, i, j;
  switch (n) {
    case 0:
      return;
    case 1:
      return;
    case 2:
      return Sort2(A, A + 1);
    case 3:
      return Sort3(A, A + 1, A + 2);
    case 4:
      return Sort4(A, A + 1, A + 2, A + 3);
    case 5:
      return Sort5(A, A + 1, A + 2, A + 3, A + 4);
    default:
      if (n <= 32) {
        InsertionSort(A, n);
      } else {
        for (p = A[n >> 1], i = 0, j = n - 1;; i++, j--) {
          while (A[i] < p) i++;
          while (A[j] > p) j--;
          if (i >= j) break;
          t = A[i];
          A[i] = A[j];
          A[j] = t;
        }
        longsort(A, i);
        longsort(A + i, n - i);
      }
      break;
  }
}
The above algorithm shows what the new and improved libcxx is doing. It's basically quicksort, except it switches to the sorting kernels and insertion sort when recursing into smaller slices. With libcxx I think they even took the added step of schlepping in heapsort, which is kind of slow, but prevents adversarial inputs from provoking quicksort's quadratic worst case and smashing your stack with deep recursion.
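Here's roughly what that trick looks like, reusing the InsertionSort() above. This is my own rendition of the introsort idea, not the actual libcxx code, and __builtin_clzl is a GCC/Clang extension:

static void HeapSort(long *A, long n) {
  long t, i, j, k;
  for (i = n / 2 - 1; i >= 0; i--)  // build a max-heap by sifting down
    for (j = i; (k = 2 * j + 1) < n; j = k) {
      if (k + 1 < n && A[k + 1] > A[k]) k++;  // pick the larger child
      if (A[j] >= A[k]) break;
      t = A[j], A[j] = A[k], A[k] = t;
    }
  for (i = n - 1; i > 0; i--) {  // pop the max, shrink, re-sift
    t = A[0], A[0] = A[i], A[i] = t;
    for (j = 0; (k = 2 * j + 1) < i; j = k) {
      if (k + 1 < i && A[k + 1] > A[k]) k++;
      if (A[j] >= A[k]) break;
      t = A[j], A[j] = A[k], A[k] = t;
    }
  }
}

static void IntroSort(long *A, long n, int depth) {
  long t, p, i, j;
  if (n <= 32) {
    InsertionSort(A, n);
  } else if (depth <= 0) {
    HeapSort(A, n);  // guaranteed O(n log n) with O(1) stack
  } else {
    for (p = A[n >> 1], i = 0, j = n - 1;; i++, j--) {
      while (A[i] < p) i++;
      while (A[j] > p) j--;
      if (i >= j) break;
      t = A[i], A[i] = A[j], A[j] = t;
    }
    IntroSort(A, i, depth - 1);
    IntroSort(A + i, n - i, depth - 1);
  }
}

void safesort(long *A, long n) {  // hypothetical entry point
  IntroSort(A, n, 2 * (63 - __builtin_clzl(n | 1)));  // ~2*log2(n) budget
}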
The main thing you may be wondering at this point is, can I use this? Do
these sorting network kernels actually make sorting go faster? I would
say yes and no. When all you want is to sort ascending longs, the code
above will go 2x faster than the standard qsort()
function
provided by your C library. Except you don't need the kernels to do
that. What I've determined so far is that, on my personal computer
(which has an Intel Core i9-12900KS) the above function sorts longs at
255 megabytes per second. However if I comment out the sorting kernels:
void longsort(long *A, long n) {
  long t, p, i, j;
  switch (n) {
    case 0:
      return;
    case 1:
      return;
    /* case 2: */
    /*   return Sort2(A, A + 1); */
    /* case 3: */
    /*   return Sort3(A, A + 1, A + 2); */
    /* case 4: */
    /*   return Sort4(A, A + 1, A + 2, A + 3); */
    /* case 5: */
    /*   return Sort5(A, A + 1, A + 2, A + 3, A + 4); */
    default:
      if (n <= 32) {
        InsertionSort(A, n);
      } else {
        for (p = A[n >> 1], i = 0, j = n - 1;; i++, j--) {
          while (A[i] < p) i++;
          while (A[j] > p) j--;
          if (i >= j) break;
          t = A[i];
          A[i] = A[j];
          A[j] = t;
        }
        longsort(A, i);
        longsort(A + i, n - i);
      }
      break;
  }
}
Then my longsort()
function goes 275 megabytes per second.
I achieved 7% better performance by dumbing down the algorithm. I say
let Intel do the damage. This is the function cosmopolitan libc uses to
sort elf symbol tables when loading your executables. The nice thing
about long
is that it's long enough to store an int
key-value pair. Being able to sort map entries fast is a useful trick to
have. Blending naive quicksort with naive insertion sort is the best
solution I've found so far, in terms of striking a balance between
tininess and performance that's just right for my use case. The above
function compiles down to just 181 bytes of x86-64 machine code. Since
DeepMind's sort3()
is only 42 bytes, I was hoping I could
trade away some size to gain a performance advantage. The next
best algorithm I've found so far is radix sort, which
goes 400 MB/s but needs a whopping 763 bytes of binary footprint, in
addition to depending on malloc(). So it would have been
nice to see these kernels do better.
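For what it's worth, here's the kind of crude harness I use to produce numbers like the ones above. The buffer size, trial count, and xorshift generator are arbitrary choices of mine:

// bench.c: measures longsort() throughput in MB/s
// build with: cc -O2 -o bench bench.c longsort.c   (assuming the sort
// lives in longsort.c)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void longsort(long *, long);

int main(void) {
  long n = 1 << 20;  // 8 MiB worth of longs
  long *A = malloc(n * sizeof(long));
  double best = 1e300;
  int trial;
  for (trial = 0; trial < 10; trial++) {
    unsigned long x = 88172645463325252ul;  // xorshift64 seed
    long i;
    for (i = 0; i < n; i++) {  // refill with pseudorandom keys
      x ^= x << 13, x ^= x >> 7, x ^= x << 17;
      A[i] = (long)(x >> 1);
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    longsort(A, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = t1.tv_sec - t0.tv_sec + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    if (s < best) best = s;  // keep the fastest trial
  }
  printf("%.0f MB/s\n", n * sizeof(long) / best / 1e6);
  free(A);
  return 0;
}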
That's not to imply DeepMind's ideas don't have merit. I think it's
important to note that DeepMind was generous enough to give us their
vectorized
quicksort library last year (back when they were called Google
Brain) and in doing so achieved a sorting supremacy that can never be
challenged. Vqsort literally sorts longs at 1155 MB/s on my computer. It
even marginally outperforms
djbsort, which is one of the
most beloved libraries in the open source community even though it never
generalized to more data types than int
. The way both
implementations pulled it off is by vectorizing sorting networks. I
think that's where the sorting network technique truly shines. I imagine
AlphaDev would have done that if it weren't still a toddler as far as
intelligent entities go. When you're starting from first principles, the
baseline instruction set alone is enormously difficult to support. If we
wait, then I think we can expect to see great things from AlphaDev in
the future, as it works its way up to more formidable challenges.
I also just love the fact that DeepMind is making algorithms tinier, since that's something I don't see very often. Size coding is one of my favorite hobbies. On this blog I've published a 383 byte virtual machine for lambda calculus and a lisp machine with garbage collection in 436 bytes. I've also blogged about the size optimization tricks I've used for the cosmopolitan c library. I love DeepMind's parent company too, since Google awarded me an open source peer bonus a few weeks ago, and it's nice to see them sharing my passion for making software smaller. It'd be great to see them use AlphaDev to improve vectorized quicksort. I would love nothing more than to have the world's best longsort not be a C++ monster that adds 24kB of binary footprint. It's 23,000 lines of assembly for ascending long sort alone. Those are the lines of code I can't wait to see AlphaDev ravage someday.
Finally, I like the idea of an artificial intelligence company building machines that write code in machine language. Why wouldn't they? It's in a machine's nature to be a machine. As a builder I find that much less triggering than the future OpenAI is creating, where they've built a great big paternalistic machine that competes with every builder on earth in a zero-sum economy, and then enticed the rent-seekers of the world to take control of that machine through government regulation. I don't think it's progress that OpenAI is promising to automate all the tasks I love doing most, like coding. What I want is to be able to control a machine that can do the things I'm not able to do on my own, like discovering sorting kernels. That's the real kind of progress that's going to help humanity enrich itself, and I view every line of assembly we can shave away as a step in a positive direction towards that dream.