July 13th, 2022 @ justine's web page

Porting OpenBSD pledge() to Linux

OpenBSD is an operating system that's famous for its focus on security. Unfortunately, OpenBSD leader Theo states that there are only 7000 users of OpenBSD. So it's a very small but elite group that wields a disproportionate influence, since we hear all the time about the awesome security features these guys get to use, even though we usually can't use them ourselves.

Pledge is like the forbidden fruit we all covet when the boss says we must use things like Linux. Why does it matter? It's because pledge() actually makes security comprehensible. Linux has never really had a security layer that mere mortals can understand. For example, let's say you want to do something on Linux like control whether or not some program you downloaded from the web is allowed to have telemetry. You'd need to write stuff like this:

static const struct sock_filter kFilter[] = {
    /* L0*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall, 0, 14 - 1),
    /* L1*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[0])),
    /* L2*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 4 - 3, 0),
    /* L3*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 13 - 4),
    /* L4*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[1])),
    /* L5*/ BPF_STMT(BPF_ALU | BPF_AND | BPF_K, ~0x80800),
    /* L6*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 8 - 7, 0),
    /* L7*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 0, 13 - 8),
    /* L8*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[2])),
    /* L9*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 12 - 10, 0),
    /*L10*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 12 - 11, 0),
    /*L11*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 17, 0, 13 - 11),
    /*L12*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /*L13*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(nr)),
    /*L14*/ /* next filter */
};

Oh my gosh. It's like we traded one form of security privilege for another. OpenBSD limits security to a small pond, but makes it easy. Linux is a big tent, but makes it impossibly hard. SECCOMP BPF might as well be the Traditional Chinese of programming languages, since only a small number of people who've devoted the oodles of time it takes to understand code like what you see above have actually been able to benefit from it. But if you've got OpenBSD privilege, then doing the same thing becomes easy:

pledge("stdio rpath", 0);

That's really all OpenBSD users have to do to prevent things like leaks of confidential information. So how do we get it that simple on Linux? I believe the answer is to find someone with enough free time to figure out how to use SECCOMP BPF to implement pledge. The latest volunteer is me, so look upon my code ye mighty and despair.

cosmopolitan/libc/calls/pledge.c: system call
cosmopolitan/libc/calls/pledge-linux.c: system call polyfill
cosmopolitan/tool/build/pledge.c: pledge command
cosmopolitan/test/libc/calls/pledge_test.c: unit tests

There's been a few devs in the past who've tried this. I'm not going to name names, because most of these projects were never completed. When it comes to SECCOMP, the online tutorials only explain how to whitelist the system calls themselves, so most people lose interest before figuring out how to filter arguments. The projects that got further along also had oversights like allowing the changing of setuid/setgid/sticky bits. So none of the current alternatives should be used. I believe this effort gets us much closer to having pledge() than ever before.
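To make "filtering arguments" concrete, here's a minimal sketch of a SECCOMP BPF program that looks at an argument rather than just the system call number. It is not the filter pledge() generates. It assumes x86-64 Linux, allows everything by default, and only returns EPERM for socket() calls whose first argument isn't AF_UNIX. The names kLocalSocketsOnly and InstallLocalSocketsOnly are made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>

#define OFF(f) offsetof(struct seccomp_data, f)

// Hypothetical filter: socket() is only permitted for AF_UNIX.
// Jump offsets are relative to the instruction after the jump.
struct sock_filter kLocalSocketsOnly[] = {
    /*0*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(arch)),
    /*1*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2),
    /*2*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(nr)),
    // numbers at or above 0x40000000 belong to the x32 ABI: refuse them
    /*3*/ BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x40000000, 0, 1),
    /*4*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    // anything that isn't socket() skips ahead to the allow at 9
    /*5*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
    // low 32 bits of the first argument (the address family)
    /*6*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[0])),
    /*7*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 1, 0),
    /*8*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    /*9*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

// Installs the filter on the calling thread. Returns 0 on success, or
// -1 with errno set. This can't be undone once it succeeds.
int InstallLocalSocketsOnly(void) {
  struct sock_fprog prog = {
      .len = sizeof(kLocalSocketsOnly) / sizeof(kLocalSocketsOnly[0]),
      .filter = kLocalSocketsOnly,
  };
  assert(prog.len <= BPF_MAXINSNS);
  // needed so that an unprivileged process may install a filter
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) return -1;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After a successful InstallLocalSocketsOnly(), a call like socket(AF_INET, SOCK_STREAM, 0) should fail with EPERM while socket(AF_UNIX, ...) keeps working. A real policy would deny by default and whitelist instead, which is where the instruction count starts to balloon.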

Command Line Utility

I originally wrote my pledge() polyfill for the redbean web server as a sandboxing solution. However it turns out pledge() is robust enough as an abstraction that I thought it'd be useful to create a small command line utility which launches processes under pledge(), so that anyone can use it, without having to configure it in C code.

pledge-1.8.com
88kb - x86-64 elf executable (debug data, source code)
Written by Justine Alexandra Roberts Tunney (Twitter, GitHub, LinkedIn)
22d33574e244883a87e54169f4ed82ea40cabb17b79c9e57559b0fa8454dd698

That binary will work on all Linux distros since RHEL6. Root privileges are not required. You just use it to wrap your command invocations. It's so tiny and lightweight that it only adds a few microseconds of startup latency to your program. It's great for shell scripts and automated tools. For example, if you want to run the list directory command, and only permit that command to do basic stdio (-p stdio) and filesystem path (-p rpath) reading in the current directory (-v .), then you'd say:

$ wget https://justine.lol/pledge/pledge.com
$ chmod +x pledge.com
$ ./pledge.com -v. -p 'stdio rpath' ls
file listing output...

You can now be certain your ls command isn't doing things like spying on you, or uploading your bitcoin wallet to the cloud. However let's say authorizing network access is what you want. One command that has a real legitimate need for that is curl. However, since it needs DNS, it's a little trickier because DNS is the Hunger Games of systems engineering, and not all Libc implementations agree on how it should be implemented. Here are some strategies depending on your tools and distro:

# standard curl on alpine linux 3.16 (musl)
./pledge.com -p 'stdio rpath dns inet' \
  curl -s http://justine.lol/hello.txt

# standard curl on ubuntu 22.04 (glibc)
./pledge.com -p 'stdio rpath inet dns tty sendfd recvfd' \
  curl -s http://justine.lol/hello.txt
hello world

# cosmopolitan's curl as static binary
# see git clone and make instructions below
./assimilate.com ./curl.com
./pledge.com -p 'stdio rpath dns inet' \
  ./curl.com https://justine.lol/hello.txt

# cosmopolitan's curl as ape binary
# non-assimilated cosmopolitan ape binary
./pledge.com -p 'stdio rpath prot_exec dns inet' \
  ./curl.com https://justine.lol/hello.txt

The choice of C library usually impacts which permissions are needed. Musl and Cosmopolitan need the least permission since they were built with sandboxing in mind. Glibc on the other hand does some strange stuff with DNS, which requires us to weaken the sandbox with recvmsg() and sendmsg() which also enable SCM_RIGHTS unfortunately.

Both Musl and Glibc use dynamic binaries. In order to be able to launch them, pledge.com temporarily implies both exec and prot_exec. We then inject an LD_PRELOAD library which runs inside the process at initialization. That library calls pledge() again automatically, and drops both the exec and prot_exec privileges if needed. This dynamic library also lets us print helpful messages to stderr to explain which promises are needed when a violation occurs.

Let's say you have a public ssh server and you want to let people read and take notes of your book collection, but you don't want anyone rewriting your books. In that case, you can repurpose something like the nano command as a strictly read-only editor. Since nano has a TUI, you'd need to grant it TTY privileges.

./pledge.com -v $HOME/books -np 'stdio rpath tty' nano ~/books/bofh.txt

Here's how you'd sandbox Vim to only be able to change the current directory, tested on Alpine and Ubuntu.

./pledge.com \
  -v rwc:. \
  -v /etc/vim \
  -v $HOME/.vimrc \
  -v /usr{,/local}/share/vim \
  -p 'stdio rpath wpath cpath tty prot_exec' \
  vim

Here's how you'd sandbox Emacs to only be able to change the current directory, tested on Alpine and Ubuntu.

./pledge.com \
  -v rwc:. \
  -v $HOME/.emacs \
  -v rwc:$HOME/.emacs.d \
  -v /etc/emacs \
  -v /etc/passwd \
  -v /usr/share/X11/locale \
  -v /usr{,/local}/{libexec,share}/emacs \
  -p 'stdio rpath wpath cpath tty proc tmppath prot_exec' \
  emacs -nw

Troubleshooting

If your program crashes, then you can figure out why by tracing the binary and seeing which system call is EPERM'ing or which veiled path is EACCES'ing. For example, let's see what happens if we reduce the privileges to just stdio.

$ strace -ff ./pledge.com -p stdio ls
open("/etc/ld-musl-x86_64.path", O_RDONLY|O_CLOEXEC) = -1 EPERM (Operation not permitted)

Well that didn't take long. Now that you know what's wrong, you would then consult the Promises section to see which promise you need. For example, you'd know open(O_RDONLY) is provided by rpath and that in order to fork() you need -p proc.

Resource Limits

In addition to polyfilling pledge, your pledge command is also able to apply some other very important safety hacks that aren't obvious to the uninitiated. For example, we've all run a program before that hammers the system. Linux is very generous in how much memory programs can allocate. An accidental loop in just one program, by default on Linux, will absolutely take the whole machine out of commission for a few minutes before the "OOM Killer" kicks in. In other cases, like a fork() bomb, the default Linux environment provides no such protection, so it's essentially equivalent to a blue screen of death.

Your pledge command imposes some perfectly reasonable resource quotas on programs by default, to prevent that from happening. By default, unless you tune the flags, a program is allowed to use only the amount of memory you have. If you've permitted it to fork off new processes, then it won't be able to spawn more of them at the same time than twice your number of CPUs. This way if your sandboxed program gets out of control, it'll most likely crash itself before it can crash your whole computer.

We also have a niceness feature. Have you ever had a program use so much disk i/o that everything crawls to a halt? You run some program, and then suddenly every small file takes seconds to load in Emacs? Your pledge command can fix that. If you've got a compute heavy long running program, then pass the -n flag for a nice that's actually nice. The naive nice command doesn't really do much, since it doesn't change the scheduler and it doesn't change the i/o priority. This command actually does. Using the -n flag will guarantee the sandboxed program will stay out of the way, since the kernel will only let it use spare capacity.

Pledge Command Flags

-n

Apply maximum niceness to program. This means

nice is set to 19,
i/o priority is set to idle, and
scheduler is set to idle.
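Those three steps map onto three ordinary Linux calls. Here's a rough sketch of what applying them to the current process could look like. It is not lifted from pledge.com, the helper names BeMaximallyNice and IoprioValue are hypothetical, and the ioprio constants are written out by hand since the kernel header that defines them may not be installed everywhere.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for SCHED_IDLE and syscall()
#endif
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

// Linux ABI values for the ioprio_set system call.
enum {
  kIoprioWhoProcess = 1,
  kIoprioClassIdle = 3,
  kIoprioClassShift = 13,
};

// Packs an i/o scheduling class and a priority level into one value.
int IoprioValue(int klass, int level) {
  assert(0 <= level && level < (1 << kIoprioClassShift));
  return (klass << kIoprioClassShift) | level;
}

// Hypothetical helper: makes the calling process as nice as possible.
// Returns 0 if all three steps succeeded, otherwise -1.
int BeMaximallyNice(void) {
  int rc = 0;
  struct sched_param param = {0};  // priority must be 0 for SCHED_IDLE
  // 1. nice is set to 19
  if (setpriority(PRIO_PROCESS, 0, 19)) rc = -1;
  // 2. i/o priority is set to idle (there's no libc wrapper for this)
  if (syscall(SYS_ioprio_set, kIoprioWhoProcess, 0,
              IoprioValue(kIoprioClassIdle, 0))) rc = -1;
  // 3. scheduler is set to idle
  if (sched_setscheduler(0, SCHED_IDLE, &param)) rc = -1;
  return rc;
}
```

A launcher would call something like BeMaximallyNice() right before execve(), since all three settings are inherited by the program it starts.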

-p PROMISES

Defaults to -p 'stdio rpath'. It's repeatable. See also the Promises section below, which goes into much greater depth on what each category does. May contain any of the following, separated by spaces:

stdio: allow stdio, threads, and benign system calls
rpath: read-only path ops
wpath: write path ops
cpath: create path ops
dpath: create special files
flock: allow file locking
tty: terminal ioctls
recvfd: allow recvmsg
sendfd: allow sendmsg
fattr: allow changing some struct stat bits
inet: allow IPv4 and IPv6
unix: allow local sockets
dns: allow dns
proc: allow fork process creation and control
id: allow setuid and friends
exec: allow executing binaries
prot_exec: allow creating executable memory (dynamic / ape)
vminfo: allow reading system statistics such as /proc/stat (for tools like htop)
tmppath: allow unlink, unlinkat, and lstat for temporary files under paths like /tmp
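Since -p is repeatable and each value is a space-separated list, one way to think about it is as a bitmask that gets OR'd together across flags. The sketch below is only an illustration of that idea. ParsePromises and the bit ordering are hypothetical and are not how pledge.com represents promises internally.

```c
#include <assert.h>
#include <string.h>

// Promise names in the order listed above; bit i stands for kPromises[i].
static const char *const kPromises[] = {
    "stdio",  "rpath", "wpath", "cpath", "dpath",     "flock",  "tty",
    "recvfd", "sendfd", "fattr", "inet",  "unix",      "dns",    "proc",
    "id",     "exec",  "prot_exec", "vminfo", "tmppath",
};

// Hypothetical helper: returns a bitmask of the promises named in s,
// or -1 if s contains a word that isn't in the table.
long ParsePromises(const char *s) {
  long mask = 0;
  const size_t count = sizeof(kPromises) / sizeof(kPromises[0]);
  assert(s);
  while (*s) {
    size_t i, n = strcspn(s, " ");  // length of the next word
    if (n) {
      for (i = 0; i < count; ++i) {
        if (strlen(kPromises[i]) == n && !memcmp(kPromises[i], s, n)) break;
      }
      if (i == count) return -1;  // unknown promise
      mask |= 1L << i;
    }
    s += n;
    if (*s) ++s;  // skip the separating space
  }
  return mask;
}
```

With this, -p stdio -p inet and -p 'inet stdio' come out the same, because ParsePromises("stdio") | ParsePromises("inet") equals ParsePromises("inet stdio").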

-v [PERM:]PATH

Unveils path. By default, your sandbox restricts access to all file system paths (except for metadata [e.g. file size] which is always visible). Using this flag will allow a new path to be used, and it may be a directory too in which case all paths beneath that folder are allowed. PERM defaults to r and may have any of the following:

r makes PATH available for read-only path operations, corresponding to the pledge promise "rpath".
w makes PATH available for write operations, corresponding to the pledge promise "wpath".
x makes PATH available for execute operations, corresponding to the pledge promises "exec" and "execnative".
c allows PATH to be created and removed, corresponding to the pledge promise "cpath".

Some paths are implicitly defined by pledge.com depending on which promises you've used. See the Implicitly Unveiled Paths section for further details. Unveiling is implemented using Landlock which requires Linux Kernel 5.13+. On older kernels, all filesystem paths will be allowed (unless you use the chroot flag).
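To illustrate the [PERM:]PATH syntax, here's a small sketch of how such an argument could be split apart. ParseUnveilArg and the k* constants are hypothetical names, and the rule for deciding whether a prefix counts as PERM (non-empty, made only of the letters r, w, x, c, and followed by a colon) is my assumption rather than something pledge.com documents.

```c
#include <assert.h>
#include <string.h>

enum { kRead = 1, kWrite = 2, kExec = 4, kCreate = 8 };

// Hypothetical helper: splits a "[PERM:]PATH" argument. Stores the
// permission bits in *perm and returns a pointer to the PATH part.
const char *ParseUnveilArg(const char *arg, int *perm) {
  size_t i, n;
  assert(arg && perm);
  n = strspn(arg, "rwxc");  // length of the leading run of perm letters
  if (!n || arg[n] != ':') {
    *perm = kRead;  // PERM defaults to r
    return arg;
  }
  *perm = 0;
  for (i = 0; i < n; ++i) {
    switch (arg[i]) {
      case 'r': *perm |= kRead; break;
      case 'w': *perm |= kWrite; break;
      case 'x': *perm |= kExec; break;
      default: *perm |= kCreate; break;  // only 'c' is left
    }
  }
  return arg + n + 1;  // skip past the colon
}
```

So "rwc:." yields read, write, and create on ".", while a bare "/etc/vim" yields read-only access. One consequence of the assumed rule is that a relative path such as "c:foo" would be read as PERM c on "foo", so it would need to be written as "./c:foo".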

-N

Don't normalize file descriptors. By default, pledge.com guarantees (1) the stdio file descriptors exist, and (2) file descriptors that the parent process or shell forgot to close will be closed. We do this using close_range() which needs Linux 5.9+. On older kernels, we use poll() to quickly make sure the first 256 file descriptors are safe, but the number may be lower depending on system limits.
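For the curious, here's a portable sketch of what "normalizing" means. It deliberately swaps in a simpler technique than the close_range() and poll() approach described above: it probes the stdio descriptors with fcntl() and closes the rest one at a time. NormalizeFileDescriptors is a hypothetical name and the limit parameter is my own addition.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: makes sure fds 0, 1, and 2 are open, then closes
// every descriptor from 3 up to (but not including) limit.
// Returns 0 on success, or -1 if /dev/null couldn't be put in place.
int NormalizeFileDescriptors(int limit) {
  int fd;
  assert(limit >= 3);
  for (fd = 0; fd < 3; ++fd) {
    if (fcntl(fd, F_GETFD) == -1 && errno == EBADF) {
      // open() hands out the lowest free number. Everything below fd
      // is already open at this point, so the result should be fd.
      if (open("/dev/null", O_RDWR) != fd) return -1;
    }
  }
  for (fd = 3; fd < limit; ++fd) {
    close(fd);  // EBADF here just means it wasn't open
  }
  return 0;
}
```

Guaranteeing that stdio exists matters because a program that opens a file while fd 1 is closed could end up printing its log messages into that file. Closing the leftovers matters because an inherited directory descriptor is exactly the kind of thing that makes a chroot() jail escapable.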

-V

Disable unveiling (i.e. only pledge)

-T pledge

If SECCOMP BPF isn't available, then pledge.com will launch your command anyway without it. If you need a guarantee that restrictions are imposed, then you can run pledge.com -T pledge to test for the availability of this feature. Please note this only impacts very old Linux systems like RHEL5 since SECCOMP was introduced around 2010.

-T unveil

If Landlock LSM isn't available (introduced in 2021) then pledge.com will launch your command anyway without it. If you need a guarantee that filesystem restrictions are imposed, then you can run pledge.com -T unveil to test for the availability of this feature. It exits 0 if unveil() is supported by the host system.

-g GID

Call setgid() before executing program (not allowed if setuid binary)

-u UID

Call setuid() before executing program (not allowed if setuid binary)

-c PATH

Call chroot() before executing program (needs root privileges)

-C SECS

Set CPU time limit in seconds. By default, this is not changed from what was inherited (which in practice means unlimited). A negative value means unlimited. If the requested limit is greater than the limit already being imposed by a parent process, then the limit will be silently scaled down. If this limit is violated, then a SIGXCPU signal is sent to your program, after which it has precisely one second to gracefully shut down before SIGKILL is used.

-M BYTES

Set virtual memory limit. The default is the total amount of physical RAM on the host machine. This is specified in bytes and may use SI notation. A negative value means unlimited. If the requested limit is greater than the limit already being imposed by a parent process, then the limit will be silently scaled down. When this limit is violated, mmap() will start returning ENOMEM which will trickle down into functions like malloc() failing.

-P PROCS

Sets process and thread limit. This applies user-wide. The default is the preexisting process count added to the cpu count. A negative value means unlimited. If the requested limit is greater than the limit already being imposed by a parent process, then the limit will be silently scaled down. If this limit is violated, functions like fork() will start returning EAGAIN.

-F BYTES

Sets resource limit on individual file sizes. The default is 256mb. This is specified in bytes and may use SI notation. A negative value means unlimited. If the requested limit is greater than the limit already being imposed by a parent process, then the limit will be silently scaled down. If this limit is violated, then a SIGXFSZ signal is sent to your program. If the limit is 150% exceeded then SIGKILL is used.

-O COUNT

Sets file descriptor limit. If this limit is violated, functions like open() will start returning EMFILE.
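The "silently scaled down" behavior shared by the resource flags above can be sketched with plain getrlimit() and setrlimit(). This is an illustration under assumptions, not the pledge.com source: ClampLimit and SetLimit are hypothetical names, and for simplicity the sketch sets the soft and hard limits to the same value, which isn't necessarily what each flag does.

```c
#include <assert.h>
#include <sys/resource.h>

// Returns the smaller of the requested limit and the inherited hard
// limit. This relies on RLIM_INFINITY being the largest rlim_t value,
// which holds on x86-64 Linux, so "unlimited" needs no special case.
rlim_t ClampLimit(rlim_t want, rlim_t hard) {
  return want < hard ? want : hard;
}

// Hypothetical helper: applies a limit to the calling process, scaling
// it down rather than failing if it exceeds what the parent imposed.
// Returns 0 on success, or -1 with errno set.
int SetLimit(int resource, rlim_t want) {
  struct rlimit rlim;
  assert(resource >= 0);
  if (getrlimit(resource, &rlim)) return -1;
  rlim.rlim_cur = rlim.rlim_max = ClampLimit(want, rlim.rlim_max);
  return setrlimit(resource, &rlim);
}
```

For example, SetLimit(RLIMIT_NOFILE, 64) would correspond to something like -O 64. Without the clamp, an unprivileged process asking for more than its inherited hard limit would just get EPERM back from setrlimit().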

Implicitly Unveiled Paths

The pledge.com program will automatically unveil the following paths for your convenience when certain conditions are met. In most cases, we use the categories you've pledged as a hint as to what needs unveiling. Please note that this automatic unveiling does not apply to the Linux C API interface for pledge(), where unveil() must be called explicitly. However OpenBSD will unveil some key paths for things like stdio. The files we've chosen below are a superset of what OpenBSD does, intended to conform to the same principles adapted for Linux.

pledge("stdio"):
-v /dev/fd
-v w:/dev/log
-v /dev/zero
-v rw:/dev/null
-v rw:/dev/full
-v rw:/dev/stdin
-v rw:/dev/stdout
-v rw:/dev/stderr
-v /dev/urandom
-v /etc/localtime
-v rw:/proc/self/fd
-v /proc/self/stat
-v /proc/self/status
-v /usr/share/locale
-v /proc/self/cmdline
-v /usr/share/zoneinfo
-v /proc/sys/kernel/version
-v /usr/share/common-licenses
-v /proc/sys/kernel/ngroups_max
-v /proc/sys/kernel/cap_last_cap
-v /proc/sys/vm/overcommit_memory

pledge("rpath"):
-v /proc/filesystems

pledge("inet"):
-v /etc/ssl/certs/ca-certificates.crt

pledge("dns"):
-v /etc/hosts
-v /etc/hostname
-v /etc/services
-v /etc/protocols
-v /etc/resolv.conf

pledge("tty"):
-v rw:$PTY
-v rw:/dev/tty
-v rw:/dev/console
-v /etc/terminfo
-v /usr/lib/terminfo
-v /usr/share/terminfo

pledge("prot_exec"):
-v rx:/usr/bin/ape

pledge("vminfo"):
-v /proc/stat
-v /proc/meminfo
-v /proc/cpuinfo
-v /proc/diskstats
-v /proc/self/maps
-v /sys/devices/system/cpu

pledge("tmppath"):
-v rwc:/tmp
-v rwc:$TMPPATH

for dynamic executables only:
-v rx:/lib
-v rx:/lib64
-v rx:/usr/lib
-v rx:/usr/lib64
-v rx:/usr/local/lib
-v rx:/usr/local/lib64
-v /etc/ld-musl-x86_64.path
-v /etc/ld.so.conf
-v /etc/ld.so.cache
-v /etc/ld.so.conf.d
-v /etc/ld.so.preload

Securing APE Binaries

Actually Portable Executables should be written to call pledge() internally. But if you want to secure an APE binary that doesn't, using the pledge.com command, then you need to convert (or "assimilate") it into the ELF format beforehand. You can usually do this by saying:

$ file redbean.com
redbean.com: DOS/MBR boot sector
$ ./redbean.com --assimilate
$ file redbean.com
redbean.com: ELF 64-bit LSB executable

Please note that this won't work if you're using binfmt_misc with the new APE Loader, because then you can't run the APE shell script to assimilate your binary. We instead provide a new assimilate.com program which can be used to convert APE programs to ELF or Mach-O.

assimilate.com
Works on x86-64 Linux+Mac+Windows+FreeBSD+NetBSD+OpenBSD
92kb - PE+ELF+MachO+ZIP+SH executable (debug data, source code)
Written by Justine Alexandra Roberts Tunney (Twitter, GitHub, LinkedIn)
593a8119049e9e8a88d29f80af83bfdbb5fcdd8a4cbad934af05dd6a5145ce77

C API

Pledge works best when developing software using Cosmopolitan Libc. You can get started relatively easily writing pledge() programs using the cosmopolitan monorepo. The zero config solution is to just plop this program file into the examples folder. Start by cloning the repo:

$ git clone https://github.com/jart/cosmopolitan
$ cd cosmopolitan
$ nano examples/mypledge.c

You can then copy and paste this code:

#include "libc/calls/calls.h"
#include "libc/stdio/stdio.h"

int main() {
  pledge("stdio", 0);
  printf("hello world\n");
}

You can then build and run your program as follows:

$ make -j8 o//examples/mypledge.com
$ o//examples/mypledge.com
hello world

One of the things you may have noticed about the pledge.com command, is its most restrictive mode (pledge.com -p "" cmd...) can't actually be used. Your program will just crash. That's because it's intended for the C API. What it means is that your process or thread won't be able to call any system call except exit. Such a program might sound impossible, but you can actually communicate between processes using shared memory. For example, here's how you'd do it with threads.

int enclave(void *arg, int tid) {
  if (pledge("", 0)) return 1;
  int *job = arg;            // get job
  job[0] = job[0] + job[1];  // do work
  return 0;                  // exit
}
int main() {
  struct spawn worker;
  int job[2] = {2, 2};            // create workload
  _spawn(enclave, job, &worker);  // create worker
  _join(&worker);                 // wait for exit
  assert(job[0] == 4);            // check result
}

The above example shows an enclaved worker doing some kind of computational task, possibly executing untrusted code, and then storing the result to some memory location that the parent thread can see when the worker has finished executing. It works great and is fast.

One of the disadvantages of the above example, is that the enclaved worker has unfettered access to your stack memory and might make a mess of things. That's potentially creepy and not very enclaved. One way to fix that is to use fork() instead of threads. In that case, you can explicitly whitelist which memory is shared.

int ws;
// create small shared memory region
int *job = mmap(0, FRAMESIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
job[0] = 2;  // create workload
job[1] = 2;
if (!fork()) {  // create enclaved worker
  if (pledge("", 0)) _Exit(1);
  job[0] = job[0] + job[1];  // do work
  _Exit(0);
}
wait(&ws);  // wait for worker
assert(WIFEXITED(ws));
assert(WEXITSTATUS(ws) == 0);
assert(job[0] == 4);  // check result
munmap(job, FRAMESIZE);

Most of the Cosmopolitan Libc unit tests have been set up to use pledge() these days. Not necessarily because we're concerned about them being compromised, but because the pledge function has outstanding documentation value in helping people understand our tests, since it readily communicates what system functionality they need. For example, our tests for the access() filesystem function say:

__attribute__((__constructor__)) static void init(void) {
  pledge("stdio rpath wpath cpath fattr", 0);
  errno = 0;
}

System Call Origin Verification

When you write your own Actually Portable Executables, you also get some added security benefits compared to pledge.com. For example, another famous OpenBSD system call is msyscall() which causes the kernel to validate the RIP register of anything that issues a system call. In Cosmopolitan, calling pledge() will polyfill that feature too automatically, to only allow functions which are annotated with the privileged keyword to use SYSCALL. What that means is if someone manages to compromise your server to inject executable code into your program's memory, then that code effectively will have pledge("", 0) privileges, even if when your app called pledge(), it specified something much broader. The redbean web server's unix.pledge() function is also able to take advantage of this.

Caveats

File system access is a blind spot [update 2022-07-22: we now have unveil() thanks to the Landlock system calls introduced a year ago]. OpenBSD solves this with another famous system call called unveil(), which lets users control file system paths too. Right now there's no clear way to implement that for Linux. However our pledge() polyfill does do a reasonable job in restricting which file system operations are possible. But once you permit the file system ops, the ops are allowed to happen on pretty much any file the user has access to.

I personally don't view this as a problem. What I love about pledge.com is it tells me if the programs I run that I downloaded from random strangers on the Internet, are actually the good little command line citizens that they claim to be. For example, if I download a tool for computing some math, or compressing a file, then it really shouldn't need any access except -p "stdio rpath" especially if I'm able to use pipes. So I can use pledge.com to make sure the command keeps its promise and lets me know if there's any surprising behaviors. So this is great security if you're dealing with command line programs that are written in a conscientious manner. If it's only able to read files and can't talk to the Internet, then seriously, what could it possibly do? It's such a simple pareto-optimized niche that I can't believe no one's made it easily addressable until now.

However, there's always going to be that one program you want that's power hungry, possibly due to bloated frameworks and dependencies. In that case, we may want access to some (but not all) of the file system. pledge.com is able to address the need somewhat using chroot(). It's worth noting though that chroot() has weaknesses that kernel devs have refused to fix for decades. Most of the docs on this subject are unprofessional and crazy. For example, the chroot(2) man page is probably the only category 2 man page I've ever seen that uses shell script code to describe its functionality. As far as I can tell, the only convincing weakness with chroot() is that the jail is only locked from the inside. If you take away the freedom of a process by putting it in a chroot jail, then another process that's free can use its freedom to bust its friend out of jail. For example, here's how root can leave a backdoor that lets the process escape:

mkdir("/tmp/mydir", 0755);
// privileged user opens a backdoor
int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY);
// process enters chroot jail
chdir("/tmp/mydir");
chroot("/tmp/mydir");
// process escapes jail
fchdir(dirfd);
chdir("..");
// list root directory
struct dirent *e;
DIR *d = opendir(".");
while ((e = readdir(d))) {
  printf("%s\n", e->d_name);
}
closedir(d);

The Linux devs could fix that if they wanted to. However I personally don't see why it's a total dealbreaker, since pledge.com helps avoid it by closing rogue file descriptors at startup using poll(). What's even more surprising is that this weakness is also exploitable on OpenBSD, since they too seem to have given up on securing the traditional chroot() call. But at least OpenBSD provides an alternative that's easy to use, called unveil(). It'd be great to see that leadership from the Linux kernel, but instead we just see blog posts from companies like RedHat saying that having chroot() will make us more insecure than having no security at all. It's like banning locks because lockpick kits exist. RedHat must be experts at mental gymnastics to publish such communiqués. It's also comical that Linux addresses the problem by restricting chroot() to the root user account, since clearly something which is so "insecure" will become more secure if you only do it from the most privileged user. What an unfortunate state of affairs, since many of us have needed to look elsewhere for answers, and the only answers on offer right now are bloatware like Docker that locks in your filesystem with a bunch of cryptically named tar files. And they say that Docker isn't a security layer too! Even though it's based on things like cgroups which are even more elite and difficult to understand than SECCOMP BPF. We can only guess why the kernel devs do it. Maybe they're afraid of issue workload burnout and figure people won't complain about security if no one understands it! That's something we're working to change.

It should also be noted that there's some features OpenBSD bakes into pledge() that we're not able to polyfill with Linux SECCOMP BPF. One of the things OpenBSD does is it can check file system paths, in order to loosen up restrictions around things like accessing the time zone database. This isn't a problem if you're a Cosmopolitan Libc user. Because APE binaries don't read tzdata from the filesystem and instead embed time zone data inside the ZIP structure of the binary. However it could potentially be problematic if you're using pledge.com to launch binaries that are provided by your distro. Ask your friendly distro maintainers to improve their security solutions. If they can't, then you can always switch to Cosmopolitan Libc.

Another caveat is that, so far, I've only implemented the things described in the OpenBSD pledge(2) manual page. We still need to reconcile this properly with the primary materials which would be the OpenBSD pledge() kernel source code. We also need more community feedback to make sure there aren't things we haven't considered. For example, Linux has a lot of sneaky capabilities in a shifting landscape that aren't always widely understood, which can potentially bite the authors of security tools, even when they've done due diligence.

I've also only really tested this on console applications. If you want a pledge() that's likely to work with GUIs, then, knowing the way the Linux desktop goes, you really should consider SerenityOS since Andreas added pledge() support a couple years ago.

Pledge Documentation

Pledging causes most system calls to become unavailable. Your system call policy is enforced by the kernel, which means it can propagate across execve() if permitted. This system call is supported on OpenBSD and Linux where it's polyfilled using SECCOMP BPF. The way it works on Linux is verboten system calls will raise EPERM whereas OpenBSD just kills the process while logging a helpful message to /var/log/messages explaining which promise category you needed.

By default exit() is allowed. This is useful for processes that perform pure computation and interface with the parent via shared memory. On Linux we mean sys_exit (_Exit1), not sys_exit_group (_Exit). The difference is effectively meaningless, since _Exit() will attempt both. All it means is that, if you're using threads, then a pledge("", 0) thread can't kill all your threads unless you pledge("stdio").

Once pledge is in effect, the chmod functions (if allowed) will not permit the sticky/setuid/setgid bits to change. Linux will EPERM here and OpenBSD should ignore those three bits rather than crashing.
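As an illustration of how a rule like this can be expressed in SECCOMP BPF, here's a sketch that masks the mode argument and returns EPERM when any of the three special bits are set. It is not the code pledge() emits. It assumes x86-64, only covers fchmod() (chmod() and fchmodat() would need the same check on their own mode argument), allows everything else, and the name kNoSpecialModeBits is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#define OFF(f) offsetof(struct seccomp_data, f)

// Hypothetical filter: fchmod(fd, mode) fails with EPERM if mode has
// the setuid (04000), setgid (02000), or sticky (01000) bit set.
// Jump offsets are relative to the instruction after the jump.
struct sock_filter kNoSpecialModeBits[] = {
    /* 0*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(arch)),
    /* 1*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2),
    /* 2*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(nr)),
    // numbers at or above 0x40000000 belong to the x32 ABI: refuse them
    /* 3*/ BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x40000000, 0, 1),
    /* 4*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    // anything that isn't fchmod() skips ahead to the allow at 10
    /* 5*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_fchmod, 0, 4),
    // low 32 bits of the second argument (the mode)
    /* 6*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[1])),
    /* 7*/ BPF_STMT(BPF_ALU | BPF_AND | BPF_K, 07000),
    // no special bits set means allow, otherwise fall into EPERM
    /* 8*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 1, 0),
    /* 9*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    /*10*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

Note that a filter like this can only reject a mode that contains those bits. It has no way to tell whether the bits were already set on the file, since it only sees the system call arguments.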

User and group IDs can't be changed once pledge is in effect. OpenBSD should ignore chown without crashing; whereas Linux will just EPERM.

Memory functions won't permit creating executable code after pledge. Restrictions on origin of SYSCALL instructions will become enforced on Linux (cf. msyscall()) after pledge too, which means the process gets killed if SYSCALL is used outside the .privileged section. One exception is if the "exec" group is specified, in which case these restrictions need to be loosened.

Using pledge is irreversible. On Linux it causes PR_SET_NO_NEW_PRIVS to be set on your process; however, if "id" or "recvfd" are allowed then they theoretically could permit the gaining of some new privileges. You may call pledge() multiple times if "stdio" is allowed. In that case, the process can only move towards a more restrictive state.

pledge() can't filter file system paths or internet addresses. For example, if you enable a category like "inet" then your process will be able to talk to any internet address. The same applies to categories like "wpath" and "cpath"; if enabled, any path the effective user id is permitted to change will be changeable.

The Linux pledge() polyfill isn't able to support the OpenBSD execpromises parameter.

Promises

Your promises argument is a string that may include any of the following groups delimited by spaces.

stdio: allows exit_group, close, dup, dup2, dup3, fchdir, fstat, fsync, fdatasync, ftruncate, getdents, getegid, getrandom, geteuid, getgid, getgroups, getitimer, getpgid, getpgrp, getpid, getppid, getresgid, getresuid, getrlimit, getsid, wait4, gettimeofday, getuid, lseek, madvise, brk, arch_prctl, uname, set_tid_address, clock_getres, clock_gettime, clock_nanosleep, mmap (PROT_EXEC and weird flags aren't allowed), mprotect (PROT_EXEC isn't allowed), msync, munmap, nanosleep, pipe, pipe2, read, readv, pread, recv, poll, recvfrom, preadv, write, writev, pwrite, pwritev, select, send, sendto (only if addr is null), setitimer, shutdown, sigaction (but SIGSYS is forbidden), sigaltstack, sigprocmask, sigreturn, sigsuspend, umask, socketpair, ioctl(FIONREAD), ioctl(FIONBIO), ioctl(FIOCLEX), ioctl(FIONCLEX), fcntl(F_GETFD), fcntl(F_SETFD), fcntl(F_GETFL), fcntl(F_SETFL).
rpath: (read-only path ops) allows chdir, getcwd, open(O_RDONLY), openat(O_RDONLY), stat, fstat, lstat, fstatat, access, faccessat, readlink, readlinkat, statfs, fstatfs.
wpath: (write path ops) allows getcwd, open(O_WRONLY), openat(O_WRONLY), stat, fstat, lstat, fstatat, access, faccessat, readlink, readlinkat, chmod, fchmod, fchmodat.
cpath: (create path ops) allows open(O_CREAT), openat(O_CREAT), rename, renameat, renameat2, link, linkat, symlink, symlinkat, unlink, rmdir, unlinkat, mkdir, mkdirat.
dpath: (create special path ops) allows mknod, mknodat, mkfifo.
chown: (file ownership changes) allows chown, fchown, lchown, fchownat.
flock: allows flock, fcntl(F_GETLK), fcntl(F_SETLK), fcntl(F_SETLKW).
tty: allows ioctl(TIOCGWINSZ), ioctl(TCGETS), ioctl(TCSETS), ioctl(TCSETSW), ioctl(TCSETSF).
recvfd: allows recvmsg(SCM_RIGHTS).
fattr: allows chmod, fchmod, fchmodat, utime, utimes, futimens, utimensat.
inet: allows socket(AF_INET), listen, bind, connect, accept, accept4, getpeername, getsockname, setsockopt, getsockopt, sendto.
unix: allows socket(AF_UNIX), listen, bind, connect, accept, accept4, getpeername, getsockname, setsockopt, getsockopt.
dns: allows socket(AF_INET), sendto, recvfrom, connect.
proc: allows fork, vfork, kill, getpriority, setpriority, prlimit, setrlimit, setpgid, setsid, sched_getscheduler, sched_setscheduler, sched_get_priority_min, sched_get_priority_max, sched_get_param, sched_set_param.
thread: allows clone, futex, and permits PROT_EXEC in mprotect.
id: allows setuid, setreuid, setresuid, setgid, setregid, setresgid, setgroups, prlimit, setrlimit, getpriority, setpriority, setfsuid, setfsgid.
exec: Allows execve, execveat. If the executable in question needs a loader, then you'll need rpath and prot_exec too. However that's not needed if you assimilate your APE binary beforehand, because security is strongest for static binaries; use the --assimilate flag or assimilate.com program.
tmppath: Allows unlink, unlinkat, and lstat. When this promise is used, certain paths will be automatically unveiled too, e.g. /tmp.
vminfo: OpenBSD intended this promise to be used by tools like `htop`. Using this causes paths such as /proc/stat to be automatically unveiled.

Funding

Funding for the development of pledge() on Linux was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like Cosmopolitan Libc possible. Thank you.