int capabilities()
{
	fprintf(stderr, "=> dropping capabilities...");
CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system
of the kernel (i.e. functions like audit_set_enabled, usually used
with auditctl). The kernel prevents messages that normally require
CAP_AUDIT_CONTROL outside of the first pid namespace, but it does
allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE
from any namespace.[12] So let's drop them all. We especially want to
drop CAP_AUDIT_READ, since it isn't namespaced[13] and the audit log
may contain important information, but CAP_AUDIT_WRITE may also allow
the contained process to falsify logs or DoS the audit system.
int drop_caps[] = { CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_AUDIT_WRITE,
CAP_BLOCK_SUSPEND lets programs prevent the system from suspending,
either with EPOLLWAKEUP or /proc/sys/wake_lock.[14] Suspend isn't
namespaced, so we'd like to prevent this.
CAP_BLOCK_SUSPEND,
CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an
arbitrary struct file_handle *. struct file_handle is in theory an
opaque type, but in practice it corresponds to inode numbers. So it's
easy to brute-force them, and read arbitrary files. This was used by
Sebastian Krahmer to write a program to read arbitrary system files
from within Docker in 2014.[15]
CAP_DAC_READ_SEARCH,
CAP_FSETID, without user namespacing, allows the process to modify a
setuid executable without removing the setuid bit. This is pretty
dangerous! It means that if we include a setuid binary in a container,
it's easy for us to accidentally leave a dangerous setuid root binary
on our disk, which any user can use to escalate privileges.[16]
CAP_FSETID,
CAP_IPC_LOCK can be used to lock more of a process' own memory than
would normally be allowed[17], which could be a way to deny service.
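The amount that "would normally be allowed" is the RLIMIT_MEMLOCK resource limit: without CAP_IPC_LOCK, an mlock call that would exceed the soft limit fails. As a minimal sketch (the helper name memlock_limits is hypothetical and not part of the container code), a process can look up the limit it will be held to once the capability is gone:

```c
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft and hard RLIMIT_MEMLOCK values,
 * in bytes. Returns 0 on success, -1 (with errno set) on failure.
 * Either value may be RLIM_INFINITY. */
int memlock_limits(rlim_t *soft, rlim_t *hard)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
		return -1;
	*soft = rl.rlim_cur;
	*hard = rl.rlim_max;
	return 0;
}
```

Calling memlock_limits(&soft, &hard) inside the container after the drop should report the limits inherited from the parent; this sketch only reads the limit and does not attempt to lock any memory.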
CAP_IPC_LOCK,
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access
control systems AppArmor, SELinux, and SMACK to restrict access to
their settings. These aren't namespaced, so they could be used by the
contained programs to circumvent system-wide access control.
CAP_MAC_ADMIN, CAP_MAC_OVERRIDE,
CAP_MKNOD, without user namespacing, allows programs to create
device files corresponding to real-world devices. This includes
creating new device files for existing hardware. If this capability
were not dropped, a contained process could re-create the hard disk
device, remount it, and read or write to it.[18]
CAP_MKNOD,
I was worried that CAP_SETFCAP could be used to add a capability to
an executable and execve it, but it's not actually possible for a
process to set capabilities it doesn't have[19]. But! An executable
altered this way could be executed by any unsandboxed user, so I think
it unacceptably undermines the security of the system.
CAP_SETFCAP,
CAP_SYSLOG lets users perform destructive actions against the
syslog. Importantly, dropping it doesn't prevent contained processes
from reading the syslog, which could be risky. The capability also
exposes kernel addresses, which could be used to circumvent kernel
address space layout randomization[20].
CAP_SYSLOG,
CAP_SYS_ADMIN allows many behaviors! We don't want most of them
(mount, vm86, etc). Some would be nice to have (sethostname, mount
for bind mounts…) but the extra complexity doesn't seem worth it.
CAP_SYS_ADMIN,
CAP_SYS_BOOT allows programs to restart the system (the reboot
syscall) and load new kernels (the kexec_load and kexec_file_load
syscalls)[21]. We absolutely don't want this. reboot is
user-namespaced, and the kexec* functions only work in the root user
namespace, but neither of those helps us.
CAP_SYS_BOOT,
CAP_SYS_MODULE is used by the syscalls delete_module, init_module,
and finit_module[22], by the code for kmod[23], and by the code for
loading device modules with ioctl[24].
CAP_SYS_MODULE,
CAP_SYS_NICE allows processes to set higher priority on given pids
than the default[25]. The default kernel scheduler doesn't know
anything about pid namespaces, so it's possible for a contained
process to deny service to the rest of the system[26].
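One way to see the effect of dropping this capability is to ask for a higher priority (a lower nice value) and look at the result. The following is a rough sketch, not part of the container code: try_raise_priority is a hypothetical helper, and it assumes the default RLIMIT_NICE of 0, under which an unprivileged process can't lower its nice value at all.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: try to raise the calling process's priority by
 * one nice level. Returns 0 if the kernel allowed it, otherwise the
 * errno value from the failing call (EACCES is what Linux documents
 * for a caller lacking CAP_SYS_NICE). */
int try_raise_priority(void)
{
	/* getpriority can legitimately return -1, so errno has to be
	 * cleared first to tell an error from a nice value of -1. */
	errno = 0;
	int current = getpriority(PRIO_PROCESS, 0);
	if (current == -1 && errno != 0)
		return errno;

	if (setpriority(PRIO_PROCESS, 0, current - 1) == -1)
		return errno;
	return 0;
}
```

Note that on success this really does change the caller's nice value, so it's only suitable for a throwaway test process.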
CAP_SYS_NICE,
CAP_SYS_RAWIO allows full access to the host system's memory with
/proc/kcore, /dev/mem, and /dev/kmem[27], but a contained process
would need mknod to access these within the namespace[28]. But it
also allows things like iopl and ioperm, which give raw access to the
IO ports[29].
CAP_SYS_RAWIO,
CAP_SYS_RESOURCE specifically allows circumventing kernel-wide
limits, so we probably should drop it[30]. But I don't think this can
do more than DoS the kernel, in general[31].
CAP_SYS_RESOURCE,
CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent
contained processes from altering the system-wide time[32].
CAP_SYS_TIME,
CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process
interfere with suspend[33], and we'd like to prevent that.
CAP_WAKE_ALARM };
	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
	fprintf(stderr, "bounding...");
	for (size_t i = 0; i < num_caps; i++) {
		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
			fprintf(stderr, "prctl failed: %m\n");
			return 1;
		}
	}
	fprintf(stderr, "inheritable...");
	cap_t caps = NULL;
	if (!(caps = cap_get_proc())
	    || cap_set_flag(caps, CAP_INHERITABLE, num_caps, drop_caps, CAP_CLEAR)
	    || cap_set_proc(caps)) {
		fprintf(stderr, "failed: %m\n");
		if (caps) cap_free(caps);
		return 1;
	}
	cap_free(caps);
	fprintf(stderr, "done.\n");
	return 0;
}
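To check that the bounding-set half of this actually took effect, the same prctl interface can be queried with PR_CAPBSET_READ. This is a minimal sketch under the assumption that it runs in the contained process after capabilities() returns; in_bounding_set and count_remaining are hypothetical helper names, not part of the container code.

```c
#include <stddef.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Hypothetical helper: 1 if `cap` is still in the calling thread's
 * bounding set, 0 if it has been dropped, -1 (errno set) on error,
 * e.g. when `cap` is not a valid capability number. */
int in_bounding_set(int cap)
{
	return prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: count how many of the given capabilities are
 * still present in the bounding set. Returns -1 if any lookup fails. */
int count_remaining(const int *caps, size_t n)
{
	int remaining = 0;

	for (size_t i = 0; i < n; i++) {
		int present = in_bounding_set(caps[i]);
		if (present < 0)
			return -1;
		remaining += present;
	}
	return remaining;
}
```

Passing the drop_caps array and num_caps to count_remaining should yield 0 if every drop succeeded. This only covers the bounding set; the inheritable set would need a separate check with cap_get_proc and cap_get_flag.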