Security of Docker and Rocket

Rocket Security

rkt enforces seccomp filtering on containers in pods. It leverages systemd seccomp features to strengthen container isolation by denying unsafe system calls and privilege escalation. rkt is also integrated with SELinux, supporting SVirt in addition to a default SELinux policy.

Docker Security

AppArmor

AppArmor security profiles for Docker

AppArmor (Application Armor) is a Linux security module that protects an operating system and its applications from security threats. To use it, a system administrator associates an AppArmor security profile with each program. Docker expects to find an AppArmor policy loaded and enforced.

Docker automatically loads container profiles. The Docker binary installs a docker-default profile in the /etc/apparmor.d/docker file. This profile is used on containers, not on the Docker Daemon.

A profile for the Docker Engine Daemon exists but it is not currently installed with the deb packages. If you are interested in the source for the Daemon profile, it is located in contrib/apparmor in the Docker Engine source repository.

Understand the policies

The docker-default profile is the default for running containers. It is moderately protective while providing wide application compatibility. The profile is the following:

#include <tunables/global>


profile docker-default flags=(attach_disconnected,mediate_deleted) {

  #include <abstractions/base>


  network,
  capability,
  file,
  umount,

  deny @{PROC}/{*,**^[0-9*],sys/kernel/shm*} wkx,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/mem rwklx,
  deny @{PROC}/kmem rwklx,
  deny @{PROC}/kcore rwklx,

  deny mount,

  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  deny /sys/fs/[^c]*/** wklx,
  deny /sys/fs/c[^g]*/** wklx,
  deny /sys/fs/cg[^r]*/** wklx,
  deny /sys/firmware/efi/efivars/** rwklx,
  deny /sys/kernel/security/** rwklx,
}
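
The chained /sys deny rules in the profile above can be hard to read at a glance. The following is a minimal sketch of what they appear to do, assuming a simplified translation of AppArmor globs into regular expressions ("*" stays within one path component, "**" may span several). The names SYS_DENY_RULES and write_denied are hypothetical and are not part of AppArmor or Docker; real AppArmor matching may differ in edge cases.

```python
import re

# Hand-translated approximations of the five chained /sys deny rules.
# Assumption: "*" never crosses a "/" and "**" matches one or more characters.
SYS_DENY_RULES = [
    r"/sys/[^f/][^/]*/.+",        # deny /sys/[^f]*/** wklx,
    r"/sys/f[^s/][^/]*/.+",       # deny /sys/f[^s]*/** wklx,
    r"/sys/fs/[^c/][^/]*/.+",     # deny /sys/fs/[^c]*/** wklx,
    r"/sys/fs/c[^g/][^/]*/.+",    # deny /sys/fs/c[^g]*/** wklx,
    r"/sys/fs/cg[^r/][^/]*/.+",   # deny /sys/fs/cg[^r]*/** wklx,
]


def write_denied(path):
    """Return True if any of the approximated deny rules matches the path."""
    return any(re.fullmatch(rule, path) for rule in SYS_DENY_RULES)


for candidate in (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/fs/ext4/features/lazy_itable_init",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
):
    print(candidate, "denied" if write_denied(candidate) else "not denied")
```

Under these assumptions, each rule excludes one more character of the prefix /sys/fs/cgr, so writes are denied across most of /sys while paths under /sys/fs/cgroup are left alone.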

When you run a container, it uses the docker-default policy unless you override it with the --security-opt option. For example, the following explicitly specifies the default policy:

$ docker run --rm -it --security-opt apparmor=docker-default hello-world

Loading and Unloading Profiles

To load a new profile into AppArmor, for use with containers:

$ apparmor_parser -r -W /path/to/your_profile

Then you can run the custom profile with --security-opt like so:

$ docker run --rm -it --security-opt apparmor=your_profile hello-world

To unload a profile from AppArmor:

# stop apparmor
$ /etc/init.d/apparmor stop
# unload the profile
$ apparmor_parser -R /path/to/profile
# start apparmor
$ /etc/init.d/apparmor start

Seccomp

Seccomp security profiles for Docker

Secure computing mode (Seccomp) is a Linux kernel feature that you can use to restrict the actions available within a container. The seccomp() system call operates on the seccomp state of the calling process.

This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled. To check if your kernel supports seccomp:

$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
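
The same check can be scripted. The following is a minimal sketch, assuming the kernel config is available as plain text at /boot/config-<release>, which is not true on every distribution. The function names seccomp_enabled and read_kernel_config are hypothetical.

```python
import platform
import re


def seccomp_enabled(config_text):
    """Return True if some line of the config text is exactly CONFIG_SECCOMP=y."""
    return re.search(r"^CONFIG_SECCOMP=y$", config_text, re.MULTILINE) is not None


def read_kernel_config():
    """Read /boot/config-<release>; raises OSError where that file is absent."""
    # platform.release() reports the same value as `uname -r` on Linux.
    path = "/boot/config-" + platform.release()
    with open(path) as fh:
        return fh.read()
```

On a host that ships the config file, seccomp_enabled(read_kernel_config()) would report whether the running kernel was built with seccomp support.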

Note: seccomp profiles require seccomp 2.2.1 and are only available starting with Debian 9 “Stretch”, Ubuntu 15.10 “Wily”, and Fedora 22. To use this feature on Ubuntu 14.04, Debian Wheezy, or Debian Jessie, you must download the latest static Docker Linux binary. This feature is currently not available on other distributions.

Passing a profile for a container

The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default Docker profile has a JSON layout in the following form:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "name": "accept",
            "action": "SCMP_ACT_ALLOW",
            "args": []
        },
        {
            "name": "accept4",
            "action": "SCMP_ACT_ALLOW",
            "args": []
        },
        ...
    ]
}

When you run a container, it uses the default profile unless you override it with the --security-opt option. For example, the following explicitly specifies a custom profile:

$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json hello-world
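
A custom profile file can be generated rather than written by hand. The following is a minimal sketch that produces JSON in the layout shown earlier. The helper names build_profile and profile_json are hypothetical, and the two-syscall list is for illustration only; a real container needs far more syscalls than this to start.

```python
import json


def build_profile(allowed_syscalls, default_action="SCMP_ACT_ERRNO"):
    """Build a whitelist-style profile: listed syscalls are allowed,
    everything else falls through to default_action."""
    return {
        "defaultAction": default_action,
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [
            {"name": name, "action": "SCMP_ACT_ALLOW", "args": []}
            for name in allowed_syscalls
        ],
    }


def profile_json(profile):
    """Serialize the profile with the same 4-space indentation as the example."""
    return json.dumps(profile, indent=4)


# Illustrative only: print a profile that allows just two syscalls.
print(profile_json(build_profile(["accept", "accept4"])))
```

Saving the printed JSON to a file gives a path that can be passed to --security-opt seccomp=.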

Significant syscalls blocked by the default profile

Docker’s default seccomp profile is a whitelist which specifies the calls that are allowed. The table below lists the significant (but not all) syscalls that are effectively blocked because they are not on the whitelist. The table includes the reason each syscall is blocked rather than whitelisted.

Syscall    Description

acct    Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key    Prevent containers from using the kernel keyring, which is not namespaced.
adjtimex    Similar to clock_settime and settimeofday, time/date is not namespaced.
bpf    Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN.
clock_adjtime    Time/date is not namespaced.
clock_settime    Time/date is not namespaced.
clone    Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_NEWUSER.
create_module    Deny manipulation and functions on kernel modules.
delete_module    Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
finit_module    Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
get_kernel_syms    Deny retrieval of exported kernel and module symbols.
get_mempolicy    Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
init_module    Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
ioperm    Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
iopl    Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
kcmp    Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
kexec_file_load    Sister syscall of kexec_load that does the same thing, slightly different arguments.
kexec_load    Deny loading a new kernel for later execution.
keyctl    Prevent containers from using the kernel keyring, which is not namespaced.
lookup_dcookie    Tracing/profiling syscall, which could leak a lot of information on the host.
mbind    Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
modify_ldt    Old syscall only used in 16-bit code and a potential information leak.
mount    Deny mounting, already gated by CAP_SYS_ADMIN.
move_pages    Syscall that modifies kernel memory and NUMA settings.
name_to_handle_at    Sister syscall to open_by_handle_at. Already gated by CAP_DAC_READ_SEARCH.
nfsservctl    Deny interaction with the kernel nfs daemon.
open_by_handle_at    Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH.
perf_event_open    Tracing/profiling syscall, which could leak a lot of information on the host.
personality    Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns.
pivot_root    Deny pivot_root, should be privileged operation.
process_vm_readv    Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
process_vm_writev    Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
ptrace    Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_SYS_PTRACE.
query_module    Deny manipulation and functions on kernel modules.
quotactl    Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN.
reboot    Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT.
request_key    Prevent containers from using the kernel keyring, which is not namespaced.
set_mempolicy    Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
setns    Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN.
settimeofday    Time/date is not namespaced. Also gated by CAP_SYS_TIME.
stime    Time/date is not namespaced. Also gated by CAP_SYS_TIME.
swapon    Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
swapoff    Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
sysfs    Obsolete syscall.
_sysctl    Obsolete, replaced by /proc/sys.
umount    Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
umount2    Should be a privileged operation.
unshare    Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user.
uselib    Older syscall related to shared libraries, unused for a long time.
userfaultfd    Userspace page fault handling, largely needed for process migration.
ustat    Obsolete syscall.
vm86    In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
vm86old    In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
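
The whitelist behaviour described above can be sketched in a few lines: a syscall with no matching entry falls through to the profile's defaultAction. This is a simplified illustration that ignores the "args" conditions; the function effective_action and the SAMPLE_PROFILE dictionary are hypothetical and are not part of Docker.

```python
# A tiny stand-in for a whitelist profile, in the JSON layout used by Docker.
SAMPLE_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"name": "accept", "action": "SCMP_ACT_ALLOW", "args": []},
        {"name": "accept4", "action": "SCMP_ACT_ALLOW", "args": []},
    ],
}


def effective_action(profile, syscall):
    """Return the action for a syscall: its listed action if present,
    otherwise the profile's default (argument filters are ignored here)."""
    for rule in profile["syscalls"]:
        if rule["name"] == syscall:
            return rule["action"]
    return profile["defaultAction"]


for name in ("accept", "mount", "reboot"):
    print(name, effective_action(SAMPLE_PROFILE, name))
```

In this sketch, mount and reboot are never named in the profile, so both resolve to SCMP_ACT_ERRNO, which is what "effectively blocked" means in the table.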

Run without the default seccomp profile

You can pass unconfined to run a container without the default seccomp profile.

$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
    unshare --map-root-user --user sh -c whoami