---
layout: post
slug: lxc-drop-capsysadmin
status: published
sitemap: true
title: Drop CAP_SYS_ADMIN in LXC
category: developpement
tags:
- security
- linux
- containers
---

Hardening Linux containers, and [LXC containers](https://linuxcontainers.org/fr/lxc/introduction/) in particular, is needed to prevent a malicious user from escaping your container. However, even a hardened container can't be considered totally safe today, so don't rely solely on this article for your security! Instead, you should consider it as part of your [defense in depth strategy](https://en.wikipedia.org/wiki/Defense_in_depth_(computing)).

To understand how a container can be hardened, we must understand how containers work under the hood. Jessie Frazelle describes the concepts brilliantly in her blog post [Setting the Record Straight: containers vs. Zones vs. Jails vs. VMs](https://blog.jessfraz.com/post/containers-zones-jails-vms/). The critical point is that containers in Linux are not a top-level design like Zones in Solaris and Jails in BSD:

> A "container" is just a term people use to describe a combination of Linux [capabilities,] namespaces and cgroups. Linux [capabilities,] namespaces and cgroups ARE first class objects. NOT containers.

In this article, we will focus on one first class object, capabilities, in a specific context, an LXC container. The challenge when hardening an LXC container, compared to other solutions, is that you will very probably run systemd in your container. Because systemd is a powerful init system, it also expects many permissions: we will see how to start it despite our capability hardening.

If you feel a bit lost with containers, a good starting point is this whitepaper by NCC Group: [Understanding and Hardening Linux Containers](https://www.nccgroup.trust/us/our-research/understanding-and-hardening-linux-containers/).
This post is also inspired by the article written by Christian Seiler, [LXC containers without CAP\_SYS\_ADMIN under Debian Jessie](https://blog.iwakd.de/lxc-cap_sys_admin-jessie), but we'll see that, due to changes in the Linux kernel, the proposed configuration no longer works out of the box.

## Creating a standard LXC container

Before starting, you'll need at least lxc-2.0.9. In any case, compiling LXC is quite straightforward; here is a quick reminder:

```bash
# Fetch the sources
git clone https://github.com/lxc/lxc
cd lxc

# Generate the build system, then build and install
./autogen.sh
./configure
make -j8
sudo make install
```

Now let's create a basic container (we'll use Fedora, but the instructions should work for every distribution):

```bash
sudo lxc-create -n harden -t fedora
```

As you'll need to debug the launch of your container, I recommend this command line:

```bash
sudo lxc-start -n harden -lDEBUG -F
```

This command launches your container in the foreground (so you'll be able to see the systemd logs at boot) and logs a lot of useful information in the `/var/log/lxc/harden.log` file.

## Capabilities: split the root

Historically, there is a huge difference between the root user (with uid 0), who bypasses every access control, and the other users of the system, who are subject to all of them. So, if you want to send an ICMP request via the `ping` command for example, the command must run as root (with the magic of [setuid](https://en.wikipedia.org/wiki/Setuid) to let unprivileged users launch it). As the command runs as root for everyone, ping can load a kernel module, change the time on your system, erase every file, etc. That's dangerous, particularly if someone finds a vulnerability in your command and uses it to perform a [privilege escalation](https://en.wikipedia.org/wiki/Privilege_escalation). A good idea would be to only allow the ping command to execute network-related actions as root, not everything.
You can do that with capabilities, by giving the `CAP_NET_RAW` capability to your ping command.

But capabilities, and more precisely the **capability bounding set**, can also be used to reduce the capabilities that any process of your container can acquire. Indeed, if you allow a process in your container to load kernel modules, what prevents it from loading a faulty module that lets the attacker escape the container? So, one way to prevent this catastrophic scenario is to drop `CAP_SYS_MODULE` from the capability bounding set. When you use `lxc.cap.keep` and `lxc.cap.drop`, you're modifying the capability bounding set of your container.

Let's start by displaying your current **capability bounding set**:

```bash
capsh --print
```

Among all the existing capabilities, one is a bit special: `CAP_SYS_ADMIN`. It is considered by some as ["the new root"](https://lwn.net/Articles/486306/) because of its large and loosely defined scope. This capability is also very useful because it is needed to mount filesystems from the container. Unfortunately, it also enables interactions with critical kernel APIs like ioctl, IPC resources, namespaces, etc.

Considering the power of this capability, we want to drop it in our container. But can we simply do it?

```ini
# /var/lib/harden/config
lxc.cap.drop = sys_admin
```

Now try to restart your container... and enjoy the crash:

```raw
Failed to mount tmpfs at /dev/shm: Operation not permitted
Failed to mount tmpfs at /run: Operation not permitted
Failed to mount tmpfs at /sys/fs/cgroup: Operation not permitted
Failed to mount cgroup at /sys/fs/cgroup/systemd: No such file or directory
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.
```

It looks like the only solution is to mount these filesystems manually before systemd runs. The operation will be slightly different from what Christian Seiler wrote, as our kernel supports the cgroup namespace.
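
To find out whether your own kernel is in the same situation, one option is to look for `/proc/self/ns/cgroup`, the handle the kernel exposes for each process when cgroup namespaces are available. The following is only a sketch: the function name `cgns_supported` is borrowed from the LXC helper quoted in this article, and the optional path argument is an addition made here for testing, not something LXC provides.

```bash
#!/bin/bash
# Sketch: detect cgroup namespace support by checking for the namespace
# handle under /proc. The optional first argument (an assumption of this
# sketch) overrides the path to check.
cgns_supported() {
    local ns_file="${1:-/proc/self/ns/cgroup}"
    [ -e "$ns_file" ]
}

if cgns_supported; then
    echo "cgroup namespace: supported"
else
    echo "cgroup namespace: not supported"
fi
```

If the file exists, the kernel supports the cgroup namespace and the cgroup setup described in older guides no longer applies as-is.
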
Indeed, the following directive will do nothing:

```ini
# /var/lib/harden/config
lxc.mount.auto = cgroup:mixed
```

The reason is in the code:

```c
/*
 * /src/lxc/cgroups/cgfsng.c
 * /src/lxc/cgroups/cgfs.c
 */
static bool cgfsng_mount(void *hdata, const char *root, int type)
{
	/* some initializations */

	/* with cgroup namespace support, LXC mounts nothing itself */
	if (cgns_supported())
		return true;

	/* rest of the function */
}
```

The developers added this condition because, with the cgroup namespace, we can safely mount the cgroup hierarchy like any other filesystem in our LXC configuration file:

```ini
# /var/lib/harden/config
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.mount.entry = tmpfs sys/fs/cgroup tmpfs rw,nosuid,nodev,create=dir 0 0
```

But to mount our cgroup hierarchy (we only need one, for systemd), we first need to create the mount point, so we can't simply add the following line:

```ini
# /var/lib/harden/config
lxc.mount.entry = cgroup sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
```

Instead, the only solution I found was to create a (simple) [LXC mount hook](https://linuxcontainers.org/lxc/manpages//man5/lxc.container.conf.5.html#lbBC):

```bash
#!/bin/bash
# /usr/local/bin/mount-cgroup on the host
# LXC sets LXC_ROOTFS_MOUNT to the path where the container rootfs is mounted.
set -e

# Create the mount point, then mount the systemd named cgroup hierarchy on it.
mkdir -p "$LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd"
mount cgroup "$LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd" \
  -t cgroup \
  -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
```

Now, after making the script executable, we call it from our configuration:

```ini
# /var/lib/harden/config
lxc.hook.mount = /usr/local/bin/mount-cgroup
```

And finally your container is working!

But one more thing: instead of maintaining a capabilities blacklist, can we create a more secure whitelist? The answer is yes:

```ini
lxc.cap.keep =
lxc.cap.keep = chown ipc_lock ipc_owner kill net_admin net_bind_service
```

If you want to dig further, you can find the whole capability list in the dedicated man page [capabilities(7)](http://man7.org/linux/man-pages/man7/capabilities.7.html) and how to use them with LXC in the LXC man page [lxc.container.conf(5)](https://linuxcontainers.org/fr/lxc/manpages//man5/lxc.container.conf.5.html#lbAV).

Have fun!
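
One last check, if `capsh` is not installed in your container: the bounding set is also visible as the hexadecimal `CapBnd` field of `/proc/self/status`, and `CAP_SYS_ADMIN` is bit 21 in `linux/capability.h`. The sketch below tests that bit; `has_cap` is a hypothetical helper name chosen for this example, not an LXC or libcap command.

```bash
#!/bin/bash
# Sketch: test whether a capability bit is set in a hexadecimal mask such as
# the CapBnd field of /proc/<pid>/status.
# has_cap is a hypothetical helper, not part of LXC or libcap.
has_cap() {
    local mask=$((16#$1))   # hexadecimal mask -> integer
    local bit=$2            # capability number
    (( (mask >> bit) & 1 ))
}

CAP_SYS_ADMIN=21   # value defined in linux/capability.h

# Bounding set of the current process (run this inside the container).
cap_bnd=$(awk '/^CapBnd:/ {print $2}' /proc/self/status)

if [ -z "$cap_bnd" ]; then
    echo "CapBnd not found in /proc/self/status"
elif has_cap "$cap_bnd" "$CAP_SYS_ADMIN"; then
    echo "CAP_SYS_ADMIN is still in the bounding set"
else
    echo "CAP_SYS_ADMIN is not in the bounding set"
fi
```

Run inside the hardened container, this should report that the capability is gone; on the host, as root, it should normally still be present.
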