quentin.dufour.io/_posts/2017-08-22-hardening-lxc-containers.md

---
layout: post
slug: hardening-lxc-containers-running-systemd
status: published
sitemap: true
title: Some LXC hardening
description: An epic trying to drop CAP\_SYS\_ADMIN
category: developpement
tags:
- security
- linux
- containers
---

Hardening Linux containers, and especially [LXC containers](https://linuxcontainers.org/fr/lxc/introduction/), is needed to prevent a malicious user from escaping your container. However, even a hardened container can't be considered totally safe today. You can consider this article as part of a [defence in depth strategy](https://en.wikipedia.org/wiki/Defense_in_depth_(computing)). But before starting, we need to understand how containers work under the hood.

As Jessie Frazelle explains in her blog post [Setting the Record Straight: containers vs. Zones vs. Jails vs. VMs](https://blog.jessfraz.com/post/containers-zones-jails-vms/), containers in Linux are not a top-level design like Zones in Solaris and Jails in BSD.

> A "container" is just a term people use to describe a combination of Linux namespaces and cgroups. Linux namespaces and cgroups ARE first class objects. NOT containers.

In this article, we'll discuss the different "primitives" exposed by the Linux kernel, like *namespaces*, *cgroups*, *Linux Security Modules*, *capabilities* and *seccomp*. A container tool like LXC or Docker is a user space binary that interacts with these primitives. We'll see that we can reach them through the LXC configuration file to improve (or worsen) the security of our LXC container.

The challenge when it comes to hardening an LXC container, compared to other solutions, is that you will very probably run systemd in your container. And systemd heavily uses the primitives mentioned above. In particular, systemd relies on *cgroups* to handle its services. Many systemd daemons also ship with a configuration that needs to interact with *capabilities*.

If you feel a bit lost with all these terms, a good starting point is this whitepaper by the NCCGroup: [Understanding and Hardening Linux Containers](https://www.nccgroup.trust/us/our-research/understanding-and-hardening-linux-containers/). This post is also inspired by the article written by Christian Seiler, [LXC containers without CAP\_SYS\_ADMIN under Debian Jessie](https://blog.iwakd.de/lxc-cap_sys_admin-jessie), but we'll see that, due to changes in the Linux kernel, the proposed configuration no longer works out of the box.

## Creating a standard LXC container

Before starting, you'll need a very recent version of LXC, at least lxc-2.0.9 (not yet released as of this writing). Fortunately, you can compile it from its master branch. We'll see later why we need such a recent version.
Here is a quick reminder on how to compile LXC:

```bash
git clone https://github.com/lxc/lxc
cd lxc
./autogen.sh
./configure
make -j8
sudo make install
```

Now let's create a basic container (we'll use Fedora, but the instructions should work for every distribution):

```bash
sudo lxc-create -n harden -t fedora
```

As you'll need to debug the launch of your container, I recommend this command line:

```bash
sudo lxc-start -n harden -lDEBUG -F
```

It will launch your container in the foreground (so you'll be able to see the systemd logs at boot) and it will log a lot of useful information in the `/var/log/lxc/harden.log` file.

## Capabilities: split the root

Historically, there is a huge difference between the root user (with uid 0), which bypasses all access controls, and the other users of the system, which must pass every control. So, if you want to send an ICMP request via the `ping` command for example, you must run the command as root (with the magic of [setuid](https://en.wikipedia.org/wiki/Setuid) to enable non-privileged users to launch it). As the command is launched as root for everyone, ping can load a kernel module, change the time on your system, erase every file, etc. That's dangerous, particularly if someone finds a vulnerability in your command and uses it for a [privilege escalation](https://en.wikipedia.org/wiki/Privilege_escalation).

A better idea is to allow the ping command to perform only the network-related privileged actions it needs, not everything root can do. You can do that with capabilities, by giving the `CAP_NET_RAW` capability to your ping command.

But capabilities, and more precisely the **capability bounding set**, can also be used to limit the capabilities that any process of your container can acquire. Indeed, if you allow a process in your container to load kernel modules, what prevents it from loading a faulty module that lets it escape the container? So, one way to prevent this catastrophic scenario is to drop `CAP_SYS_MODULE` from the capability bounding set. When you use `lxc.cap.keep` and `lxc.cap.drop`, you're modifying the capability bounding set of your container.
You can show your current **capability bounding set** with the following command:

```bash
capsh --print
```
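If `capsh` is not installed, the same information can be read directly from the kernel. The sketch below assumes the `CapBnd` field of `/proc/self/status`, which holds the bounding set as a hexadecimal mask, and the bit numbers defined in `<linux/capability.h>`. `cap_bit_set` is a helper name I made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: succeed if bit number $2 is set in the hexadecimal mask $1.
cap_bit_set() {
  local mask=$1 bit=$2
  (( (0x$mask >> bit) & 1 ))
}

# Bit numbers taken from <linux/capability.h>
CAP_SYS_MODULE=16
CAP_SYS_ADMIN=21

# CapBnd is the capability bounding set of the current process
if [ -r /proc/self/status ]; then
  bnd=$(awk '/^CapBnd:/ {print $2}' /proc/self/status)
  if cap_bit_set "$bnd" "$CAP_SYS_ADMIN"; then
    echo "CAP_SYS_ADMIN is in the bounding set"
  else
    echo "CAP_SYS_ADMIN has been dropped"
  fi
fi
```

Run inside the container, this should print the second message once `sys_admin` has been dropped from the bounding set.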

One capability is a bit special: `CAP_SYS_ADMIN` is sometimes called ["the new root"](https://lwn.net/Articles/486306/) because of its large and loosely defined scope. This capability is very useful because it allows mounting filesystems from inside the container. Unfortunately, it also covers many ioctl operations, IPC resources, namespaces, etc. So we want to drop this capability. Can we just drop it?

```ini
# /var/lib/harden/config
lxc.cap.drop = sys_admin
```

Now try to restart your container... we can't just drop the capability:

```raw
Failed to mount tmpfs at /dev/shm: Operation not permitted
Failed to mount tmpfs at /run: Operation not permitted
Failed to mount tmpfs at /sys/fs/cgroup: Operation not permitted
Failed to mount cgroup at /sys/fs/cgroup/systemd: No such file or directory
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.
```

It looks like the only solution is to mount these directories ourselves before systemd starts.
The operation will be slightly different from what Christian Seiler wrote, as our kernel supports the cgroup namespace.
Indeed, the following directive will do nothing:

```ini
# /var/lib/harden/config
lxc.mount.auto = cgroup:mixed
```

Here is why in the code:

```c
/*
 /src/lxc/cgroups/cgfsng.c
 /src/lxc/cgroups/cgfs.c
*/
static bool cgfsng_mount(void *hdata, const char *root, int type)
{
  /* some initializations */
  if (cgns_supported())
    return true;
  /* rest of the function */
}
```
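To know whether your own kernel takes this early return, you can test for cgroup namespace support from a shell. This is a minimal sketch that assumes support can be detected through the existence of `/proc/self/ns/cgroup`, which, as far as I can tell, is also what LXC checks. The optional argument is only there to make the function easy to test.

```bash
#!/bin/bash
# Sketch: succeed if the kernel exposes the cgroup namespace.
# $1: proc mount point, defaults to /proc (assumption: detection by file existence)
cgns_supported() {
  local proc=${1:-/proc}
  [[ -e $proc/self/ns/cgroup ]]
}

if cgns_supported; then
  echo "cgroup namespace supported: cgroups must be mounted by hand"
else
  echo "no cgroup namespace: LXC can mount the cgroup hierarchy itself"
fi
```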

The developers added this condition because, with the cgroup namespace, we can safely mount the cgroup hierarchy like any other filesystem in our LXC configuration file:

<pre style="white-space: pre">
# /var/lib/harden/config
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.mount.entry = tmpfs sys/fs/cgroup tmpfs rw,nosuid,nodev,create=dir 0 0
</pre>

But to mount our cgroup hierarchy (we only need one, for systemd), we need to create the mount point first, inside the tmpfs we have just mounted on `sys/fs/cgroup`. So we can't simply add the following line:

<pre style="white-space: pre">
# /var/lib/harden/config
lxc.mount.entry = cgroup sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
</pre>

Instead, the only solution I found was to create a (simple) [LXC mount hook](https://linuxcontainers.org/lxc/manpages//man5/lxc.container.conf.5.html#lbBC):

```bash
#!/bin/bash
# /usr/local/bin/mount-cgroup on the host
set -e
# LXC_ROOTFS_MOUNT is set by LXC: the path where the container rootfs is mounted
mkdir -p "$LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd"
mount cgroup "$LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd" \
  -t cgroup \
  -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
```

Now, we call this script from our configuration:

```ini
# /var/lib/harden/config
lxc.hook.mount = /usr/local/bin/mount-cgroup
```

And now, your container works!
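To double check that the hook did its job, you can inspect the mount table of the container. The sketch below reads content in the `/proc/mounts` format on its standard input; `mounted_as` is a hypothetical helper name, and I assume `lxc-attach` still works on your setup once `sys_admin` is dropped.

```bash
#!/bin/bash
# Hypothetical helper: succeed if mount point $1 has filesystem type $2.
# stdin: content in the /proc/mounts format (device, mount point, type, options...)
mounted_as() {
  local target=$1 fstype=$2 dev dir type rest
  while read -r dev dir type rest; do
    if [[ $dir == "$target" && $type == "$fstype" ]]; then
      return 0
    fi
  done
  return 1
}

# Example usage on the host (container name "harden" as in this article):
#   sudo lxc-attach -n harden -- cat /proc/mounts \
#     | mounted_as /sys/fs/cgroup/systemd cgroup && echo "systemd hierarchy mounted"
```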

But instead of maintaining a blacklist of capabilities, can we define a whitelist?

```ini
lxc.cap.keep =
lxc.cap.keep = chown ipc_lock ipc_owner kill net_admin net_bind_service
```
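The first, empty `lxc.cap.keep =` line is there to reset any list inherited from an included configuration. To see which list a configuration file ends up with, here is a rough sketch. It assumes that an empty value or the special value `none` clears what was specified before, which is how I read lxc.container.conf(5), and it ignores `lxc.include` directives. `effective_cap_keep` is a name invented for this example.

```bash
#!/bin/bash
# Sketch: print the capabilities kept by the LXC configuration file $1.
# Assumption: an empty value or "none" clears the list built so far.
effective_cap_keep() {
  local re='^[[:space:]]*lxc\.cap\.keep[[:space:]]*=(.*)$'
  local line word
  local -a words kept=()
  while IFS= read -r line || [[ -n $line ]]; do
    [[ $line =~ $re ]] || continue
    # Split the value on whitespace
    read -r -a words <<< "${BASH_REMATCH[1]}"
    if (( ${#words[@]} == 0 )); then
      kept=()
      continue
    fi
    for word in "${words[@]}"; do
      if [[ $word == none ]]; then
        kept=()
      else
        kept+=("$word")
      fi
    done
  done < "$1"
  echo "${kept[*]}"
}

# Example usage: effective_cap_keep /var/lib/harden/config
```

If `sys_admin` shows up in the output, the whitelist is not doing what we want.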

You can find the whole capability list in the dedicated man page [capabilities(7)](http://man7.org/linux/man-pages/man7/capabilities.7.html) and how to use them with LXC in the LXC man page [lxc.container.conf(5)](https://linuxcontainers.org/fr/lxc/manpages//man5/lxc.container.conf.5.html#lbAV).

## cgroups: group your processes

[Wikipedia](https://en.wikipedia.org/wiki/Cgroups) proposes the following definition:

> cgroups is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

It might not be totally clear at first read, but cgroups are two different things:

  1. A method to create groups of processes
  2. A method to apply limits, accounting, etc. to these groups

If you want to read more on this, the article [Control Groups vs. Control Groups](http://0pointer.de/blog/projects/cgroups-vs-cgroups.html) by Lennart Poettering explains how systemd uses cgroups and why the distinction is crucial.
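The grouping side is easy to observe: every process lists its groups in `/proc/<pid>/cgroup`, one line per hierarchy, in the form `hierarchy-ID:controller-list:path`. As a small sketch, the hypothetical helper `cgroup_path_for` below extracts the path of a process in a given hierarchy, for example the `name=systemd` one we mounted earlier.

```bash
#!/bin/bash
# Hypothetical helper: print the cgroup path for controller $1.
# stdin: content in the /proc/<pid>/cgroup format (id:controllers:path)
cgroup_path_for() {
  local want=$1 id controllers path
  while IFS=: read -r id controllers path; do
    # The controller list is comma separated, e.g. "cpu,cpuacct"
    case ",$controllers," in
      *",$want,"*)
        printf '%s\n' "$path"
        return 0
        ;;
    esac
  done
  return 1
}

# Example usage: cgroup_path_for name=systemd < /proc/self/cgroup
```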

## Namespaces: isolate your system resources

Michael Kerrisk wrote an interesting [series of articles about namespaces](https://lwn.net/Articles/531114/) on LWN. I find his definition of namespaces particularly interesting:

> The purpose of each namespace is to wrap a particular global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
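
You can see which instance a process uses through the symbolic links in `/proc/<pid>/ns/`, whose targets look like `pid:[4026531836]`. Two processes share a namespace when these identifiers are equal. Here is a minimal sketch; `ns_inode` and `same_namespace` are helper names I introduce for illustration.

```bash
#!/bin/bash
# Hypothetical helper: extract the identifier from a link target like "pid:[4026531836]".
ns_inode() {
  local link=$1
  link=${link#*:\[}
  printf '%s\n' "${link%\]}"
}

# Hypothetical helper: succeed if PIDs $2 and $3 share the namespace of type $1
# (pid, net, mnt, uts, ipc, user, cgroup). Returns 2 if a link cannot be read.
same_namespace() {
  local a b
  a=$(readlink "/proc/$2/ns/$1") || return 2
  b=$(readlink "/proc/$3/ns/$1") || return 2
  [[ $(ns_inode "$a") == "$(ns_inode "$b")" ]]
}

# Example usage (reading the links of another user's process may require root):
#   same_namespace pid $$ 1 && echo "same PID namespace as init"
```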