Re: How we use cgroups in rkt

From: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
To: "Iago López Galeiras" <iago-973cpzSjLbNWk0Htik3J/w@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Subject: Re: How we use cgroups in rkt
Date: Wed, 17 Jun 2015 20:30:24 +0000
Message-ID: <20150617203024.GI10949@ubuntumail>
In-Reply-To: <55815556.4030304-973cpzSjLbNWk0Htik3J/w@public.gmane.org>

Quoting Iago López Galeiras (iago-973cpzSjLbNWk0Htik3J/w@public.gmane.org):
> Hi everyone,
> 
> We are working on rkt[1] and we want to ask for feedback about the way we use
> cgroups to implement isolation in containers. rkt uses systemd-nspawn internally
> so I guess the best way to start is explaining how this is handled in
> systemd-nspawn.
> 
> The approach taken by nspawn is mounting the cgroup controllers read-only inside
> the container except the part that corresponds to it inside the systemd
> controller. It is done this way because allowing the container to modify the
> other controllers is considered unsafe[2].
> 
> This is what the bind mounts look like:
> 
> /sys/fs/cgroup/devices RO
> [...]
> /sys/fs/cgroup/memory RO
> /sys/fs/cgroup/systemd RO
> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
> 
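
As a rough illustration of the layout quoted above (a sketch only, not
nspawn's actual code; the function name and its parameters are made up
for this example), the mount plan can be written down as data:

```python
from typing import List, Tuple


def nspawn_style_mounts(controllers: List[str], scope: str) -> List[Tuple[str, str]]:
    """Return (path, mode) pairs for the layout described above.

    Hypothetical helper: every controller hierarchy is read-only, and only
    the container's own scope under the systemd hierarchy is read-write.
    """
    root = "/sys/fs/cgroup"
    mounts = [(f"{root}/{name}", "ro") for name in controllers]
    # The one writable exception: the container's own subtree.
    mounts.append((f"{root}/systemd/{scope}", "rw"))
    return mounts


if __name__ == "__main__":
    plan = nspawn_style_mounts(
        ["devices", "memory", "systemd"],
        "machine.slice/machine-a.scope",
    )
    for path, mode in plan:
        print(path, mode.upper())
```

This only computes the plan; it does not mount anything.
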
> In rkt we have a concept called pod[3] which is a list of apps that run inside a
> container, each running in its own chroot. To implement this concept, we start a
> systemd-nspawn container with a minimal systemd installation that starts each
> app as a service.
> 
> We want to be able to apply different restrictions to each app of a pod using
> cgroups, and the most straightforward way we found was delegating to systemd inside
> the container. Initially, this didn't work because, as mentioned earlier, the
> cgroup controllers are mounted read-only.
> 
> The way we solved this problem was mounting the cgroup hierarchy (with the
> directories expected by systemd) outside the container. The difference with
> systemd-nspawn’s approach is that we don’t mount everything read-only; instead,
> we leave the knobs we need in each of the application’s subcgroups read-write.
> 
> For example, if we want to restrict the memory usage of an application we leave
> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}

Who exactly does the writing to those files?  Do the applications want to
change them, or only rkt itself?  If rkt, then it seems like you should be
able to use a systemd api to update the values (over dbus), right?
systemctl set-property machine-a.scope MemoryLimit=1G or something.
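
To make that concrete, here is a minimal sketch of how a tool on the host
might assemble such a call.  The helper name is hypothetical, the unit
name is taken from the mount listing quoted above, and I'm assuming the
cgroup-v1 MemoryLimit= property; nothing here is rkt's actual code.

```python
import shlex
from typing import List


def set_property_argv(unit: str, **props: str) -> List[str]:
    """Build the argv for `systemctl set-property UNIT KEY=VALUE...`.

    Hypothetical helper: it only constructs the command line; actually
    running it needs a host systemd and sufficient privileges.
    """
    argv = ["systemctl", "set-property", unit]
    argv += [f"{key}={value}" for key, value in props.items()]
    return argv


if __name__ == "__main__":
    argv = set_property_argv("machine-a.scope", MemoryLimit="1G")
    # Print a shell-safe rendering instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The point is that the limit goes through systemd's own interface on the
host rather than through a writable cgroup file inside the container.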

Now I'm pretty sure that systemd doesn't yet support being able to do
this from inside the container in a delegated way.  That was cgmanager's
reason for being, and I'm interested in working on a proper API for that
for systemd.

> read-write so systemd inside the container can set the appropriate restrictions
> but the rest of /sys/fs/cgroup/memory/ is still read-only.
> 
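
If I read the scheme right, it amounts to a per-file whitelist inside
each app's subcgroup.  A small sketch of that idea, with a made-up
function name and the elided path components left exactly as quoted
(this is my assumption about the shape of it, not rkt's implementation):

```python
import posixpath
from typing import Dict, Iterable


def knob_modes(app_cgroup: str, files: Iterable[str], writable: Iterable[str]) -> Dict[str, str]:
    """Map each file in an app's subcgroup to "rw" or "ro".

    Hypothetical helper: only the whitelisted knobs are writable,
    everything else in the directory stays read-only.
    """
    allowed = set(writable)
    return {
        posixpath.join(app_cgroup, name): ("rw" if name in allowed else "ro")
        for name in files
    }


if __name__ == "__main__":
    app = ("/sys/fs/cgroup/memory/machine/machine.slice/"
           "machine-rkt-xxxxx/system.slice/sha512-xxxx")
    modes = knob_modes(
        app,
        ["memory.limit_in_bytes", "cgroup.procs", "memory.usage_in_bytes"],
        ["memory.limit_in_bytes", "cgroup.procs"],
    )
    for path, mode in modes.items():
        print(mode, path)
```
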
> We know this doesn’t provide perfect isolation but we assume non-malicious
> applications. We also know we’ll have to rework this when systemd starts using
> the unified hierarchy.
> 
> What do you think about our approach?
> 
> Cheers.
> 
> [1]: https://github.com/coreos/rkt
> [2]: http://lists.freedesktop.org/archives/systemd-devel/2015-April/031191.html
> [3]: https://github.com/appc/spec/blob/master/spec/pods.md
> 
> -- 
> 
> Iago López Galeiras