All the mail mirrored from lore.kernel.org
* How we use cgroups in rkt
@ 2015-06-17 11:09 Iago López Galeiras
       [not found] ` <55815556.4030304-973cpzSjLbNWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Iago López Galeiras @ 2015-06-17 11:09 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hi everyone,

We are working on rkt[1] and we would like feedback on the way we use cgroups
to implement isolation in containers. rkt uses systemd-nspawn internally, so
the best way to start is probably to explain how cgroups are handled in
systemd-nspawn.

The approach taken by nspawn is to mount the cgroup controllers read-only
inside the container, except for the container's own subtree of the systemd
controller, which stays read-write. It is done this way because allowing the
container to modify the other controllers is considered unsafe[2].

This is what the bind mounts look like:

/sys/fs/cgroup/devices RO
[...]
/sys/fs/cgroup/memory RO
/sys/fs/cgroup/systemd RO
/sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
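
To illustrate, mounts like these could be set up roughly as follows with plain
mount(8); this is only a sketch, not the actual nspawn code, and $ROOT stands
for the container's root directory:

# controllers are bind-mounted and then remounted read-only
mount --bind /sys/fs/cgroup/memory "$ROOT/sys/fs/cgroup/memory"
mount -o remount,ro,bind "$ROOT/sys/fs/cgroup/memory"

mount --bind /sys/fs/cgroup/systemd "$ROOT/sys/fs/cgroup/systemd"
mount -o remount,ro,bind "$ROOT/sys/fs/cgroup/systemd"

# ...except the container's own scope, which stays read-write
mount --bind /sys/fs/cgroup/systemd/machine.slice/machine-a.scope \
      "$ROOT/sys/fs/cgroup/systemd/machine.slice/machine-a.scope"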

In rkt we have a concept called a pod[3], which is a list of apps that run
inside a single container, each in its own chroot. To implement this, we start
a systemd-nspawn container with a minimal systemd installation that starts
each app as a service.

We want to be able to apply different restrictions to each app in a pod using
cgroups, and the most straightforward way we could think of was to delegate
that to the systemd instance inside the container. Initially this didn't work
because, as mentioned earlier, the cgroup controllers are mounted read-only.

The way we solved this problem was to set up the cgroup hierarchy (with the
directories expected by systemd) from outside the container. The difference
from systemd-nspawn’s approach is that we don’t mount everything read-only;
instead, we leave the knobs we need in each application’s subcgroup
read-write.

For example, if we want to restrict the memory usage of an application, we leave
/sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}
read-write so that systemd inside the container can set the appropriate limits,
while the rest of /sys/fs/cgroup/memory/ stays read-only.

We know this doesn’t provide perfect isolation, but we assume non-malicious
applications. We also know we’ll have to rework this when systemd starts using
the unified hierarchy.

What do you think about our approach?

Cheers.

[1]: https://github.com/coreos/rkt
[2]: http://lists.freedesktop.org/archives/systemd-devel/2015-April/031191.html
[3]: https://github.com/appc/spec/blob/master/spec/pods.md

-- 

Iago López Galeiras
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: How we use cgroups in rkt
       [not found] ` <55815556.4030304-973cpzSjLbNWk0Htik3J/w@public.gmane.org>
@ 2015-06-17 20:30   ` Serge Hallyn
  2015-06-18  8:57     ` Alban Crequy
  2015-06-17 20:30   ` Serge Hallyn
  1 sibling, 1 reply; 6+ messages in thread
From: Serge Hallyn @ 2015-06-17 20:30 UTC (permalink / raw)
  To: Iago López Galeiras
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Iago López Galeiras (iago-973cpzSjLbNWk0Htik3J/w@public.gmane.org):
> Hi everyone,
> 
> We are working on rkt[1] and we want to ask for feedback about the way we use
> cgroups to implement isolation in containers. rkt uses systemd-nspawn internally
> so I guess the best way to start is explaining how this is handled in
> systemd-nspawn.
> 
> The approach taken by nspawn is mounting the cgroup controllers read-only inside
> the container except the part that corresponds to it inside the systemd
> controller. It is done this way because allowing the container to modify the
> other controllers is considered unsafe[2].
> 
> This is how bind mounts look like:
> 
> /sys/fs/cgroup/devices RO
> [...]
> /sys/fs/cgroup/memory RO
> /sys/fs/cgroup/systemd RO
> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
> 
> In rkt we have a concept called pod[3] which is a list of apps that run inside a
> container, each running in its own chroot. To implement this concept, we start a
> systemd-nspawn container with a minimal systemd installation that starts each
> app as a service.
> 
> We want to be able to apply different restrictions to each app of a pod using
> cgroups and the straightforward way we thought was delegating to systemd inside
> the container. Initially, this didn't work because, as mentioned earlier, the
> cgroup controllers are mounted read-only.
> 
> The way we solved this problem was mounting the cgroup hierarchy (with the
> directories expected by systemd) outside the container. The difference with
> systemd-nspawn’s approach is that we don’t mount everything read-only; instead,
> we leave the knobs we need in each of the application’s subcgroups read-write.
> 
> For example, if we want to restrict the memory usage of an application we leave
> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}

Who exactly does the writing to those files?  Do the applications want to
change them, or only rkt itself?  If it's rkt, then it seems like you should be
able to use a systemd API to update the values (over D-Bus), right?
systemctl set-property machine-a.scope MemoryLimit=1G, or something like that.
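
At the D-Bus level that would be roughly the following (if I remember the
SetUnitProperties signature right; 1073741824 is 1G in bytes):

busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager SetUnitProperties \
    "sba(sv)" machine-a.scope true 1 MemoryLimit t 1073741824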

Now, I'm pretty sure that systemd doesn't yet support doing this from inside
the container in a delegated way.  That was cgmanager's reason for being, and
I'm interested in working on a proper API for that in systemd.

> read-write so systemd inside the container can set the appropriate restrictions
> but the rest of /sys/fs/cgroup/memory/ is still read-only.
> 
> We know this doesn’t provide perfect isolation but we assume non-malicious
> applications. We also know we’ll have to rework this when systemd starts using
> the unified hierarchy.
> 
> What do you think about our approach?
> 
> Cheers.
> 
> [1]: https://github.com/coreos/rkt
> [2]: http://lists.freedesktop.org/archives/systemd-devel/2015-April/031191.html
> [3]: https://github.com/appc/spec/blob/master/spec/pods.md
> 
> -- 
> 
> Iago López Galeiras

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: How we use cgroups in rkt
  2015-06-17 20:30   ` Serge Hallyn
@ 2015-06-18  8:57     ` Alban Crequy
       [not found]       ` <CALdWxcsxWCyLyH9H+BrAYW4gY-oAyNrp_Rf726tFtbHPy2_M9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Alban Crequy @ 2015-06-18  8:57 UTC (permalink / raw)
  To: Serge Hallyn; +Cc: Linux Containers, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 17, 2015 at 10:30 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Iago López Galeiras (iago@endocode.com):
>> Hi everyone,
>>
>> We are working on rkt[1] and we want to ask for feedback about the way we use
>> cgroups to implement isolation in containers. rkt uses systemd-nspawn internally
>> so I guess the best way to start is explaining how this is handled in
>> systemd-nspawn.
>>
>> The approach taken by nspawn is mounting the cgroup controllers read-only inside
>> the container except the part that corresponds to it inside the systemd
>> controller. It is done this way because allowing the container to modify the
>> other controllers is considered unsafe[2].
>>
>> This is how bind mounts look like:
>>
>> /sys/fs/cgroup/devices RO
>> [...]
>> /sys/fs/cgroup/memory RO
>> /sys/fs/cgroup/systemd RO
>> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
>>
>> In rkt we have a concept called pod[3] which is a list of apps that run inside a
>> container, each running in its own chroot. To implement this concept, we start a
>> systemd-nspawn container with a minimal systemd installation that starts each
>> app as a service.
>>
>> We want to be able to apply different restrictions to each app of a pod using
>> cgroups and the straightforward way we thought was delegating to systemd inside
>> the container. Initially, this didn't work because, as mentioned earlier, the
>> cgroup controllers are mounted read-only.
>>
>> The way we solved this problem was mounting the cgroup hierarchy (with the
>> directories expected by systemd) outside the container. The difference with
>> systemd-nspawn’s approach is that we don’t mount everything read-only; instead,
>> we leave the knobs we need in each of the application’s subcgroups read-write.
>>
>> For example, if we want to restrict the memory usage of an application we leave
>> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}
>
> Who exactly does the writing to those files?

First, rkt prepares a systemd .service file for each application in the
container, with "CPUQuota=" and "MemoryLimit=" set. These .service files are
not used by the systemd instance outside the container. Then, rkt uses
systemd-nspawn to start systemd as pid 1 in the container. Finally, the
systemd inside the container writes to the cgroup files
{memory.limit_in_bytes,cgroup.procs}.
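
A stripped-down sketch of such a per-app unit (the app name and ExecStart here
are made up, and the real generated units carry more settings):

[Unit]
Description=app sha512-xxxx

[Service]
ExecStart=/usr/bin/myapp
MemoryLimit=512M
CPUQuota=50%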

We call those limits the "per-app isolators". They are not a security
boundary because all the apps run in the same container (in the same
pid/mount/net namespaces). The apps run in different chroots, but that's easy
to escape.

> Do the applications want to change them, or only rkt itself?

At the moment, the limits are statically defined in the app container image,
so neither rkt nor the apps inside the container change them. I don't know of
a use case where we would need to change them dynamically.

>  If rkt, then it seems like you should be
> able to use a systemd api to update the values (over dbus), right?
> systemctl set-property machine-a-scope MemoryLimit=1G or something.

In addition to the "per-app isolators" described above, rkt can have
"pod-level isolators" that are applied to the machine slice (the cgroup parent
directory) rather than at the leaves of the cgroup tree. They are defined when
rkt itself is started by a systemd .service file, and applied by systemd
outside of the container. For example:

[Service]
CPUShares=512
MemoryLimit=1G
ExecStart=/usr/bin/rkt run myapp.com/myapp-1.3.4

Updating the pod-level isolators with systemctl on the host should work.
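
For example, assuming the pod was started from a unit called mypod.service
(made-up name):

systemctl set-property mypod.service MemoryLimit=2G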

But neither systemd inside the container nor the apps have access to the
required cgroup knob files: they are mounted read-only.

> Now I'm pretty sure that systemd doesn't yet support being able to do
> this from inside the container in a delegated way.

Indeed, by default nspawn/systemd does not support delegating that. It only
works in our case because rkt prepared the cgroup bind mounts for the
container.

> That was cgmanager's
> reason for being, and I'm interested in working on a proper API for that
> for systemd.

Do you mean patching systemd so it does not write to the cgroup
filesystem directly but talks to the cgmanager/cgproxy socket instead?

>> read-write so systemd inside the container can set the appropriate restrictions
>> but the rest of /sys/fs/cgroup/memory/ is still read-only.
>>
>> We know this doesn’t provide perfect isolation but we assume non-malicious
>> applications. We also know we’ll have to rework this when systemd starts using
>> the unified hierarchy.
>>
>> What do you think about our approach?
>>
>> Cheers.
>>
>> [1]: https://github.com/coreos/rkt
>> [2]: http://lists.freedesktop.org/archives/systemd-devel/2015-April/031191.html
>> [3]: https://github.com/appc/spec/blob/master/spec/pods.md
>>
>> --
>>
>> Iago López Galeiras

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: How we use cgroups in rkt
       [not found]       ` <CALdWxcsxWCyLyH9H+BrAYW4gY-oAyNrp_Rf726tFtbHPy2_M9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-18 14:40         ` Serge Hallyn
  2015-06-18 14:40         ` Serge Hallyn
  1 sibling, 0 replies; 6+ messages in thread
From: Serge Hallyn @ 2015-06-18 14:40 UTC (permalink / raw)
  To: Alban Crequy; +Cc: Linux Containers, cgroups-u79uwXL29TY76Z2rM5mHXA

Quoting Alban Crequy (alban@endocode.com):
> On Wed, Jun 17, 2015 at 10:30 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> > Quoting Iago López Galeiras (iago@endocode.com):
> >> Hi everyone,
> >>
> >> We are working on rkt[1] and we want to ask for feedback about the way we use
> >> cgroups to implement isolation in containers. rkt uses systemd-nspawn internally
> >> so I guess the best way to start is explaining how this is handled in
> >> systemd-nspawn.
> >>
> >> The approach taken by nspawn is mounting the cgroup controllers read-only inside
> >> the container except the part that corresponds to it inside the systemd
> >> controller. It is done this way because allowing the container to modify the
> >> other controllers is considered unsafe[2].
> >>
> >> This is how bind mounts look like:
> >>
> >> /sys/fs/cgroup/devices RO
> >> [...]
> >> /sys/fs/cgroup/memory RO
> >> /sys/fs/cgroup/systemd RO
> >> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
> >>
> >> In rkt we have a concept called pod[3] which is a list of apps that run inside a
> >> container, each running in its own chroot. To implement this concept, we start a
> >> systemd-nspawn container with a minimal systemd installation that starts each
> >> app as a service.
> >>
> >> We want to be able to apply different restrictions to each app of a pod using
> >> cgroups and the straightforward way we thought was delegating to systemd inside
> >> the container. Initially, this didn't work because, as mentioned earlier, the
> >> cgroup controllers are mounted read-only.
> >>
> >> The way we solved this problem was mounting the cgroup hierarchy (with the
> >> directories expected by systemd) outside the container. The difference with
> >> systemd-nspawn’s approach is that we don’t mount everything read-only; instead,
> >> we leave the knobs we need in each of the application’s subcgroups read-write.
> >>
> >> For example, if we want to restrict the memory usage of an application we leave
> >> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}
> >
> > Who exactly does the writing to those files?
> 
> First, rkt prepares systemd a .service file for each application in
> the container with "CPUQuota=" and "MemoryLimit=". The .service files
> are not used by systemd outside the container. Then, rkt uses
> systemd-nspawn to start systemd as pid 1 in the container. Finally,
> systemd inside the container writes to the cgroup files
> {memory.limit_in_bytes,cgroup.procs}.
> 
> We call those limits the "per-app isolators". It's not a security
> boundary because all the apps run in the same container (in the same
> pid/mount/net namespaces). The apps run in different chroots, but
> that's easily escapable.
> 
> > Do the applications want to change them, or only rkt itself?
> 
> At the moment, the limits are statically defined in the app container
> image, so neither rkt or the apps inside the container change them. I
> don't know of a use case where we would need to change them
> dynamically.
> 
> >  If rkt, then it seems like you should be
> > able to use a systemd api to update the values (over dbus), right?
> > systemctl set-property machine-a-scope MemoryLimit=1G or something.
> 
> In addition to the "per-app isolators" described above, rkt can have
> "pod-level isolators" that are applied on the machine slice (the
> cgroup parent directory) rather than at the leaves of the cgroup tree.
> They are defined when rkt itself is started by a systemd .service
> file, and applied by systemd outside of the container. E.g.
> 
> [Service]
> CPUShares=512
> MemoryLimit=1G
> ExecStart=/usr/bin/rkt run myapp.com/myapp-1.3.4
> 
> Updating the pod-level isolators with systemctl on the host should work.
> 
> But systemd inside the container or the apps don't have access to the
> required cgroup knob files: they are mounted read-only.
> 
> > Now I'm pretty sure that systemd doesn't yet support being able to do
> > this from inside the container in a delegated way.
> 
> Indeed by default nspawn/systemd does not support delegating that. It
> only works because rkt prepared the cgroup bind mounts for the
> container.
> 
> > That was cgmanager's
> > reason for being, and I'm interested in working on a proper API for that
> > for systemd.
> 
> Do you mean patching systemd so it does not write to the cgroup
> filesystem directly but talk to the cgmanager/cgproxy socket instead?

More likely, patch it so it can talk to a systemd-owned unix socket
bind-mounted into the container.  So systemd would need to be patched
at both ends.  But that's something I was hoping would happen upstream
anyway.  I would be very happy to help out in that effort.

-serge

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-06-18 14:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-17 11:09 How we use cgroups in rkt Iago López Galeiras
     [not found] ` <55815556.4030304-973cpzSjLbNWk0Htik3J/w@public.gmane.org>
2015-06-17 20:30   ` Serge Hallyn
2015-06-18  8:57     ` Alban Crequy
     [not found]       ` <CALdWxcsxWCyLyH9H+BrAYW4gY-oAyNrp_Rf726tFtbHPy2_M9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-18 14:40         ` Serge Hallyn
