Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@amacapital.net>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: "Stas Sergeev" <stsp2@yandex.ru>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	linux-kernel@vger.kernel.org,
	"Stefan Metzmacher" <metze@samba.org>,
	"Eric Biederman" <ebiederm@xmission.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Christian Brauner" <brauner@kernel.org>,
	"Jan Kara" <jack@suse.cz>, "Jeff Layton" <jlayton@kernel.org>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Alexander Aring" <alex.aring@gmail.com>,
	"David Laight" <David.Laight@aculab.com>,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Christian Göttsche" <cgzones@googlemail.com>
Subject: Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
Date: Mon, 6 May 2024 10:29:52 -0700	[thread overview]
Message-ID: <CALCETrU2VwCF-o7E5sc8FN_LBs3Q-vNMBf7N4rm0PAWFRo5QWw@mail.gmail.com> (raw)
In-Reply-To: <20240428.171236-tangy.giblet.idle.helpline-y9LqufL7EAAV@cyphar.com>

Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote:
> > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > > It is needed to perform an open operation with the creds that were in
> > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > > flag. This allows the process to pre-open some dirs and switch eUID
> > > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > > the possibility to open/create files within the pre-opened directory set.
> > >
> >
> > I’ve been contemplating this, and I want to propose a different solution.
> >
> > First, the problem Stas is solving is quite narrow and doesn’t
> > actually need kernel support: if I want to write a user program that
> > sandboxes itself, I have at least three solutions already.  I can make
> > a userns and a mountns; I can use landlock; and I can have a separate
> > process that brokers filesystem access using SCM_RIGHTS.
> >
> > But what if I want to run a container, where the container can access
> > a specific host directory, and the contained application is not aware
> > of the exact technology being used?  I recently started using
> > containers in anger in a production setting, and “anger” was
> > definitely the right word: binding part of a filesystem in is
> > *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
> > Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> > actually gave up on making one of my use cases work on a Fedora
> > system.
> >
> > Here’s what I wanted to do, logically, in production: pick a host
> > directory, pick a host *principal* (UID, GID, label, etc), and have
> > the *entire container* access the directory as that principal. This is
> > what happens automatically if I run the whole container as a userns
> > with only a single UID mapped, but I don’t really want to do that for
> > a whole variety and of reasons.
> >
> > So maybe reimagining Stas’ feature a bit can actually solve this
> > problem.  Instead of a special dirfd, what if there was a special
> > subtree (in the sense of open_tree) that captures a set of creds and
> > does all opens inside the subtree using those creds?
> >
> > This isn’t a fully formed proposal, but I *think* it should be
> > generally fairly safe for even an unprivileged user to clone a subtree
> > with a specific flag set to do this. Maybe a capability would be
> > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> > this to a daemon if a privilege is needed, and getting the API right
> > might be a bit tricky.
>
> Tying this to an actual mount rather than a file handle sounds like a
> more plausible proposal than OA2_CRED_INHERIT, but it just seems that
> this is going to re-create all of the work that went into id-mapped
> mounts but with the extra-special step of making the generic VFS
> permissions no longer work normally (unless the idea is that everything
> would pretend to be owned by current_fsuid()?).

I was assuming that the owner uid and gid would be show to stat, etc
as usual.  But the permission checks would be done against the
captured creds.

>
> IMHO it also isn't enough to just make open work, you need to make all
> operations work (which leads to a non-trivial amount of
> filesystem-specific handling), which is just idmapped mounts. A lot of
> work was put into making sure that is safe, and collapsing owners seems
> like it will cause a lot of headaches.
>
> I also find it somewhat amusing that this proposal is to basically give
> up on multi-user permissions for this one directory tree because it's
> too annoying to deal with. In that case, isn't chmod 777 a simpler
> solution? (I'm being a bit flippant, of course there is a difference,
> but the net result is that all users in the container would have the
> same permissions with all of the fun issues that implies.)
>
> In short, AFAICS idmapped mounts pretty much solve this problem (minus
> the ability to collapse users, which I suspect is not a good idea in
> general)?
>

With my kernel hat on, maybe I agree.  But with my *user* hat on, I
think I pretty strongly disagree.  Look, idmapis lousy for
unprivileged use:

$ install -m 0700 -d test_directory
$ echo 'hi there' >test_directory/file
$ podman run -it --rm
--mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
# cat /tmp/file
hi there

<-- Hey, look, this kind of works!

# setpriv --reuid=1 ls /tmp
ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!


Obviously this is a made up example.  But it's quite analogous to a
real example.  Suppose I want to make a directory that will contain
some MySQL data.  I don't want to share this directory with anyone
else, so I set its mode to 0700.  Then I want to fire up an
unprivileged MySQL container, so I build or download it, and then I
run it and bind my directory to /var/lib/mysql and I run it.  I don't
need to think about UIDs or anything because it's 2024 and containers
just work.  Okay, I need to setenforce 0 because I'm on Fedora and
SELinux makes absolutely no sense in a container world, but I can live
with that.

Except that it doesn't work!  Because unless I want to manually futz
with the idmaps to get mysql to have access to the directory inside
the container, only *root* gets to get in.  But I bet that even
futzing with the idmap doesn't work, because software like mysql often
expects that root *and* a user can access data.  And some software
even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory.
Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
could learn a way to squash multiple ids down to one.  Or maybe
something like my silly credential-capturing mount proposal could
work.  But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any
opinion as to whether squashing UID is practical or whether there's
any sensible way to specify it in the configuration.

> On Apr 29, 2024, at 2:12 AM, Christian Brauner <brauner@kernel.org> wrote:
>
> Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
> move_mount(). I rewrote the bind-mount logic in systemd based on that
> and util-linux uses that as well now.
> https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
>

Yep, I remember that.

>> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
>> actually gave up on making one of my use cases work on a Fedora
>> system.
>>
>> Here’s what I wanted to do, logically, in production: pick a host
>> directory, pick a host *principal* (UID, GID, label, etc), and have
>> the *entire container* access the directory as that principal. This is
>> what happens automatically if I run the whole container as a userns
>> with only a single UID mapped, but I don’t really want to do that for
>> a whole variety and of reasons.
>
> You're describing idmapped mounts for the most part which are upstream
> and are used in exactly that way by a lot of userspace.
>

See above...

>>
>> So maybe reimagining Stas’ feature a bit can actually solve this
>> problem.  Instead of a special dirfd, what if there was a special
>> subtree (in the sense of open_tree) that captures a set of creds and
>> does all opens inside the subtree using those creds?
>
> That would mean override creds in the VFS layer when accessing a
> specific subtree which is a terrible idea imho. Not just because it will
> quickly become a potential dos when you do that with a lot of subtrees
> it will also have complex interactions with overlayfs.

I was deliberately talking about semantics, not implementation. This
may well be impossible to implement straightforwardly.

  reply	other threads:[~2024-05-06 17:30 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
2024-04-26 13:33 ` [PATCH v5 1/3] fs: reorganize path_openat() Stas Sergeev
2024-04-26 13:33 ` [PATCH v5 2/3] open: add O_CRED_ALLOW flag Stas Sergeev
2024-04-27  2:12   ` kernel test robot
2024-04-26 13:33 ` [PATCH v5 3/3] openat2: add OA2_CRED_INHERIT flag Stas Sergeev
2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
2024-04-28 17:39   ` stsp
2024-04-28 19:15     ` stsp
2024-04-28 20:19     ` Andy Lutomirski
2024-04-28 21:14       ` stsp
2024-04-28 21:30         ` Andy Lutomirski
2024-04-28 22:12           ` stsp
2024-04-29  1:12             ` stsp
2024-04-29  9:12   ` Christian Brauner
2024-05-06  7:13   ` Aleksa Sarai
2024-05-06 17:29     ` Andy Lutomirski [this message]
2024-05-06 17:34       ` Andy Lutomirski
2024-05-06 19:34       ` David Laight
2024-05-06 21:53         ` Andy Lutomirski
2024-05-07  7:42       ` Christian Brauner
2024-05-07 20:38         ` Andy Lutomirski
2024-05-08  7:32           ` Christian Brauner
2024-05-08 17:30             ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALCETrU2VwCF-o7E5sc8FN_LBs3Q-vNMBf7N4rm0PAWFRo5QWw@mail.gmail.com \
    --to=luto@amacapital.net \
    --cc=David.Laight@aculab.com \
    --cc=alex.aring@gmail.com \
    --cc=brauner@kernel.org \
    --cc=cgzones@googlemail.com \
    --cc=chuck.lever@oracle.com \
    --cc=cyphar@cyphar.com \
    --cc=ebiederm@xmission.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=metze@samba.org \
    --cc=pbonzini@redhat.com \
    --cc=serge@hallyn.com \
    --cc=stsp2@yandex.ru \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).