[PATCH v3 0/4] querying mount attributes

Linux-api Archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/4] querying mount attributes
@ 2023-09-28 13:01 Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 1/4] add unique mount ID Miklos Szeredi
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Miklos Szeredi @ 2023-09-28 13:01 UTC (permalink / raw
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, linux-man, linux-security-module,
	Karel Zak, Ian Kent, David Howells, Linus Torvalds, Al Viro,
	Christian Brauner, Amir Goldstein, Matthew House, Florian Weimer,
	Arnd Bergmann

Implement mount querying syscalls agreed on at LSF/MM 2023.

Features:

 - statx-like want/got mask

 - allows returning ascii strings (fs type, root, mount point)

 - returned buffer is relocatable (no pointers)


Still missing:

 - man pages

 - kselftest

 - syscalls on non-x86 archs


Please find the test utility at the end of this mail.

  Usage: statmnt [-l] (mnt_id|path)


Git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git#statmount-v3


Changes v1..v3:

 - rename statmnt(2) -> statmount(2)

 - rename listmnt(2) -> listmount(2)

 - make ABI 32bit compatible by passing 64bit args in a struct (tested on
   i386 and x32)

 - only accept new 64bit mount IDs

 - fix compile on !CONFIG_PROC_FS

 - call security_sb_statfs() in both syscalls

 - make lookup_mnt_in_ns() static

 - add LISTMOUNT_UNREACHABLE flag to listmnt() to explicitly ask for
   listing unreachable mounts

 - remove .sb_opts

 - remove subtype from .fs_type

 - return the number of bytes used (including strings) in .size

 - rename .mountpoint -> .mnt_point

 - point strings by an offset against char[] VLA at the end of the struct.
   E.g. printf("fs_type: %s\n", st->str + st->fs_type);

 - don't save string lengths

 - extend spare space in struct statmnt (complete size is now 512 bytes)


---
Miklos Szeredi (4):
  add unique mount ID
  namespace: extract show_path() helper
  add statmount(2) syscall
  add listmount(2) syscall

 arch/x86/entry/syscalls/syscall_32.tbl |   2 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 fs/internal.h                          |   2 +
 fs/mount.h                             |   3 +-
 fs/namespace.c                         | 365 +++++++++++++++++++++++++
 fs/proc_namespace.c                    |  10 +-
 fs/stat.c                              |   9 +-
 fs/statfs.c                            |   1 +
 include/linux/syscalls.h               |   8 +
 include/uapi/asm-generic/unistd.h      |   8 +-
 include/uapi/linux/mount.h             |  59 ++++
 include/uapi/linux/stat.h              |   1 +
 12 files changed, 459 insertions(+), 11 deletions(-)

-- 
2.41.0

=== statmnt.c ===
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <err.h>

/*
 * Structure for getting mount/superblock/filesystem info with statmount(2).
 *
 * The interface is similar to statx(2): individual fields or groups can be
 * selected with the @mask argument of statmount().  Kernel will set the @mask
 * field according to the supported fields.
 *
 * If string fields are selected, then the caller needs to pass a buffer that
 * has space after the fixed part of the structure.  Nul terminated strings are
 * copied there and offsets relative to @str are stored in the relevant fields.
 * If the buffer is too small, then EOVERFLOW is returned.  The actually used
 * size is returned in @size.
 */
struct statmnt {
	__u32 size;		/* Total size, including strings */
	__u32 __spare1;
	__u64 mask;		/* What results were written */
	__u32 sb_dev_major;	/* Device ID */
	__u32 sb_dev_minor;
	__u64 sb_magic;		/* ..._SUPER_MAGIC */
	__u32 sb_flags;		/* MS_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
	__u32 fs_type;		/* [str] Filesystem type */
	__u64 mnt_id;		/* Unique ID of mount */
	__u64 mnt_parent_id;	/* Unique ID of parent (for root == mnt_id) */
	__u32 mnt_id_old;	/* Reused IDs used in proc/.../mountinfo */
	__u32 mnt_parent_id_old;
	__u64 mnt_attr;		/* MOUNT_ATTR_... */
	__u64 mnt_propagation;	/* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
	__u64 mnt_peer_group;	/* ID of shared peer group */
	__u64 mnt_master;	/* Mount receives propagation from this ID */
	__u64 propagate_from;	/* Propagation from in current namespace */
	__u32 mnt_root;		/* [str] Root of mount relative to root of fs */
	__u32 mnt_point;	/* [str] Mountpoint relative to current root */
	__u64 __spare2[50];
	char str[];		/* Variable size part containing strings */
};

/*
 * To be used on the kernel ABI only for passing 64bit arguments to statmount(2)
 */
struct __mount_arg {
	__u64 mnt_id;
	__u64 request_mask;
};

/*
 * @mask bits for statmount(2)
 */
#define STMT_SB_BASIC		0x00000001U     /* Want/got sb_... */
#define STMT_MNT_BASIC		0x00000002U	/* Want/got mnt_... */
#define STMT_PROPAGATE_FROM	0x00000004U	/* Want/got propagate_from */
#define STMT_MNT_ROOT		0x00000008U	/* Want/got mnt_root  */
#define STMT_MNT_POINT		0x00000010U	/* Want/got mnt_point */
#define STMT_FS_TYPE		0x00000020U	/* Want/got fs_type */

/* listmount(2) flags */
#define LISTMOUNT_UNREACHABLE	0x01	/* List unreachable mounts too */

#define __NR_statmount   454
#define __NR_listmount   455

#define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */


static void free_if_neq(void *p, const void *q)
{
	if (p != q)
		free(p);
}

static struct statmnt *statmount(uint64_t mnt_id, uint64_t mask, unsigned int flags)
{
	struct __mount_arg arg = {
		.mnt_id = mnt_id,
		.request_mask = mask,
	};
	union {
		struct statmnt m;
		char s[4096];
	} buf;
	struct statmnt *ret, *mm = &buf.m;
	size_t bufsize = sizeof(buf);

	while (syscall(__NR_statmount, &arg, mm, bufsize, flags) == -1) {
		free_if_neq(mm, &buf.m);
		if (errno != EOVERFLOW)
			return NULL;
		bufsize = MAX(1 << 15, bufsize << 1);
		mm = malloc(bufsize);
		if (!mm)
			return NULL;
	}
	ret = malloc(mm->size);
	if (ret)
		memcpy(ret, mm, mm->size);
	free_if_neq(mm, &buf.m);

	return ret;
}

static int listmount(uint64_t mnt_id, uint64_t **listp, unsigned int flags)
{
	struct __mount_arg arg = {
		.mnt_id = mnt_id,
	};
	uint64_t buf[512];
	size_t bufsize = sizeof(buf);
	uint64_t *ret, *ll = buf;
	long len;

	while ((len = syscall(__NR_listmount, &arg, ll, bufsize / sizeof(buf[0]), flags)) == -1) {
		free_if_neq(ll, buf);
		if (errno != EOVERFLOW)
			return -1;
		bufsize = MAX(1 << 15, bufsize << 1);
		ll = malloc(bufsize);
		if (!ll)
			return -1;
	}
	bufsize = len * sizeof(buf[0]);
	ret = malloc(bufsize);
	if (!ret)
		return -1;

	*listp = ret;
	memcpy(ret, ll, bufsize);
	free_if_neq(ll, buf);

	return len;
}


int main(int argc, char *argv[])
{
	struct statmnt *st;
	char *end;
	const char *arg = argv[1];
	int res;
	int list = 0;
	uint64_t mask = STMT_SB_BASIC | STMT_MNT_BASIC | STMT_PROPAGATE_FROM | STMT_MNT_ROOT | STMT_MNT_POINT | STMT_FS_TYPE;
	uint64_t mnt_id;

	if (arg && strcmp(arg, "-l") == 0) {
		list = 1;
		arg = argv[2];
	}
	if (argc != list + 2)
		errx(1, "usage: %s [-l] (mnt_id|path)", argv[0]);

	mnt_id = strtoll(arg, &end, 0);
	if (!mnt_id || *end != '\0') {
		struct statx sx;

		res = statx(AT_FDCWD, arg, 0, STATX_MNT_ID_UNIQUE, &sx);
		if (res == -1)
			err(1, "%s", arg);

		if (!(sx.stx_mask & (STATX_MNT_ID | STATX_MNT_ID_UNIQUE)))
			errx(1, "Sorry, no mount ID");

		mnt_id = sx.stx_mnt_id;
	}

	if (list) {
		uint64_t *list;
		int num, i;

		res = listmount(mnt_id, &list, LISTMOUNT_UNREACHABLE);
		if (res == -1)
			err(1, "listmnt(%llu)", mnt_id);

		num = res;
		for (i = 0; i < num; i++) {
			printf("0x%llx", list[i]);

			st = statmount(list[i], STMT_MNT_POINT, 0);
			if (!st) {
				printf("\t[%s]\n", strerror(errno));
			} else {
				printf("\t%s\n", (st->mask & STMT_MNT_POINT) ? st->str + st->mnt_point : "???");
			}
			free(st);
		}
		free(list);

		return 0;
	}

	st = statmount(mnt_id, mask, 0);
	if (!st)
		err(1, "statmnt(%llu)", mnt_id);

	printf("size: %u\n", st->size);
	printf("mask: 0x%llx\n", st->mask);
	if (st->mask & STMT_SB_BASIC) {
		printf("sb_dev_major: %u\n", st->sb_dev_major);
		printf("sb_dev_minor: %u\n", st->sb_dev_minor);
		printf("sb_magic: 0x%llx\n", st->sb_magic);
		printf("sb_flags: 0x%08x\n", st->sb_flags);
	}
	if (st->mask & STMT_MNT_BASIC) {
		printf("mnt_id: 0x%llx\n", st->mnt_id);
		printf("mnt_parent_id: 0x%llx\n", st->mnt_parent_id);
		printf("mnt_id_old: %u\n", st->mnt_id_old);
		printf("mnt_parent_id_old: %u\n", st->mnt_parent_id_old);
		printf("mnt_attr: 0x%08llx\n", st->mnt_attr);
		printf("mnt_propagation: %s%s%s%s\n",
		       st->mnt_propagation & MS_SHARED ? "shared," : "",
		       st->mnt_propagation & MS_SLAVE ? "slave," : "",
		       st->mnt_propagation & MS_UNBINDABLE ? "unbindable," : "",
		       st->mnt_propagation & MS_PRIVATE ? "private" : "");
		printf("mnt_peer_group: %llu\n", st->mnt_peer_group);
		printf("mnt_master: %llu\n", st->mnt_master);
	}
	if (st->mask & STMT_PROPAGATE_FROM)
		printf("propagate_from: %llu\n", st->propagate_from);
	if (st->mask & STMT_MNT_ROOT)
		printf("mnt_root: %u <%s>\n", st->mnt_root, st->str + st->mnt_root);
	if (st->mask & STMT_MNT_POINT)
		printf("mnt_point: %u <%s>\n", st->mnt_point, st->str + st->mnt_point);
	if (st->mask & STMT_FS_TYPE)
		printf("fs_type: %u <%s>\n", st->fs_type, st->str + st->fs_type);
	free(st);

	return 0;
}


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/4] add unique mount ID
  2023-09-28 13:01 [PATCH v3 0/4] querying mount attributes Miklos Szeredi
@ 2023-09-28 13:01 ` Miklos Szeredi
  2023-10-05 15:52   ` Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 2/4] namespace: extract show_path() helper Miklos Szeredi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2023-09-28 13:01 UTC (permalink / raw
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, linux-man, linux-security-module,
	Karel Zak, Ian Kent, David Howells, Linus Torvalds, Al Viro,
	Christian Brauner, Amir Goldstein, Matthew House, Florian Weimer,
	Arnd Bergmann

If a mount is released then its mnt_id can immediately be reused.  This is
bad news for user interfaces that want to uniquely identify a mount.

Implementing a unique mount ID is trivial (use a 64bit counter).
Unfortunately userspace assumes 32bit size and would overflow after the
counter reaches 2^32.

Introduce a new 64bit ID alongside the old one.  Initialize the counter to
2^32, this guarantees that the old and new IDs are never mixed up.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/mount.h                | 3 ++-
 fs/namespace.c            | 4 ++++
 fs/stat.c                 | 9 +++++++--
 include/uapi/linux/stat.h | 1 +
 4 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 130c07c2f8d2..a14f762b3f29 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,7 +72,8 @@ struct mount {
 	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
 #endif
-	int mnt_id;			/* mount identifier */
+	int mnt_id;			/* mount identifier, reused */
+	u64 mnt_id_unique;		/* mount ID unique until reboot */
 	int mnt_group_id;		/* peer group identifier */
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
diff --git a/fs/namespace.c b/fs/namespace.c
index e157efc54023..e02bc5f41c7b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -68,6 +68,9 @@ static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
 
+/* Don't allow confusion with old 32bit mount ID */
+static atomic64_t mnt_id_ctr = ATOMIC64_INIT(1ULL << 32);
+
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
@@ -131,6 +134,7 @@ static int mnt_alloc_id(struct mount *mnt)
 	if (res < 0)
 		return res;
 	mnt->mnt_id = res;
+	mnt->mnt_id_unique = atomic64_inc_return(&mnt_id_ctr);
 	return 0;
 }
 
diff --git a/fs/stat.c b/fs/stat.c
index 6e60389d6a15..e61e0172e191 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -280,8 +280,13 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
 
 	error = vfs_getattr(&path, stat, request_mask, flags);
 
-	stat->mnt_id = real_mount(path.mnt)->mnt_id;
-	stat->result_mask |= STATX_MNT_ID;
+	if (request_mask & STATX_MNT_ID_UNIQUE) {
+		stat->mnt_id = real_mount(path.mnt)->mnt_id_unique;
+		stat->result_mask |= STATX_MNT_ID_UNIQUE;
+	} else {
+		stat->mnt_id = real_mount(path.mnt)->mnt_id;
+		stat->result_mask |= STATX_MNT_ID;
+	}
 
 	if (path.mnt->mnt_root == path.dentry)
 		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..2f2ee82d5517 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -154,6 +154,7 @@ struct statx {
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
+#define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/4] add unique mount ID
  2023-09-28 13:01 ` [PATCH v3 1/4] add unique mount ID Miklos Szeredi
@ 2023-10-05 15:52   ` Miklos Szeredi
  2023-10-06 11:44     ` Amir Goldstein
  0 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2023-10-05 15:52 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, Ian Kent, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann, Paul Moore

On Thu, 28 Sept 2023 at 15:03, Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> If a mount is released then its mnt_id can immediately be reused.  This is
> bad news for user interfaces that want to uniquely identify a mount.
>
> Implementing a unique mount ID is trivial (use a 64bit counter).
> Unfortunately userspace assumes 32bit size and would overflow after the
> counter reaches 2^32.
>
> Introduce a new 64bit ID alongside the old one.  Initialize the counter to
> 2^32, this guarantees that the old and new IDs are never mixed up.

It occurred to me that it might make sense to make this counter
per-namespace.  That would allow more separation between namespaces,
like preventing the observation of mount creations in other
namespaces.

Does a global number make any sense?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/4] add unique mount ID
  2023-10-05 15:52   ` Miklos Szeredi
@ 2023-10-06 11:44     ` Amir Goldstein
  2023-10-06 12:48       ` Miklos Szeredi
  0 siblings, 1 reply; 22+ messages in thread
From: Amir Goldstein @ 2023-10-06 11:44 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, Ian Kent, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Matthew House,
	Florian Weimer, Arnd Bergmann, Paul Moore

On Thu, Oct 5, 2023 at 6:52 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 28 Sept 2023 at 15:03, Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > If a mount is released then its mnt_id can immediately be reused.  This is
> > bad news for user interfaces that want to uniquely identify a mount.
> >
> > Implementing a unique mount ID is trivial (use a 64bit counter).
> > Unfortunately userspace assumes 32bit size and would overflow after the
> > counter reaches 2^32.
> >
> > Introduce a new 64bit ID alongside the old one.  Initialize the counter to
> > 2^32, this guarantees that the old and new IDs are never mixed up.
>
> It occurred to me that it might make sense to make this counter
> per-namespace.  That would allow more separation between namespaces,
> like preventing the observation of mount creations in other
> namespaces.
>

Preventing the observation of mount creations in other mount namespaces
is independent of whether a global mntid namespace is used.

> Does a global number make any sense?
>

I think global mntid namepsace makes notifications API a lot easier.
A process (e.g. systemd) may set marks to watch new mounts on
different mount namespaces.

If mntid could collide in different mount namepsaces, we will either
need to describe the mount namespace in the event or worse,
map child mount namespace mntid to parent mount namespace
like with uids.

If we use a global mntid namespace, multi mount namespace
watching becomes much much easier.

Regarding the possible scopes for watching added/removed mounts
we could support:
- watch parent mount for children mounts (akin to inotify directory watch)
- watch all mounts of a filesystem
- watch all mounts in mount namespace
- watch on entire system

Not sure which of the above we will end up implementing, but the
first two can use existing fanotify mount/sb marks.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/4] add unique mount ID
  2023-10-06 11:44     ` Amir Goldstein
@ 2023-10-06 12:48       ` Miklos Szeredi
  0 siblings, 0 replies; 22+ messages in thread
From: Miklos Szeredi @ 2023-10-06 12:48 UTC (permalink / raw
  To: Amir Goldstein
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, Ian Kent, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Matthew House,
	Florian Weimer, Arnd Bergmann, Paul Moore

On Fri, 6 Oct 2023 at 13:44, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Oct 5, 2023 at 6:52 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Thu, 28 Sept 2023 at 15:03, Miklos Szeredi <mszeredi@redhat.com> wrote:
> > >
> > > If a mount is released then its mnt_id can immediately be reused.  This is
> > > bad news for user interfaces that want to uniquely identify a mount.
> > >
> > > Implementing a unique mount ID is trivial (use a 64bit counter).
> > > Unfortunately userspace assumes 32bit size and would overflow after the
> > > counter reaches 2^32.
> > >
> > > Introduce a new 64bit ID alongside the old one.  Initialize the counter to
> > > 2^32, this guarantees that the old and new IDs are never mixed up.
> >
> > It occurred to me that it might make sense to make this counter
> > per-namespace.  That would allow more separation between namespaces,
> > like preventing the observation of mount creations in other
> > namespaces.
> >
>
> Preventing the observation of mount creations in other mount namespaces
> is independent of whether a global mntid namespace is used.

A local ID space makes observation of mount creations in other
namespaces impossible.  While a global ID space does not.  ID
allocation could be designed in a way that makes observation
impossible, but that basically boils down to allocating separate
ranges to each namespace, which is equivalent to identifying mounts by
an {NSID, MNTID} pair.

>
> > Does a global number make any sense?
> >
>
> I think global mntid namepsace makes notifications API a lot easier.
> A process (e.g. systemd) may set marks to watch new mounts on
> different mount namespaces.
>
> If mntid could collide in different mount namepsaces, we will either
> need to describe the mount namespace in the event or worse,
> map child mount namespace mntid to parent mount namespace
> like with uids.

Mntids are quite different from uids in that a certain ID is only
valid in a certain namespace.   There's no inheritance or sharing of
mounts.  Also mounts cannot be moved from one namespace to another
(with the exception of mounts created in an anonymous namespace).

> If we use a global mntid namespace, multi mount namespace
> watching becomes much much easier.

If the watcher wants to interpret the received event it will have to
know the namespace anyway.  Otherwise it's just a meaningless number.

> Regarding the possible scopes for watching added/removed mounts
> we could support:
> - watch parent mount for children mounts (akin to inotify directory watch)

This probably makes sense.

> - watch all mounts of a filesystem

I don't see a use case.

> - watch all mounts in mount namespace

Yes.

> - watch on entire system

Again, I don't see how this could make sense.  Each time
unshare(CLONE_NEWNS) is invoked a storm of mount creation events will
happen.

I'd imagine that watching mounts is primarily to complement the
directory tree notifications, so that all changes to the path lookup
namespace of a process could be monitored.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 2/4] namespace: extract show_path() helper
  2023-09-28 13:01 [PATCH v3 0/4] querying mount attributes Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 1/4] add unique mount ID Miklos Szeredi
@ 2023-09-28 13:01 ` Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 3/4] add statmount(2) syscall Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 4/4] add listmount(2) syscall Miklos Szeredi
  3 siblings, 0 replies; 22+ messages in thread
From: Miklos Szeredi @ 2023-09-28 13:01 UTC (permalink / raw
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, linux-man, linux-security-module,
	Karel Zak, Ian Kent, David Howells, Linus Torvalds, Al Viro,
	Christian Brauner, Amir Goldstein, Matthew House, Florian Weimer,
	Arnd Bergmann

To be used by the statmount(2) syscall as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/internal.h       |  2 ++
 fs/namespace.c      |  9 +++++++++
 fs/proc_namespace.c | 10 +++-------
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index d64ae03998cc..0c4f4cf2ff5a 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -83,6 +83,8 @@ int path_mount(const char *dev_name, struct path *path,
 		const char *type_page, unsigned long flags, void *data_page);
 int path_umount(struct path *path, int flags);
 
+int show_path(struct seq_file *m, struct dentry *root);
+
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index e02bc5f41c7b..c3a41200fe70 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4678,6 +4678,15 @@ SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path,
 	return err;
 }
 
+int show_path(struct seq_file *m, struct dentry *root)
+{
+	if (root->d_sb->s_op->show_path)
+		return root->d_sb->s_op->show_path(m, root);
+
+	seq_dentry(m, root, " \t\n\\");
+	return 0;
+}
+
 static void __init init_mount_tree(void)
 {
 	struct vfsmount *mnt;
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 250eb5bf7b52..5638ad419f52 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -142,13 +142,9 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 
 	seq_printf(m, "%i %i %u:%u ", r->mnt_id, r->mnt_parent->mnt_id,
 		   MAJOR(sb->s_dev), MINOR(sb->s_dev));
-	if (sb->s_op->show_path) {
-		err = sb->s_op->show_path(m, mnt->mnt_root);
-		if (err)
-			goto out;
-	} else {
-		seq_dentry(m, mnt->mnt_root, " \t\n\\");
-	}
+	err = show_path(m, mnt->mnt_root);
+	if (err)
+		goto out;
 	seq_putc(m, ' ');
 
 	/* mountpoints outside of chroot jail will give SEQ_SKIP on this */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 3/4] add statmount(2) syscall
  2023-09-28 13:01 [PATCH v3 0/4] querying mount attributes Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 1/4] add unique mount ID Miklos Szeredi
  2023-09-28 13:01 ` [PATCH v3 2/4] namespace: extract show_path() helper Miklos Szeredi
@ 2023-09-28 13:01 ` Miklos Szeredi
  2023-09-29  0:42   ` Ian Kent
  2023-10-04 19:26   ` Paul Moore
  2023-09-28 13:01 ` [PATCH v3 4/4] add listmount(2) syscall Miklos Szeredi
  3 siblings, 2 replies; 22+ messages in thread
From: Miklos Szeredi @ 2023-09-28 13:01 UTC (permalink / raw
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, linux-man, linux-security-module,
	Karel Zak, Ian Kent, David Howells, Linus Torvalds, Al Viro,
	Christian Brauner, Amir Goldstein, Matthew House, Florian Weimer,
	Arnd Bergmann

Add a way to query attributes of a single mount instead of having to parse
the complete /proc/$PID/mountinfo, which might be huge.

Lookup the mount the new 64bit mount ID.  If a mount needs to be queried
based on path, then statx(2) can be used to first query the mount ID
belonging to the path.

Design is based on a suggestion by Linus:

  "So I'd suggest something that is very much like "statfsat()", which gets
   a buffer and a length, and returns an extended "struct statfs" *AND*
   just a string description at the end."

The interface closely mimics that of statx.

Handle ASCII attributes by appending after the end of the structure (as per
above suggestion).  Pointers to strings are stored in u64 members to make
the structure the same regardless of pointer size.  Strings are nul
terminated.

Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/namespace.c                         | 283 +++++++++++++++++++++++++
 fs/statfs.c                            |   1 +
 include/linux/syscalls.h               |   5 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/mount.h             |  56 +++++
 7 files changed, 351 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..317b1320ad18 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,4 @@
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	i386	cachestat		sys_cachestat
 452	i386	fchmodat2		sys_fchmodat2
+454	i386	statmount		sys_statmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..7312c440978f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,7 @@
 451	common	cachestat		sys_cachestat
 452	common	fchmodat2		sys_fchmodat2
 453	64	map_shadow_stack	sys_map_shadow_stack
+454	common	statmount		sys_statmount
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/namespace.c b/fs/namespace.c
index c3a41200fe70..3326ba2b2810 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4687,6 +4687,289 @@ int show_path(struct seq_file *m, struct dentry *root)
 	return 0;
 }
 
+static struct vfsmount *lookup_mnt_in_ns(u64 id, struct mnt_namespace *ns)
+{
+	struct mount *mnt;
+	struct vfsmount *res = NULL;
+
+	lock_ns_list(ns);
+	list_for_each_entry(mnt, &ns->list, mnt_list) {
+		if (!mnt_is_cursor(mnt) && id == mnt->mnt_id_unique) {
+			res = &mnt->mnt;
+			break;
+		}
+	}
+	unlock_ns_list(ns);
+	return res;
+}
+
+struct stmt_state {
+	struct statmnt __user *const buf;
+	size_t const bufsize;
+	struct vfsmount *const mnt;
+	u64 const mask;
+	struct seq_file seq;
+	struct path root;
+	struct statmnt sm;
+	size_t pos;
+	int err;
+};
+
+typedef int (*stmt_func_t)(struct stmt_state *);
+
+static int stmt_string_seq(struct stmt_state *s, stmt_func_t func)
+{
+	size_t rem = s->bufsize - s->pos - sizeof(s->sm);
+	struct seq_file *seq = &s->seq;
+	int ret;
+
+	seq->count = 0;
+	seq->size = min(seq->size, rem);
+	seq->buf = kvmalloc(seq->size, GFP_KERNEL_ACCOUNT);
+	if (!seq->buf)
+		return -ENOMEM;
+
+	ret = func(s);
+	if (ret)
+		return ret;
+
+	if (seq_has_overflowed(seq)) {
+		if (seq->size == rem)
+			return -EOVERFLOW;
+		seq->size *= 2;
+		if (seq->size > MAX_RW_COUNT)
+			return -ENOMEM;
+		kvfree(seq->buf);
+		return 0;
+	}
+
+	/* Done */
+	return 1;
+}
+
+static void stmt_string(struct stmt_state *s, u64 mask, stmt_func_t func,
+		       u32 *str)
+{
+	int ret = s->pos + sizeof(s->sm) >= s->bufsize ? -EOVERFLOW : 0;
+	struct statmnt *sm = &s->sm;
+	struct seq_file *seq = &s->seq;
+
+	if (s->err || !(s->mask & mask))
+		return;
+
+	seq->size = PAGE_SIZE;
+	while (!ret)
+		ret = stmt_string_seq(s, func);
+
+	if (ret < 0) {
+		s->err = ret;
+	} else {
+		seq->buf[seq->count++] = '\0';
+		if (copy_to_user(s->buf->str + s->pos, seq->buf, seq->count)) {
+			s->err = -EFAULT;
+		} else {
+			*str = s->pos;
+			s->pos += seq->count;
+		}
+	}
+	kvfree(seq->buf);
+	sm->mask |= mask;
+}
+
+static void stmt_numeric(struct stmt_state *s, u64 mask, stmt_func_t func)
+{
+	if (s->err || !(s->mask & mask))
+		return;
+
+	s->err = func(s);
+	s->sm.mask |= mask;
+}
+
+static u64 mnt_to_attr_flags(struct vfsmount *mnt)
+{
+	unsigned int mnt_flags = READ_ONCE(mnt->mnt_flags);
+	u64 attr_flags = 0;
+
+	if (mnt_flags & MNT_READONLY)
+		attr_flags |= MOUNT_ATTR_RDONLY;
+	if (mnt_flags & MNT_NOSUID)
+		attr_flags |= MOUNT_ATTR_NOSUID;
+	if (mnt_flags & MNT_NODEV)
+		attr_flags |= MOUNT_ATTR_NODEV;
+	if (mnt_flags & MNT_NOEXEC)
+		attr_flags |= MOUNT_ATTR_NOEXEC;
+	if (mnt_flags & MNT_NODIRATIME)
+		attr_flags |= MOUNT_ATTR_NODIRATIME;
+	if (mnt_flags & MNT_NOSYMFOLLOW)
+		attr_flags |= MOUNT_ATTR_NOSYMFOLLOW;
+
+	if (mnt_flags & MNT_NOATIME)
+		attr_flags |= MOUNT_ATTR_NOATIME;
+	else if (mnt_flags & MNT_RELATIME)
+		attr_flags |= MOUNT_ATTR_RELATIME;
+	else
+		attr_flags |= MOUNT_ATTR_STRICTATIME;
+
+	if (is_idmapped_mnt(mnt))
+		attr_flags |= MOUNT_ATTR_IDMAP;
+
+	return attr_flags;
+}
+
+static u64 mnt_to_propagation_flags(struct mount *m)
+{
+	u64 propagation = 0;
+
+	if (IS_MNT_SHARED(m))
+		propagation |= MS_SHARED;
+	if (IS_MNT_SLAVE(m))
+		propagation |= MS_SLAVE;
+	if (IS_MNT_UNBINDABLE(m))
+		propagation |= MS_UNBINDABLE;
+	if (!propagation)
+		propagation |= MS_PRIVATE;
+
+	return propagation;
+}
+
+static int stmt_sb_basic(struct stmt_state *s)
+{
+	struct super_block *sb = s->mnt->mnt_sb;
+
+	s->sm.sb_dev_major = MAJOR(sb->s_dev);
+	s->sm.sb_dev_minor = MINOR(sb->s_dev);
+	s->sm.sb_magic = sb->s_magic;
+	s->sm.sb_flags = sb->s_flags & (SB_RDONLY|SB_SYNCHRONOUS|SB_DIRSYNC|SB_LAZYTIME);
+
+	return 0;
+}
+
+static int stmt_mnt_basic(struct stmt_state *s)
+{
+	struct mount *m = real_mount(s->mnt);
+
+	s->sm.mnt_id = m->mnt_id_unique;
+	s->sm.mnt_parent_id = m->mnt_parent->mnt_id_unique;
+	s->sm.mnt_id_old = m->mnt_id;
+	s->sm.mnt_parent_id_old = m->mnt_parent->mnt_id;
+	s->sm.mnt_attr = mnt_to_attr_flags(&m->mnt);
+	s->sm.mnt_propagation = mnt_to_propagation_flags(m);
+	s->sm.mnt_peer_group = IS_MNT_SHARED(m) ? m->mnt_group_id : 0;
+	s->sm.mnt_master = IS_MNT_SLAVE(m) ? m->mnt_master->mnt_group_id : 0;
+
+	return 0;
+}
+
+static int stmt_propagate_from(struct stmt_state *s)
+{
+	struct mount *m = real_mount(s->mnt);
+
+	if (!IS_MNT_SLAVE(m))
+		return 0;
+
+	s->sm.propagate_from = get_dominating_id(m, &current->fs->root);
+
+	return 0;
+}
+
+static int stmt_mnt_root(struct stmt_state *s)
+{
+	struct seq_file *seq = &s->seq;
+	int err = show_path(seq, s->mnt->mnt_root);
+
+	if (!err && !seq_has_overflowed(seq)) {
+		seq->buf[seq->count] = '\0';
+		seq->count = string_unescape_inplace(seq->buf, UNESCAPE_OCTAL);
+	}
+	return err;
+}
+
+static int stmt_mnt_point(struct stmt_state *s)
+{
+	struct vfsmount *mnt = s->mnt;
+	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
+	int err = seq_path_root(&s->seq, &mnt_path, &s->root, "");
+
+	return err == SEQ_SKIP ? 0 : err;
+}
+
+static int stmt_fs_type(struct stmt_state *s)
+{
+	struct seq_file *seq = &s->seq;
+	struct super_block *sb = s->mnt->mnt_sb;
+
+	seq_puts(seq, sb->s_type->name);
+	return 0;
+}
+
+static int do_statmount(struct stmt_state *s)
+{
+	struct statmnt *sm = &s->sm;
+	struct mount *m = real_mount(s->mnt);
+	size_t copysize = min_t(size_t, s->bufsize, sizeof(*sm));
+	int err;
+
+	err = security_sb_statfs(s->mnt->mnt_root);
+	if (err)
+		return err;
+
+	if (!capable(CAP_SYS_ADMIN) &&
+	    !is_path_reachable(m, m->mnt.mnt_root, &s->root))
+		return -EPERM;
+
+	stmt_numeric(s, STMT_SB_BASIC, stmt_sb_basic);
+	stmt_numeric(s, STMT_MNT_BASIC, stmt_mnt_basic);
+	stmt_numeric(s, STMT_PROPAGATE_FROM, stmt_propagate_from);
+	stmt_string(s, STMT_FS_TYPE, stmt_fs_type, &sm->fs_type);
+	stmt_string(s, STMT_MNT_ROOT, stmt_mnt_root, &sm->mnt_root);
+	stmt_string(s, STMT_MNT_POINT, stmt_mnt_point, &sm->mnt_point);
+
+	if (s->err)
+		return s->err;
+
+	/* Return the number of bytes copied to the buffer */
+	sm->size = copysize + s->pos;
+
+	if (copy_to_user(s->buf, sm, copysize))
+		return -EFAULT;
+
+	return 0;
+}
+
+SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
+		struct statmnt __user *, buf, size_t, bufsize,
+		unsigned int, flags)
+{
+	struct vfsmount *mnt;
+	struct __mount_arg kreq;
+	int ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (copy_from_user(&kreq, req, sizeof(kreq)))
+		return -EFAULT;
+
+	down_read(&namespace_sem);
+	mnt = lookup_mnt_in_ns(kreq.mnt_id, current->nsproxy->mnt_ns);
+	ret = -ENOENT;
+	if (mnt) {
+		struct stmt_state s = {
+			.mask = kreq.request_mask,
+			.buf = buf,
+			.bufsize = bufsize,
+			.mnt = mnt,
+		};
+
+		get_fs_root(current->fs, &s.root);
+		ret = do_statmount(&s);
+		path_put(&s.root);
+	}
+	up_read(&namespace_sem);
+
+	return ret;
+}
+
 static void __init init_mount_tree(void)
 {
 	struct vfsmount *mnt;
diff --git a/fs/statfs.c b/fs/statfs.c
index 96d1c3edf289..cc774c2e2c9a 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -9,6 +9,7 @@
 #include <linux/security.h>
 #include <linux/uaccess.h>
 #include <linux/compat.h>
+#include <uapi/linux/mount.h>
 #include "internal.h"
 
 static int flags_by_mnt(int mnt_flags)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..ba371024d902 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -74,6 +74,8 @@ struct landlock_ruleset_attr;
 enum landlock_rule_type;
 struct cachestat_range;
 struct cachestat;
+struct statmnt;
+struct __mount_arg;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -408,6 +410,9 @@ asmlinkage long sys_statfs64(const char __user *path, size_t sz,
 asmlinkage long sys_fstatfs(unsigned int fd, struct statfs __user *buf);
 asmlinkage long sys_fstatfs64(unsigned int fd, size_t sz,
 				struct statfs64 __user *buf);
+asmlinkage long sys_statmount(const struct __mount_arg __user *req,
+			      struct statmnt __user *buf, size_t bufsize,
+			      unsigned int flags);
 asmlinkage long sys_truncate(const char __user *path, long length);
 asmlinkage long sys_ftruncate(unsigned int fd, unsigned long length);
 #if BITS_PER_LONG == 32
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..8f034e934a2e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,11 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
 
+#define __NR_statmount   454
+__SYSCALL(__NR_statmount, sys_statmount)
+
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 455
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index bb242fdcfe6b..d2c988ab526b 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -138,4 +138,60 @@ struct mount_attr {
 /* List of all mount_attr versions. */
 #define MOUNT_ATTR_SIZE_VER0	32 /* sizeof first published struct */
 
+
+/*
+ * Structure for getting mount/superblock/filesystem info with statmount(2).
+ *
+ * The interface is similar to statx(2): individual fields or groups can be
+ * selected with the @mask argument of statmount().  Kernel will set the @mask
+ * field according to the supported fields.
+ *
+ * If string fields are selected, then the caller needs to pass a buffer that
+ * has space after the fixed part of the structure.  Nul terminated strings are
+ * copied there and offsets relative to @str are stored in the relevant fields.
+ * If the buffer is too small, then EOVERFLOW is returned.  The actually used
+ * size is returned in @size.
+ */
+struct statmnt {
+	__u32 size;		/* Total size, including strings */
+	__u32 __spare1;
+	__u64 mask;		/* What results were written */
+	__u32 sb_dev_major;	/* Device ID */
+	__u32 sb_dev_minor;
+	__u64 sb_magic;		/* ..._SUPER_MAGIC */
+	__u32 sb_flags;		/* MS_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
+	__u32 fs_type;		/* [str] Filesystem type */
+	__u64 mnt_id;		/* Unique ID of mount */
+	__u64 mnt_parent_id;	/* Unique ID of parent (for root == mnt_id) */
+	__u32 mnt_id_old;	/* Reused IDs used in proc/.../mountinfo */
+	__u32 mnt_parent_id_old;
+	__u64 mnt_attr;		/* MOUNT_ATTR_... */
+	__u64 mnt_propagation;	/* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
+	__u64 mnt_peer_group;	/* ID of shared peer group */
+	__u64 mnt_master;	/* Mount receives propagation from this ID */
+	__u64 propagate_from;	/* Propagation from in current namespace */
+	__u32 mnt_root;		/* [str] Root of mount relative to root of fs */
+	__u32 mnt_point;	/* [str] Mountpoint relative to current root */
+	__u64 __spare2[50];
+	char str[];		/* Variable size part containing strings */
+};
+
+/*
+ * To be used on the kernel ABI only for passing 64bit arguments to statmount(2)
+ */
+struct __mount_arg {
+	__u64 mnt_id;
+	__u64 request_mask;
+};
+
+/*
+ * @mask bits for statmount(2)
+ */
+#define STMT_SB_BASIC		0x00000001U     /* Want/got sb_... */
+#define STMT_MNT_BASIC		0x00000002U	/* Want/got mnt_... */
+#define STMT_PROPAGATE_FROM	0x00000004U	/* Want/got propagate_from */
+#define STMT_MNT_ROOT		0x00000008U	/* Want/got mnt_root  */
+#define STMT_MNT_POINT		0x00000010U	/* Want/got mnt_point */
+#define STMT_FS_TYPE		0x00000020U	/* Want/got fs_type */
+
 #endif /* _UAPI_LINUX_MOUNT_H */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/4] add statmount(2) syscall
  2023-09-28 13:01 ` [PATCH v3 3/4] add statmount(2) syscall Miklos Szeredi
@ 2023-09-29  0:42   ` Ian Kent
  2023-09-29  9:10     ` Miklos Szeredi
  2023-10-04 19:26   ` Paul Moore
  1 sibling, 1 reply; 22+ messages in thread
From: Ian Kent @ 2023-09-29  0:42 UTC (permalink / raw
  To: Miklos Szeredi, linux-fsdevel
  Cc: linux-kernel, linux-api, linux-man, linux-security-module,
	Karel Zak, David Howells, Linus Torvalds, Al Viro,
	Christian Brauner, Amir Goldstein, Matthew House, Florian Weimer,
	Arnd Bergmann

On 28/9/23 21:01, Miklos Szeredi wrote:
> Add a way to query attributes of a single mount instead of having to parse
> the complete /proc/$PID/mountinfo, which might be huge.
>
> Lookup the mount the new 64bit mount ID.  If a mount needs to be queried
> based on path, then statx(2) can be used to first query the mount ID
> belonging to the path.
>
> Design is based on a suggestion by Linus:
>
>    "So I'd suggest something that is very much like "statfsat()", which gets
>     a buffer and a length, and returns an extended "struct statfs" *AND*
>     just a string description at the end."
>
> The interface closely mimics that of statx.
>
> Handle ASCII attributes by appending after the end of the structure (as per
> above suggestion).  Pointers to strings are stored in u64 members to make
> the structure the same regardless of pointer size.  Strings are nul
> terminated.
>
> Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>   arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>   arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>   fs/namespace.c                         | 283 +++++++++++++++++++++++++
>   fs/statfs.c                            |   1 +
>   include/linux/syscalls.h               |   5 +
>   include/uapi/asm-generic/unistd.h      |   5 +-
>   include/uapi/linux/mount.h             |  56 +++++
>   7 files changed, 351 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 2d0b1bd866ea..317b1320ad18 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -457,3 +457,4 @@
>   450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
>   451	i386	cachestat		sys_cachestat
>   452	i386	fchmodat2		sys_fchmodat2
> +454	i386	statmount		sys_statmount
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 1d6eee30eceb..7312c440978f 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -375,6 +375,7 @@
>   451	common	cachestat		sys_cachestat
>   452	common	fchmodat2		sys_fchmodat2
>   453	64	map_shadow_stack	sys_map_shadow_stack
> +454	common	statmount		sys_statmount
>   
>   #
>   # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c3a41200fe70..3326ba2b2810 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -4687,6 +4687,289 @@ int show_path(struct seq_file *m, struct dentry *root)
>   	return 0;
>   }
>   
> +static struct vfsmount *lookup_mnt_in_ns(u64 id, struct mnt_namespace *ns)
> +{
> +	struct mount *mnt;
> +	struct vfsmount *res = NULL;
> +
> +	lock_ns_list(ns);
> +	list_for_each_entry(mnt, &ns->list, mnt_list) {
> +		if (!mnt_is_cursor(mnt) && id == mnt->mnt_id_unique) {
> +			res = &mnt->mnt;
> +			break;
> +		}
> +	}
> +	unlock_ns_list(ns);
> +	return res;
> +}

Seems like we might need to consider making (struct mnt_namespace)->list

a hashed list.


The number of mounts could be large, for example people using autofs direct

mount setups.


It's not common for people to have of the order of 8k map entries (for

which there is a trigger mount per entry plus any mounts that have been

automounted) but it does happen. A small setup would be of the order of

1k map entries plus automounted mounts so the benefit is likely still

there to some extent.


Ian

> +
> +struct stmt_state {
> +	struct statmnt __user *const buf;
> +	size_t const bufsize;
> +	struct vfsmount *const mnt;
> +	u64 const mask;
> +	struct seq_file seq;
> +	struct path root;
> +	struct statmnt sm;
> +	size_t pos;
> +	int err;
> +};
> +
> +typedef int (*stmt_func_t)(struct stmt_state *);
> +
> +static int stmt_string_seq(struct stmt_state *s, stmt_func_t func)
> +{
> +	size_t rem = s->bufsize - s->pos - sizeof(s->sm);
> +	struct seq_file *seq = &s->seq;
> +	int ret;
> +
> +	seq->count = 0;
> +	seq->size = min(seq->size, rem);
> +	seq->buf = kvmalloc(seq->size, GFP_KERNEL_ACCOUNT);
> +	if (!seq->buf)
> +		return -ENOMEM;
> +
> +	ret = func(s);
> +	if (ret)
> +		return ret;
> +
> +	if (seq_has_overflowed(seq)) {
> +		if (seq->size == rem)
> +			return -EOVERFLOW;
> +		seq->size *= 2;
> +		if (seq->size > MAX_RW_COUNT)
> +			return -ENOMEM;
> +		kvfree(seq->buf);
> +		return 0;
> +	}
> +
> +	/* Done */
> +	return 1;
> +}
> +
> +static void stmt_string(struct stmt_state *s, u64 mask, stmt_func_t func,
> +		       u32 *str)
> +{
> +	int ret = s->pos + sizeof(s->sm) >= s->bufsize ? -EOVERFLOW : 0;
> +	struct statmnt *sm = &s->sm;
> +	struct seq_file *seq = &s->seq;
> +
> +	if (s->err || !(s->mask & mask))
> +		return;
> +
> +	seq->size = PAGE_SIZE;
> +	while (!ret)
> +		ret = stmt_string_seq(s, func);
> +
> +	if (ret < 0) {
> +		s->err = ret;
> +	} else {
> +		seq->buf[seq->count++] = '\0';
> +		if (copy_to_user(s->buf->str + s->pos, seq->buf, seq->count)) {
> +			s->err = -EFAULT;
> +		} else {
> +			*str = s->pos;
> +			s->pos += seq->count;
> +		}
> +	}
> +	kvfree(seq->buf);
> +	sm->mask |= mask;
> +}
> +
> +static void stmt_numeric(struct stmt_state *s, u64 mask, stmt_func_t func)
> +{
> +	if (s->err || !(s->mask & mask))
> +		return;
> +
> +	s->err = func(s);
> +	s->sm.mask |= mask;
> +}
> +
> +static u64 mnt_to_attr_flags(struct vfsmount *mnt)
> +{
> +	unsigned int mnt_flags = READ_ONCE(mnt->mnt_flags);
> +	u64 attr_flags = 0;
> +
> +	if (mnt_flags & MNT_READONLY)
> +		attr_flags |= MOUNT_ATTR_RDONLY;
> +	if (mnt_flags & MNT_NOSUID)
> +		attr_flags |= MOUNT_ATTR_NOSUID;
> +	if (mnt_flags & MNT_NODEV)
> +		attr_flags |= MOUNT_ATTR_NODEV;
> +	if (mnt_flags & MNT_NOEXEC)
> +		attr_flags |= MOUNT_ATTR_NOEXEC;
> +	if (mnt_flags & MNT_NODIRATIME)
> +		attr_flags |= MOUNT_ATTR_NODIRATIME;
> +	if (mnt_flags & MNT_NOSYMFOLLOW)
> +		attr_flags |= MOUNT_ATTR_NOSYMFOLLOW;
> +
> +	if (mnt_flags & MNT_NOATIME)
> +		attr_flags |= MOUNT_ATTR_NOATIME;
> +	else if (mnt_flags & MNT_RELATIME)
> +		attr_flags |= MOUNT_ATTR_RELATIME;
> +	else
> +		attr_flags |= MOUNT_ATTR_STRICTATIME;
> +
> +	if (is_idmapped_mnt(mnt))
> +		attr_flags |= MOUNT_ATTR_IDMAP;
> +
> +	return attr_flags;
> +}
> +
> +static u64 mnt_to_propagation_flags(struct mount *m)
> +{
> +	u64 propagation = 0;
> +
> +	if (IS_MNT_SHARED(m))
> +		propagation |= MS_SHARED;
> +	if (IS_MNT_SLAVE(m))
> +		propagation |= MS_SLAVE;
> +	if (IS_MNT_UNBINDABLE(m))
> +		propagation |= MS_UNBINDABLE;
> +	if (!propagation)
> +		propagation |= MS_PRIVATE;
> +
> +	return propagation;
> +}
> +
> +static int stmt_sb_basic(struct stmt_state *s)
> +{
> +	struct super_block *sb = s->mnt->mnt_sb;
> +
> +	s->sm.sb_dev_major = MAJOR(sb->s_dev);
> +	s->sm.sb_dev_minor = MINOR(sb->s_dev);
> +	s->sm.sb_magic = sb->s_magic;
> +	s->sm.sb_flags = sb->s_flags & (SB_RDONLY|SB_SYNCHRONOUS|SB_DIRSYNC|SB_LAZYTIME);
> +
> +	return 0;
> +}
> +
> +static int stmt_mnt_basic(struct stmt_state *s)
> +{
> +	struct mount *m = real_mount(s->mnt);
> +
> +	s->sm.mnt_id = m->mnt_id_unique;
> +	s->sm.mnt_parent_id = m->mnt_parent->mnt_id_unique;
> +	s->sm.mnt_id_old = m->mnt_id;
> +	s->sm.mnt_parent_id_old = m->mnt_parent->mnt_id;
> +	s->sm.mnt_attr = mnt_to_attr_flags(&m->mnt);
> +	s->sm.mnt_propagation = mnt_to_propagation_flags(m);
> +	s->sm.mnt_peer_group = IS_MNT_SHARED(m) ? m->mnt_group_id : 0;
> +	s->sm.mnt_master = IS_MNT_SLAVE(m) ? m->mnt_master->mnt_group_id : 0;
> +
> +	return 0;
> +}
> +
> +static int stmt_propagate_from(struct stmt_state *s)
> +{
> +	struct mount *m = real_mount(s->mnt);
> +
> +	if (!IS_MNT_SLAVE(m))
> +		return 0;
> +
> +	s->sm.propagate_from = get_dominating_id(m, &current->fs->root);
> +
> +	return 0;
> +}
> +
> +static int stmt_mnt_root(struct stmt_state *s)
> +{
> +	struct seq_file *seq = &s->seq;
> +	int err = show_path(seq, s->mnt->mnt_root);
> +
> +	if (!err && !seq_has_overflowed(seq)) {
> +		seq->buf[seq->count] = '\0';
> +		seq->count = string_unescape_inplace(seq->buf, UNESCAPE_OCTAL);
> +	}
> +	return err;
> +}
> +
> +static int stmt_mnt_point(struct stmt_state *s)
> +{
> +	struct vfsmount *mnt = s->mnt;
> +	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
> +	int err = seq_path_root(&s->seq, &mnt_path, &s->root, "");
> +
> +	return err == SEQ_SKIP ? 0 : err;
> +}
> +
> +static int stmt_fs_type(struct stmt_state *s)
> +{
> +	struct seq_file *seq = &s->seq;
> +	struct super_block *sb = s->mnt->mnt_sb;
> +
> +	seq_puts(seq, sb->s_type->name);
> +	return 0;
> +}
> +
> +static int do_statmount(struct stmt_state *s)
> +{
> +	struct statmnt *sm = &s->sm;
> +	struct mount *m = real_mount(s->mnt);
> +	size_t copysize = min_t(size_t, s->bufsize, sizeof(*sm));
> +	int err;
> +
> +	err = security_sb_statfs(s->mnt->mnt_root);
> +	if (err)
> +		return err;
> +
> +	if (!capable(CAP_SYS_ADMIN) &&
> +	    !is_path_reachable(m, m->mnt.mnt_root, &s->root))
> +		return -EPERM;
> +
> +	stmt_numeric(s, STMT_SB_BASIC, stmt_sb_basic);
> +	stmt_numeric(s, STMT_MNT_BASIC, stmt_mnt_basic);
> +	stmt_numeric(s, STMT_PROPAGATE_FROM, stmt_propagate_from);
> +	stmt_string(s, STMT_FS_TYPE, stmt_fs_type, &sm->fs_type);
> +	stmt_string(s, STMT_MNT_ROOT, stmt_mnt_root, &sm->mnt_root);
> +	stmt_string(s, STMT_MNT_POINT, stmt_mnt_point, &sm->mnt_point);
> +
> +	if (s->err)
> +		return s->err;
> +
> +	/* Return the number of bytes copied to the buffer */
> +	sm->size = copysize + s->pos;
> +
> +	if (copy_to_user(s->buf, sm, copysize))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
> +		struct statmnt __user *, buf, size_t, bufsize,
> +		unsigned int, flags)
> +{
> +	struct vfsmount *mnt;
> +	struct __mount_arg kreq;
> +	int ret;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&kreq, req, sizeof(kreq)))
> +		return -EFAULT;
> +
> +	down_read(&namespace_sem);
> +	mnt = lookup_mnt_in_ns(kreq.mnt_id, current->nsproxy->mnt_ns);
> +	ret = -ENOENT;
> +	if (mnt) {
> +		struct stmt_state s = {
> +			.mask = kreq.request_mask,
> +			.buf = buf,
> +			.bufsize = bufsize,
> +			.mnt = mnt,
> +		};
> +
> +		get_fs_root(current->fs, &s.root);
> +		ret = do_statmount(&s);
> +		path_put(&s.root);
> +	}
> +	up_read(&namespace_sem);
> +
> +	return ret;
> +}
> +
>   static void __init init_mount_tree(void)
>   {
>   	struct vfsmount *mnt;
> diff --git a/fs/statfs.c b/fs/statfs.c
> index 96d1c3edf289..cc774c2e2c9a 100644
> --- a/fs/statfs.c
> +++ b/fs/statfs.c
> @@ -9,6 +9,7 @@
>   #include <linux/security.h>
>   #include <linux/uaccess.h>
>   #include <linux/compat.h>
> +#include <uapi/linux/mount.h>
>   #include "internal.h"
>   
>   static int flags_by_mnt(int mnt_flags)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 22bc6bc147f8..ba371024d902 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -74,6 +74,8 @@ struct landlock_ruleset_attr;
>   enum landlock_rule_type;
>   struct cachestat_range;
>   struct cachestat;
> +struct statmnt;
> +struct __mount_arg;
>   
>   #include <linux/types.h>
>   #include <linux/aio_abi.h>
> @@ -408,6 +410,9 @@ asmlinkage long sys_statfs64(const char __user *path, size_t sz,
>   asmlinkage long sys_fstatfs(unsigned int fd, struct statfs __user *buf);
>   asmlinkage long sys_fstatfs64(unsigned int fd, size_t sz,
>   				struct statfs64 __user *buf);
> +asmlinkage long sys_statmount(const struct __mount_arg __user *req,
> +			      struct statmnt __user *buf, size_t bufsize,
> +			      unsigned int flags);
>   asmlinkage long sys_truncate(const char __user *path, long length);
>   asmlinkage long sys_ftruncate(unsigned int fd, unsigned long length);
>   #if BITS_PER_LONG == 32
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index abe087c53b4b..8f034e934a2e 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -823,8 +823,11 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
>   #define __NR_fchmodat2 452
>   __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
>   
> +#define __NR_statmount   454
> +__SYSCALL(__NR_statmount, sys_statmount)
> +
>   #undef __NR_syscalls
> -#define __NR_syscalls 453
> +#define __NR_syscalls 455
>   
>   /*
>    * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index bb242fdcfe6b..d2c988ab526b 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -138,4 +138,60 @@ struct mount_attr {
>   /* List of all mount_attr versions. */
>   #define MOUNT_ATTR_SIZE_VER0	32 /* sizeof first published struct */
>   
> +
> +/*
> + * Structure for getting mount/superblock/filesystem info with statmount(2).
> + *
> + * The interface is similar to statx(2): individual fields or groups can be
> + * selected with the @mask argument of statmount().  Kernel will set the @mask
> + * field according to the supported fields.
> + *
> + * If string fields are selected, then the caller needs to pass a buffer that
> + * has space after the fixed part of the structure.  Nul terminated strings are
> + * copied there and offsets relative to @str are stored in the relevant fields.
> + * If the buffer is too small, then EOVERFLOW is returned.  The actually used
> + * size is returned in @size.
> + */
> +struct statmnt {
> +	__u32 size;		/* Total size, including strings */
> +	__u32 __spare1;
> +	__u64 mask;		/* What results were written */
> +	__u32 sb_dev_major;	/* Device ID */
> +	__u32 sb_dev_minor;
> +	__u64 sb_magic;		/* ..._SUPER_MAGIC */
> +	__u32 sb_flags;		/* MS_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
> +	__u32 fs_type;		/* [str] Filesystem type */
> +	__u64 mnt_id;		/* Unique ID of mount */
> +	__u64 mnt_parent_id;	/* Unique ID of parent (for root == mnt_id) */
> +	__u32 mnt_id_old;	/* Reused IDs used in proc/.../mountinfo */
> +	__u32 mnt_parent_id_old;
> +	__u64 mnt_attr;		/* MOUNT_ATTR_... */
> +	__u64 mnt_propagation;	/* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
> +	__u64 mnt_peer_group;	/* ID of shared peer group */
> +	__u64 mnt_master;	/* Mount receives propagation from this ID */
> +	__u64 propagate_from;	/* Propagation from in current namespace */
> +	__u32 mnt_root;		/* [str] Root of mount relative to root of fs */
> +	__u32 mnt_point;	/* [str] Mountpoint relative to current root */
> +	__u64 __spare2[50];
> +	char str[];		/* Variable size part containing strings */
> +};
> +
> +/*
> + * To be used on the kernel ABI only for passing 64bit arguments to statmount(2)
> + */
> +struct __mount_arg {
> +	__u64 mnt_id;
> +	__u64 request_mask;
> +};
> +
> +/*
> + * @mask bits for statmount(2)
> + */
> +#define STMT_SB_BASIC		0x00000001U     /* Want/got sb_... */
> +#define STMT_MNT_BASIC		0x00000002U	/* Want/got mnt_... */
> +#define STMT_PROPAGATE_FROM	0x00000004U	/* Want/got propagate_from */
> +#define STMT_MNT_ROOT		0x00000008U	/* Want/got mnt_root  */
> +#define STMT_MNT_POINT		0x00000010U	/* Want/got mnt_point */
> +#define STMT_FS_TYPE		0x00000020U	/* Want/got fs_type */
> +
>   #endif /* _UAPI_LINUX_MOUNT_H */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/4] add statmount(2) syscall
  2023-09-29  0:42   ` Ian Kent
@ 2023-09-29  9:10     ` Miklos Szeredi
  2023-09-30  1:16       ` Ian Kent
  0 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2023-09-29  9:10 UTC (permalink / raw
  To: Ian Kent
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, David Howells, Linus Torvalds,
	Al Viro, Christian Brauner, Amir Goldstein, Matthew House,
	Florian Weimer, Arnd Bergmann

On Fri, 29 Sept 2023 at 02:42, Ian Kent <raven@themaw.net> wrote:
>
> On 28/9/23 21:01, Miklos Szeredi wrote:

> > +static struct vfsmount *lookup_mnt_in_ns(u64 id, struct mnt_namespace *ns)
> > +{
> > +     struct mount *mnt;
> > +     struct vfsmount *res = NULL;
> > +
> > +     lock_ns_list(ns);
> > +     list_for_each_entry(mnt, &ns->list, mnt_list) {
> > +             if (!mnt_is_cursor(mnt) && id == mnt->mnt_id_unique) {
> > +                     res = &mnt->mnt;
> > +                     break;
> > +             }
> > +     }
> > +     unlock_ns_list(ns);
> > +     return res;
> > +}
>
> Seems like we might need to consider making (struct mnt_namespace)->list
>
> a hashed list.

Yes, linear search needs to go.  A hash table is probably the easiest solution.

But I'd also consider replacing ns->list with an rbtree.  Not as
trivial as adding a system hash table and probably also slightly
slower, but it would have some advantages:

 - most space efficient (no overhead of hash buckets)

 - cursor can go away (f_pos can just contain last ID)

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/4] add statmount(2) syscall
  2023-09-29  9:10     ` Miklos Szeredi
@ 2023-09-30  1:16       ` Ian Kent
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Kent @ 2023-09-30  1:16 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, David Howells, Linus Torvalds,
	Al Viro, Christian Brauner, Amir Goldstein, Matthew House,
	Florian Weimer, Arnd Bergmann

On 29/9/23 17:10, Miklos Szeredi wrote:
> On Fri, 29 Sept 2023 at 02:42, Ian Kent <raven@themaw.net> wrote:
>> On 28/9/23 21:01, Miklos Szeredi wrote:
>>> +static struct vfsmount *lookup_mnt_in_ns(u64 id, struct mnt_namespace *ns)
>>> +{
>>> +     struct mount *mnt;
>>> +     struct vfsmount *res = NULL;
>>> +
>>> +     lock_ns_list(ns);
>>> +     list_for_each_entry(mnt, &ns->list, mnt_list) {
>>> +             if (!mnt_is_cursor(mnt) && id == mnt->mnt_id_unique) {
>>> +                     res = &mnt->mnt;
>>> +                     break;
>>> +             }
>>> +     }
>>> +     unlock_ns_list(ns);
>>> +     return res;
>>> +}
>> Seems like we might need to consider making (struct mnt_namespace)->list
>>
>> a hashed list.
> Yes, linear search needs to go.  A hash table is probably the easiest solution.
>
> But I'd also consider replacing ns->list with an rbtree.  Not as
> trivial as adding a system hash table and probably also slightly
> slower, but it would have some advantages:
>
>   - most space efficient (no overhead of hash buckets)
>
>   - cursor can go away (f_pos can just contain last ID)

I guess that would be ok.

Avoiding the cursor is a big plus.


An rbtree is used in kernfs and its readdir function is rather painful so

I wonder what the implications might be for other enumeration needs.


Ian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/4] add statmount(2) syscall
  2023-09-28 13:01 ` [PATCH v3 3/4] add statmount(2) syscall Miklos Szeredi
  2023-09-29  0:42   ` Ian Kent
@ 2023-10-04 19:26   ` Paul Moore
  1 sibling, 0 replies; 22+ messages in thread
From: Paul Moore @ 2023-10-04 19:26 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, Ian Kent, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann

On Thu, Sep 28, 2023 at 9:03 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> Add a way to query attributes of a single mount instead of having to parse
> the complete /proc/$PID/mountinfo, which might be huge.
>
> Lookup the mount the new 64bit mount ID.  If a mount needs to be queried
> based on path, then statx(2) can be used to first query the mount ID
> belonging to the path.
>
> Design is based on a suggestion by Linus:
>
>   "So I'd suggest something that is very much like "statfsat()", which gets
>    a buffer and a length, and returns an extended "struct statfs" *AND*
>    just a string description at the end."
>
> The interface closely mimics that of statx.
>
> Handle ASCII attributes by appending after the end of the structure (as per
> above suggestion).  Pointers to strings are stored in u64 members to make
> the structure the same regardless of pointer size.  Strings are nul
> terminated.
>
> Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/namespace.c                         | 283 +++++++++++++++++++++++++
>  fs/statfs.c                            |   1 +
>  include/linux/syscalls.h               |   5 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/mount.h             |  56 +++++
>  7 files changed, 351 insertions(+), 1 deletion(-)

...

> diff --git a/fs/namespace.c b/fs/namespace.c
> index c3a41200fe70..3326ba2b2810 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c

...

> +static int do_statmount(struct stmt_state *s)
> +{
> +       struct statmnt *sm = &s->sm;
> +       struct mount *m = real_mount(s->mnt);
> +       size_t copysize = min_t(size_t, s->bufsize, sizeof(*sm));
> +       int err;
> +
> +       err = security_sb_statfs(s->mnt->mnt_root);
> +       if (err)
> +               return err;

Thank you for adding the security_sb_statfs() call to this operation,
however I believe we want to place it *after* the capability check to
be consistent with other LSM calls.

> +       if (!capable(CAP_SYS_ADMIN) &&
> +           !is_path_reachable(m, m->mnt.mnt_root, &s->root))
> +               return -EPERM;
> +
> +       stmt_numeric(s, STMT_SB_BASIC, stmt_sb_basic);
> +       stmt_numeric(s, STMT_MNT_BASIC, stmt_mnt_basic);
> +       stmt_numeric(s, STMT_PROPAGATE_FROM, stmt_propagate_from);
> +       stmt_string(s, STMT_FS_TYPE, stmt_fs_type, &sm->fs_type);
> +       stmt_string(s, STMT_MNT_ROOT, stmt_mnt_root, &sm->mnt_root);
> +       stmt_string(s, STMT_MNT_POINT, stmt_mnt_point, &sm->mnt_point);
> +
> +       if (s->err)
> +               return s->err;
> +
> +       /* Return the number of bytes copied to the buffer */
> +       sm->size = copysize + s->pos;
> +
> +       if (copy_to_user(s->buf, sm, copysize))
> +               return -EFAULT;
> +
> +       return 0;
> +}

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 4/4] add listmount(2) syscall
  2023-09-28 13:01 [PATCH v3 0/4] querying mount attributes Miklos Szeredi
                   ` (2 preceding siblings ...)
  2023-09-28 13:01 ` [PATCH v3 3/4] add statmount(2) syscall Miklos Szeredi
@ 2023-09-28 13:01 ` Miklos Szeredi
  2023-10-04 19:37   ` Paul Moore
  3 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2023-09-28 13:01 UTC (permalink / raw
  To: linux-fsdevel
  Cc: linux-kernel, linux-api, linux-man, linux-security-module,
	Karel Zak, Ian Kent, David Howells, Linus Torvalds, Al Viro,
	Christian Brauner, Amir Goldstein, Matthew House, Florian Weimer,
	Arnd Bergmann

Add way to query the children of a particular mount.  This is a more
flexible way to iterate the mount tree than having to parse the complete
/proc/self/mountinfo.

Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
based on path, then statx(2) can be used to first query the mount ID
belonging to the path.

Return an array of new (64bit) mount ID's.  Without privileges only mounts
are listed which are reachable from the task's root.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/namespace.c                         | 69 ++++++++++++++++++++++++++
 include/linux/syscalls.h               |  3 ++
 include/uapi/asm-generic/unistd.h      |  5 +-
 include/uapi/linux/mount.h             |  3 ++
 6 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 317b1320ad18..65e0185b47a9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -458,3 +458,4 @@
 451	i386	cachestat		sys_cachestat
 452	i386	fchmodat2		sys_fchmodat2
 454	i386	statmount		sys_statmount
+455	i386	listmount		sys_listmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7312c440978f..a1b3ce7d22cc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -376,6 +376,7 @@
 452	common	fchmodat2		sys_fchmodat2
 453	64	map_shadow_stack	sys_map_shadow_stack
 454	common	statmount		sys_statmount
+455	common	listmount		sys_listmount
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/namespace.c b/fs/namespace.c
index 3326ba2b2810..050e2d2af110 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4970,6 +4970,75 @@ SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
 	return ret;
 }
 
+static long do_listmount(struct vfsmount *mnt, u64 __user *buf, size_t bufsize,
+			 const struct path *root, unsigned int flags)
+{
+	struct mount *r, *m = real_mount(mnt);
+	struct path rootmnt = {
+		.mnt = root->mnt,
+		.dentry = root->mnt->mnt_root
+	};
+	long ctr = 0;
+	bool reachable_only = true;
+	int err;
+
+	err = security_sb_statfs(mnt->mnt_root);
+	if (err)
+		return err;
+
+	if (flags & LISTMOUNT_UNREACHABLE) {
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		reachable_only = false;
+	}
+
+	if (reachable_only && !is_path_reachable(m, mnt->mnt_root, &rootmnt))
+		return capable(CAP_SYS_ADMIN) ? 0 : -EPERM;
+
+	list_for_each_entry(r, &m->mnt_mounts, mnt_child) {
+		if (reachable_only &&
+		    !is_path_reachable(r, r->mnt.mnt_root, root))
+			continue;
+
+		if (ctr >= bufsize)
+			return -EOVERFLOW;
+		if (put_user(r->mnt_id_unique, buf + ctr))
+			return -EFAULT;
+		ctr++;
+		if (ctr < 0)
+			return -ERANGE;
+	}
+	return ctr;
+}
+
+SYSCALL_DEFINE4(listmount, const struct __mount_arg __user *, req,
+		u64 __user *, buf, size_t, bufsize, unsigned int, flags)
+{
+	struct __mount_arg kreq;
+	struct vfsmount *mnt;
+	struct path root;
+	long err;
+
+	if (flags & ~LISTMOUNT_UNREACHABLE)
+		return -EINVAL;
+
+	if (copy_from_user(&kreq, req, sizeof(kreq)))
+		return -EFAULT;
+
+	down_read(&namespace_sem);
+	mnt = lookup_mnt_in_ns(kreq.mnt_id, current->nsproxy->mnt_ns);
+	err = -ENOENT;
+	if (mnt) {
+		get_fs_root(current->fs, &root);
+		err = do_listmount(mnt, buf, bufsize, &root, flags);
+		path_put(&root);
+	}
+	up_read(&namespace_sem);
+
+	return err;
+}
+
+
 static void __init init_mount_tree(void)
 {
 	struct vfsmount *mnt;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ba371024d902..38f3da7e04d1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -413,6 +413,9 @@ asmlinkage long sys_fstatfs64(unsigned int fd, size_t sz,
 asmlinkage long sys_statmount(const struct __mount_arg __user *req,
 			      struct statmnt __user *buf, size_t bufsize,
 			      unsigned int flags);
+asmlinkage long sys_listmount(const struct __mount_arg __user *req,
+			      u64 __user *buf, size_t bufsize,
+			      unsigned int flags);
 asmlinkage long sys_truncate(const char __user *path, long length);
 asmlinkage long sys_ftruncate(unsigned int fd, unsigned long length);
 #if BITS_PER_LONG == 32
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 8f034e934a2e..8df6a747e21a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -826,8 +826,11 @@ __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
 #define __NR_statmount   454
 __SYSCALL(__NR_statmount, sys_statmount)
 
+#define __NR_listmount   455
+__SYSCALL(__NR_listmount, sys_listmount)
+
 #undef __NR_syscalls
-#define __NR_syscalls 455
+#define __NR_syscalls 456
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index d2c988ab526b..7aa9916659d2 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -194,4 +194,7 @@ struct __mount_arg {
 #define STMT_MNT_POINT		0x00000010U	/* Want/got mnt_point */
 #define STMT_FS_TYPE		0x00000020U	/* Want/got fs_type */
 
+/* listmount(2) flags */
+#define LISTMOUNT_UNREACHABLE	0x01	/* List unreachable mounts too */
+
 #endif /* _UAPI_LINUX_MOUNT_H */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-09-28 13:01 ` [PATCH v3 4/4] add listmount(2) syscall Miklos Szeredi
@ 2023-10-04 19:37   ` Paul Moore
  2023-10-05  4:01     ` Miklos Szeredi
  0 siblings, 1 reply; 22+ messages in thread
From: Paul Moore @ 2023-10-04 19:37 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, Ian Kent, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann

On Thu, Sep 28, 2023 at 9:04 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> Add way to query the children of a particular mount.  This is a more
> flexible way to iterate the mount tree than having to parse the complete
> /proc/self/mountinfo.
>
> Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
> based on path, then statx(2) can be used to first query the mount ID
> belonging to the path.
>
> Return an array of new (64bit) mount ID's.  Without privileges only mounts
> are listed which are reachable from the task's root.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  fs/namespace.c                         | 69 ++++++++++++++++++++++++++
>  include/linux/syscalls.h               |  3 ++
>  include/uapi/asm-generic/unistd.h      |  5 +-
>  include/uapi/linux/mount.h             |  3 ++
>  6 files changed, 81 insertions(+), 1 deletion(-)

...

> diff --git a/fs/namespace.c b/fs/namespace.c
> index 3326ba2b2810..050e2d2af110 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -4970,6 +4970,75 @@ SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
>         return ret;
>  }
>
> +static long do_listmount(struct vfsmount *mnt, u64 __user *buf, size_t bufsize,
> +                        const struct path *root, unsigned int flags)
> +{
> +       struct mount *r, *m = real_mount(mnt);
> +       struct path rootmnt = {
> +               .mnt = root->mnt,
> +               .dentry = root->mnt->mnt_root
> +       };
> +       long ctr = 0;
> +       bool reachable_only = true;
> +       int err;
> +
> +       err = security_sb_statfs(mnt->mnt_root);
> +       if (err)
> +               return err;
> +
> +       if (flags & LISTMOUNT_UNREACHABLE) {
> +               if (!capable(CAP_SYS_ADMIN))
> +                       return -EPERM;
> +               reachable_only = false;
> +       }
> +
> +       if (reachable_only && !is_path_reachable(m, mnt->mnt_root, &rootmnt))
> +               return capable(CAP_SYS_ADMIN) ? 0 : -EPERM;
> +
> +       list_for_each_entry(r, &m->mnt_mounts, mnt_child) {
> +               if (reachable_only &&
> +                   !is_path_reachable(r, r->mnt.mnt_root, root))
> +                       continue;

I believe we would want to move the security_sb_statfs() call from
above to down here; something like this I think ...

  err = security_sb_statfs(r->mnt.mnt_root);
  if (err)
    /* if we can't access the mount, pretend it doesn't exist */
    continue;

> +               if (ctr >= bufsize)
> +                       return -EOVERFLOW;
> +               if (put_user(r->mnt_id_unique, buf + ctr))
> +                       return -EFAULT;
> +               ctr++;
> +               if (ctr < 0)
> +                       return -ERANGE;
> +       }
> +       return ctr;
> +}

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-04 19:37   ` Paul Moore
@ 2023-10-05  4:01     ` Miklos Szeredi
  2023-10-05  4:23       ` Ian Kent
  0 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2023-10-05  4:01 UTC (permalink / raw
  To: Paul Moore
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, Ian Kent, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann

On Wed, 4 Oct 2023 at 21:38, Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, Sep 28, 2023 at 9:04 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > Add way to query the children of a particular mount.  This is a more
> > flexible way to iterate the mount tree than having to parse the complete
> > /proc/self/mountinfo.
> >
> > Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
> > based on path, then statx(2) can be used to first query the mount ID
> > belonging to the path.
> >
> > Return an array of new (64bit) mount ID's.  Without privileges only mounts
> > are listed which are reachable from the task's root.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  fs/namespace.c                         | 69 ++++++++++++++++++++++++++
> >  include/linux/syscalls.h               |  3 ++
> >  include/uapi/asm-generic/unistd.h      |  5 +-
> >  include/uapi/linux/mount.h             |  3 ++
> >  6 files changed, 81 insertions(+), 1 deletion(-)
>
> ...
>
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 3326ba2b2810..050e2d2af110 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -4970,6 +4970,75 @@ SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
> >         return ret;
> >  }
> >
> > +static long do_listmount(struct vfsmount *mnt, u64 __user *buf, size_t bufsize,
> > +                        const struct path *root, unsigned int flags)
> > +{
> > +       struct mount *r, *m = real_mount(mnt);
> > +       struct path rootmnt = {
> > +               .mnt = root->mnt,
> > +               .dentry = root->mnt->mnt_root
> > +       };
> > +       long ctr = 0;
> > +       bool reachable_only = true;
> > +       int err;
> > +
> > +       err = security_sb_statfs(mnt->mnt_root);
> > +       if (err)
> > +               return err;
> > +
> > +       if (flags & LISTMOUNT_UNREACHABLE) {
> > +               if (!capable(CAP_SYS_ADMIN))
> > +                       return -EPERM;
> > +               reachable_only = false;
> > +       }
> > +
> > +       if (reachable_only && !is_path_reachable(m, mnt->mnt_root, &rootmnt))
> > +               return capable(CAP_SYS_ADMIN) ? 0 : -EPERM;
> > +
> > +       list_for_each_entry(r, &m->mnt_mounts, mnt_child) {
> > +               if (reachable_only &&
> > +                   !is_path_reachable(r, r->mnt.mnt_root, root))
> > +                       continue;
>
> I believe we would want to move the security_sb_statfs() call from
> above to down here; something like this I think ...
>
>   err = security_sb_statfs(r->mnt.mnt_root);
>   if (err)
>     /* if we can't access the mount, pretend it doesn't exist */
>     continue;

Hmm.  Why is this specific to listing mounts (i.e. why doesn't readdir
have a similar filter)?

Also why hasn't this come up with regards to the proc interfaces that
list mounts?

I just want to understand the big picture here.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-05  4:01     ` Miklos Szeredi
@ 2023-10-05  4:23       ` Ian Kent
  2023-10-05 15:47         ` Miklos Szeredi
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Kent @ 2023-10-05  4:23 UTC (permalink / raw
  To: Miklos Szeredi, Paul Moore
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api, linux-man,
	linux-security-module, Karel Zak, David Howells, Linus Torvalds,
	Al Viro, Christian Brauner, Amir Goldstein, Matthew House,
	Florian Weimer, Arnd Bergmann

On 5/10/23 12:01, Miklos Szeredi wrote:
> On Wed, 4 Oct 2023 at 21:38, Paul Moore <paul@paul-moore.com> wrote:
>> On Thu, Sep 28, 2023 at 9:04 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>> Add way to query the children of a particular mount.  This is a more
>>> flexible way to iterate the mount tree than having to parse the complete
>>> /proc/self/mountinfo.
>>>
>>> Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
>>> based on path, then statx(2) can be used to first query the mount ID
>>> belonging to the path.
>>>
>>> Return an array of new (64bit) mount ID's.  Without privileges only mounts
>>> are listed which are reachable from the task's root.
>>>
>>> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
>>> ---
>>>   arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>>>   arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>>   fs/namespace.c                         | 69 ++++++++++++++++++++++++++
>>>   include/linux/syscalls.h               |  3 ++
>>>   include/uapi/asm-generic/unistd.h      |  5 +-
>>>   include/uapi/linux/mount.h             |  3 ++
>>>   6 files changed, 81 insertions(+), 1 deletion(-)
>> ...
>>
>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>> index 3326ba2b2810..050e2d2af110 100644
>>> --- a/fs/namespace.c
>>> +++ b/fs/namespace.c
>>> @@ -4970,6 +4970,75 @@ SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
>>>          return ret;
>>>   }
>>>
>>> +static long do_listmount(struct vfsmount *mnt, u64 __user *buf, size_t bufsize,
>>> +                        const struct path *root, unsigned int flags)
>>> +{
>>> +       struct mount *r, *m = real_mount(mnt);
>>> +       struct path rootmnt = {
>>> +               .mnt = root->mnt,
>>> +               .dentry = root->mnt->mnt_root
>>> +       };
>>> +       long ctr = 0;
>>> +       bool reachable_only = true;
>>> +       int err;
>>> +
>>> +       err = security_sb_statfs(mnt->mnt_root);
>>> +       if (err)
>>> +               return err;
>>> +
>>> +       if (flags & LISTMOUNT_UNREACHABLE) {
>>> +               if (!capable(CAP_SYS_ADMIN))
>>> +                       return -EPERM;
>>> +               reachable_only = false;
>>> +       }
>>> +
>>> +       if (reachable_only && !is_path_reachable(m, mnt->mnt_root, &rootmnt))
>>> +               return capable(CAP_SYS_ADMIN) ? 0 : -EPERM;
>>> +
>>> +       list_for_each_entry(r, &m->mnt_mounts, mnt_child) {
>>> +               if (reachable_only &&
>>> +                   !is_path_reachable(r, r->mnt.mnt_root, root))
>>> +                       continue;
>> I believe we would want to move the security_sb_statfs() call from
>> above to down here; something like this I think ...
>>
>>    err = security_sb_statfs(r->mnt.mnt_root);
>>    if (err)
>>      /* if we can't access the mount, pretend it doesn't exist */
>>      continue;
> Hmm.  Why is this specific to listing mounts (i.e. why doesn't readdir
> have a similar filter)?
>
> Also why hasn't this come up with regards to the proc interfaces that
> list mounts?

The proc interfaces essentially use <mount namespace>->list to provide

the mounts that can be seen so it's filtered by mount namespace of the

task that's doing the open().


See fs/namespace.c:mnt_list_next() and just below the m_start(), m_next(),

etc.


Ian

>
> I just want to understand the big picture here.
>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-05  4:23       ` Ian Kent
@ 2023-10-05 15:47         ` Miklos Szeredi
  2023-10-06  0:27           ` Ian Kent
  2023-10-06  2:56           ` Paul Moore
  0 siblings, 2 replies; 22+ messages in thread
From: Miklos Szeredi @ 2023-10-05 15:47 UTC (permalink / raw
  To: Ian Kent
  Cc: Paul Moore, Miklos Szeredi, linux-fsdevel, linux-kernel,
	linux-api, linux-man, linux-security-module, Karel Zak,
	David Howells, Linus Torvalds, Al Viro, Christian Brauner,
	Amir Goldstein, Matthew House, Florian Weimer, Arnd Bergmann

On Thu, 5 Oct 2023 at 06:23, Ian Kent <raven@themaw.net> wrote:

> The proc interfaces essentially use <mount namespace>->list to provide
>
> the mounts that can be seen so it's filtered by mount namespace of the
>
> task that's doing the open().
>
>
> See fs/namespace.c:mnt_list_next() and just below the m_start(), m_next(),

/proc/$PID/mountinfo will list the mount namespace of $PID.  Whether
current task has permission to do so is decided at open time.

listmount() will list the children of the given mount ID.  The mount
ID is looked up in the task's mount namespace, so this cannot be used
to list mounts of other namespaces.  It's a more limited interface.

I sort of understand the reasoning behind calling into a security hook
on entry to statmount() and listmount().  And BTW I also think that if
statmount() and listmount() is limited in this way, then the same
limitation should be applied to the proc interfaces.  But that needs
to be done real carefully because it might cause regressions.  OTOH if
it's only done on the new interfaces, then what is the point, since
the old interfaces will be available indefinitely?

Also I cannot see the point in hiding some mount ID's from the list.
It seems to me that the list is just an array of numbers that in
itself doesn't carry any information.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-05 15:47         ` Miklos Szeredi
@ 2023-10-06  0:27           ` Ian Kent
  2023-10-13  2:39             ` Ian Kent
  2023-10-06  2:56           ` Paul Moore
  1 sibling, 1 reply; 22+ messages in thread
From: Ian Kent @ 2023-10-06  0:27 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: Paul Moore, Miklos Szeredi, linux-fsdevel, linux-kernel,
	linux-api, linux-man, linux-security-module, Karel Zak,
	David Howells, Linus Torvalds, Al Viro, Christian Brauner,
	Amir Goldstein, Matthew House, Florian Weimer, Arnd Bergmann

On 5/10/23 23:47, Miklos Szeredi wrote:
> On Thu, 5 Oct 2023 at 06:23, Ian Kent <raven@themaw.net> wrote:
>
>> The proc interfaces essentially use <mount namespace>->list to provide
>>
>> the mounts that can be seen so it's filtered by mount namespace of the
>>
>> task that's doing the open().
>>
>>
>> See fs/namespace.c:mnt_list_next() and just below the m_start(), m_next(),
> /proc/$PID/mountinfo will list the mount namespace of $PID.  Whether
> current task has permission to do so is decided at open time.
>
> listmount() will list the children of the given mount ID.  The mount
> ID is looked up in the task's mount namespace, so this cannot be used
> to list mounts of other namespaces.  It's a more limited interface.

Yep. But isn't the ability to see these based on task privilege?


Is the proc style restriction actually what we need here (or some variation

of that implementation)?


An privileged task typically has the init namespace as its mount namespace

and mounts should propagate from there so it should be able to see all 
mounts.


If the file handle has been opened in a task that is using some other mount

namespace then presumably that's what the program author wants the task 
to see.

So I'm not sure I see a problem obeying the namespace of a given task.


Ian

>
> I sort of understand the reasoning behind calling into a security hook
> on entry to statmount() and listmount().  And BTW I also think that if
> statmount() and listmount() is limited in this way, then the same
> limitation should be applied to the proc interfaces.  But that needs
> to be done real carefully because it might cause regressions.  OTOH if
> it's only done on the new interfaces, then what is the point, since
> the old interfaces will be available indefinitely?
>
> Also I cannot see the point in hiding some mount ID's from the list.
> It seems to me that the list is just an array of numbers that in
> itself doesn't carry any information.
>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-06  0:27           ` Ian Kent
@ 2023-10-13  2:39             ` Ian Kent
  2023-10-24 13:57               ` Miklos Szeredi
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Kent @ 2023-10-13  2:39 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: Paul Moore, Miklos Szeredi, linux-fsdevel, linux-kernel,
	linux-api, linux-man, linux-security-module, Karel Zak,
	David Howells, Linus Torvalds, Al Viro, Christian Brauner,
	Amir Goldstein, Matthew House, Florian Weimer, Arnd Bergmann

On 6/10/23 08:27, Ian Kent wrote:
> On 5/10/23 23:47, Miklos Szeredi wrote:
>> On Thu, 5 Oct 2023 at 06:23, Ian Kent <raven@themaw.net> wrote:
>>
>>> The proc interfaces essentially use <mount namespace>->list to provide
>>>
>>> the mounts that can be seen so it's filtered by mount namespace of the
>>>
>>> task that's doing the open().
>>>
>>>
>>> See fs/namespace.c:mnt_list_next() and just below the m_start(), 
>>> m_next(),
>> /proc/$PID/mountinfo will list the mount namespace of $PID. Whether
>> current task has permission to do so is decided at open time.
>>
>> listmount() will list the children of the given mount ID.  The mount
>> ID is looked up in the task's mount namespace, so this cannot be used
>> to list mounts of other namespaces.  It's a more limited interface.
>
> Yep. But isn't the ability to see these based on task privilege?
>
>
> Is the proc style restriction actually what we need here (or some 
> variation
>
> of that implementation)?
>
>
> An privileged task typically has the init namespace as its mount 
> namespace
>
> and mounts should propagate from there so it should be able to see all 
> mounts.
>
>
> If the file handle has been opened in a task that is using some other 
> mount
>
> namespace then presumably that's what the program author wants the 
> task to see.
>
> So I'm not sure I see a problem obeying the namespace of a given task.

I've had a look through the code we had in the old fsinfo() proposal

because I think we need to consider the use cases that are needed.


IIRC initially we had a flag FSINFO_ATTR_MOUNT_CHILDREN that essentially

enumerated the children of the given mount in much the same way as is

done now in this system call.


But because we needed to enumerate mounts in the same way as the proc file

system mount tables a flag FSINFO_ATTR_MOUNT_ALL was added that essentially

used the mount namespace mounts list in a similar way to the proc file

system so that a list of mounts for a mount namespace could be retrieved.


This later use case is what is used by processes that monitor mounts and

is what's needed more so than enumerating the children as we do now.


I'm still looking at the mount id lookup.


Ian

>
>
> Ian
>
>>
>> I sort of understand the reasoning behind calling into a security hook
>> on entry to statmount() and listmount().  And BTW I also think that if
>> statmount() and listmount() is limited in this way, then the same
>> limitation should be applied to the proc interfaces.  But that needs
>> to be done real carefully because it might cause regressions. OTOH if
>> it's only done on the new interfaces, then what is the point, since
>> the old interfaces will be available indefinitely?
>>
>> Also I cannot see the point in hiding some mount ID's from the list.
>> It seems to me that the list is just an array of numbers that in
>> itself doesn't carry any information.
>>
>> Thanks,
>> Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-13  2:39             ` Ian Kent
@ 2023-10-24 13:57               ` Miklos Szeredi
  0 siblings, 0 replies; 22+ messages in thread
From: Miklos Szeredi @ 2023-10-24 13:57 UTC (permalink / raw
  To: Ian Kent
  Cc: Paul Moore, Miklos Szeredi, linux-fsdevel, linux-kernel,
	linux-api, linux-man, linux-security-module, Karel Zak,
	David Howells, Linus Torvalds, Al Viro, Christian Brauner,
	Amir Goldstein, Matthew House, Florian Weimer, Arnd Bergmann

On Fri, 13 Oct 2023 at 04:40, Ian Kent <raven@themaw.net> wrote:

> But because we needed to enumerate mounts in the same way as the proc file
>
> system mount tables a flag FSINFO_ATTR_MOUNT_ALL was added that essentially
>
> used the mount namespace mounts list in a similar way to the proc file
>
> system so that a list of mounts for a mount namespace could be retrieved.

Makes sense.

Added a LISTMOUNT_RECURSIVE flag.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-05 15:47         ` Miklos Szeredi
  2023-10-06  0:27           ` Ian Kent
@ 2023-10-06  2:56           ` Paul Moore
  2023-10-06  8:53             ` Miklos Szeredi
  1 sibling, 1 reply; 22+ messages in thread
From: Paul Moore @ 2023-10-06  2:56 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: Ian Kent, Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api,
	linux-man, linux-security-module, Karel Zak, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann

On Thu, Oct 5, 2023 at 11:47 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, 5 Oct 2023 at 06:23, Ian Kent <raven@themaw.net> wrote:
> > The proc interfaces essentially use <mount namespace>->list to provide
> >
> > the mounts that can be seen so it's filtered by mount namespace of the
> >
> > task that's doing the open().
> >
> >
> > See fs/namespace.c:mnt_list_next() and just below the m_start(), m_next(),
>
> /proc/$PID/mountinfo will list the mount namespace of $PID.  Whether
> current task has permission to do so is decided at open time.
>
> listmount() will list the children of the given mount ID.  The mount
> ID is looked up in the task's mount namespace, so this cannot be used
> to list mounts of other namespaces.  It's a more limited interface.
>
> I sort of understand the reasoning behind calling into a security hook
> on entry to statmount() and listmount().  And BTW I also think that if
> statmount() and listmount() is limited in this way, then the same
> limitation should be applied to the proc interfaces.  But that needs
> to be done real carefully because it might cause regressions.  OTOH if
> it's only done on the new interfaces, then what is the point, since
> the old interfaces will be available indefinitely?

LSMs that are designed to enforce access controls on procfs interfaces
typically leverage the fact that the procfs interfaces are file based
and the normal file I/O access controls can be used.  In some cases,
e.g. /proc/self/attr, there may also be additional access controls
implemented via a dedicated set of LSM hooks.

> Also I cannot see the point in hiding some mount ID's from the list.
> It seems to me that the list is just an array of numbers that in
> itself doesn't carry any information.

I think it really comes down to the significance of the mount ID, and
I can't say I know enough of the details here to be entirely
comfortable taking a hard stance on this.  Can you help me understand
the mount ID concept a bit better?

While I'm reasonably confident that we want a security_sb_statfs()
control point in statmount(), it may turn out that we don't want/need
a call in the listmount() case.  Perhaps your original patch was
correct in the sense that we only want a single security_sb_statfs()
call for the root (implying that the child mount IDs are attributes of
the root/parent mount)?  Maybe it's something else entirely?

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-06  2:56           ` Paul Moore
@ 2023-10-06  8:53             ` Miklos Szeredi
  2023-10-06 23:07               ` Paul Moore
  0 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2023-10-06  8:53 UTC (permalink / raw
  To: Paul Moore
  Cc: Ian Kent, Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api,
	linux-man, linux-security-module, Karel Zak, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann

On Fri, 6 Oct 2023 at 04:56, Paul Moore <paul@paul-moore.com> wrote:

> > Also I cannot see the point in hiding some mount ID's from the list.
> > It seems to me that the list is just an array of numbers that in
> > itself doesn't carry any information.
>
> I think it really comes down to the significance of the mount ID, and
> I can't say I know enough of the details here to be entirely
> comfortable taking a hard stance on this.  Can you help me understand
> the mount ID concept a bit better?

Mount ID is a descriptor that allows referring to a specific struct
mount from userspace.

The old 32 bit mount id is allocated with IDA from a global pool.
Because it's non-referencing it doesn't allow uniquely identifying a
mount.  That was a design mistake that I made back in 2008, thinking
that the same sort of dense descriptor space as used for file
descriptors would work.  Originally it was used to identify the mount
and the parent mount in /proc/PID/mountinfo.  Later it was also added
to the following interfaces:

 - name_to_handle_at(2) returns 32 bit value
 - /proc/PID/FD/fdinfo
 - statx(2) returns 64 bit value

It was never used on the kernel interfaces as an input argument.

statmount(2) and listmount(2) require the mount to be identified by
userspace, so having a unique ID is important.  So the "[1/4] add
unique mount ID" adds a new 64 bit ID (still global) that is allocated
sequentially and only reused after reboot.   It is used as an input to
these syscalls.  It is returned by statx(2) if requested by
STATX_MNT_ID_UNIQUE and as an array of ID's by listmount(2).

I can see mild security problems with the global allocation, since a
task can observe mounts being done in other namespaces.  This doesn't
sound too serious, and the old ID has similar issues.  But I think
making the new ID be local to the mount namespace is also feasible.

> While I'm reasonably confident that we want a security_sb_statfs()
> control point in statmount(), it may turn out that we don't want/need
> a call in the listmount() case.  Perhaps your original patch was
> correct in the sense that we only want a single security_sb_statfs()
> call for the root (implying that the child mount IDs are attributes of
> the root/parent mount)?  Maybe it's something else entirely?

Mounts are arranged in a tree (I think it obvious how) and
listmount(2) just lists the IDs of the immediate children of a mount.

I don't see ID being an attribute of a mount, it's a descriptor.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] add listmount(2) syscall
  2023-10-06  8:53             ` Miklos Szeredi
@ 2023-10-06 23:07               ` Paul Moore
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Moore @ 2023-10-06 23:07 UTC (permalink / raw
  To: Miklos Szeredi
  Cc: Ian Kent, Miklos Szeredi, linux-fsdevel, linux-kernel, linux-api,
	linux-man, linux-security-module, Karel Zak, David Howells,
	Linus Torvalds, Al Viro, Christian Brauner, Amir Goldstein,
	Matthew House, Florian Weimer, Arnd Bergmann

On Fri, Oct 6, 2023 at 4:53 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, 6 Oct 2023 at 04:56, Paul Moore <paul@paul-moore.com> wrote:
>
> > > Also I cannot see the point in hiding some mount ID's from the list.
> > > It seems to me that the list is just an array of numbers that in
> > > itself doesn't carry any information.
> >
> > I think it really comes down to the significance of the mount ID, and
> > I can't say I know enough of the details here to be entirely
> > comfortable taking a hard stance on this.  Can you help me understand
> > the mount ID concept a bit better?
>
> Mount ID is a descriptor that allows referring to a specific struct
> mount from userspace.
>
> The old 32 bit mount id is allocated with IDA from a global pool.
> Because it's non-referencing it doesn't allow uniquely identifying a
> mount.  That was a design mistake that I made back in 2008, thinking
> that the same sort of dense descriptor space as used for file
> descriptors would work.  Originally it was used to identify the mount
> and the parent mount in /proc/PID/mountinfo.  Later it was also added
> to the following interfaces:
>
>  - name_to_handle_at(2) returns 32 bit value
>  - /proc/PID/FD/fdinfo
>  - statx(2) returns 64 bit value
>
> It was never used on the kernel interfaces as an input argument.

Thanks for the background.

> statmount(2) and listmount(2) require the mount to be identified by
> userspace, so having a unique ID is important.  So the "[1/4] add
> unique mount ID" adds a new 64 bit ID (still global) that is allocated
> sequentially and only reused after reboot.   It is used as an input to
> these syscalls.  It is returned by statx(2) if requested by
> STATX_MNT_ID_UNIQUE and as an array of ID's by listmount(2).
>
> I can see mild security problems with the global allocation, since a
> task can observe mounts being done in other namespaces.  This doesn't
> sound too serious, and the old ID has similar issues.  But I think
> making the new ID be local to the mount namespace is also feasible.

The LSM hook API is designed to operate independently from any of the
kernel namespaces; while some LSMs may choose to be aware of
namespaces and adjust their controls accordingly, there is no
requirement that they do so.  For that reason, I'm not too bothered
either way if the mount ID is global or tied to a namespace.

> > While I'm reasonably confident that we want a security_sb_statfs()
> > control point in statmount(), it may turn out that we don't want/need
> > a call in the listmount() case.  Perhaps your original patch was
> > correct in the sense that we only want a single security_sb_statfs()
> > call for the root (implying that the child mount IDs are attributes of
> > the root/parent mount)?  Maybe it's something else entirely?
>
> Mounts are arranged in a tree (I think it obvious how) and
> listmount(2) just lists the IDs of the immediate children of a mount.
>
> I don't see ID being an attribute of a mount, it's a descriptor.

In this case I think the approach you took originally in this thread
is likely what we want, call security_sb_statfs() against the root
mount in listmount().  Just please move it after the capability
checks.

If you look at the two LSMs which implement the security_sb_statfs(),
Smack and SELinux, you see that Smack treats this as a read operation
between the current process and the specified mount and SELinux treats
this as a filesystem:getattr operations between the current process
and the specified mount.  In both cases I can see that being the right
approach for reading a list of child mounts off of a root mount.

Does that sound good?  I'm guessing that's okay since that was how you
wrote it in your original patch, but there has been a lot of
discussion since then :)

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-10-24 13:58 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-28 13:01 [PATCH v3 0/4] querying mount attributes Miklos Szeredi
2023-09-28 13:01 ` [PATCH v3 1/4] add unique mount ID Miklos Szeredi
2023-10-05 15:52   ` Miklos Szeredi
2023-10-06 11:44     ` Amir Goldstein
2023-10-06 12:48       ` Miklos Szeredi
2023-09-28 13:01 ` [PATCH v3 2/4] namespace: extract show_path() helper Miklos Szeredi
2023-09-28 13:01 ` [PATCH v3 3/4] add statmount(2) syscall Miklos Szeredi
2023-09-29  0:42   ` Ian Kent
2023-09-29  9:10     ` Miklos Szeredi
2023-09-30  1:16       ` Ian Kent
2023-10-04 19:26   ` Paul Moore
2023-09-28 13:01 ` [PATCH v3 4/4] add listmount(2) syscall Miklos Szeredi
2023-10-04 19:37   ` Paul Moore
2023-10-05  4:01     ` Miklos Szeredi
2023-10-05  4:23       ` Ian Kent
2023-10-05 15:47         ` Miklos Szeredi
2023-10-06  0:27           ` Ian Kent
2023-10-13  2:39             ` Ian Kent
2023-10-24 13:57               ` Miklos Szeredi
2023-10-06  2:56           ` Paul Moore
2023-10-06  8:53             ` Miklos Szeredi
2023-10-06 23:07               ` Paul Moore

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).