* [RFC PATCH 0/3] mm/mempolicy: set/get_mempolicy2
@ 2023-09-14 23:54 Gregory Price
  2023-09-14 23:54 ` [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Gregory Price @ 2023-09-14 23:54 UTC (permalink / raw
  To: linux-mm
  Cc: linux-kernel, linux-arch, linux-api, linux-cxl, luto, tglx, mingo,
	bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

This patch set is a proposal for set_mempolicy2 and get_mempolicy2
system calls.  This is an extension to the existing mempolicy
syscalls that allows for a more extensible mempolicy interface and
new, complex memory policies.

This RFC is broken into 3 patches for discussion:

  1) A refactor of do_set_mempolicy that allows code reuse for
     the new syscalls and centralizes the mempolicy swap code.

  2) The implementation of get_mempolicy2 and set_mempolicy2 which
     includes a new uapi type: "struct mempolicy_args" and denotes
     the original mempolicies as "legacy". This allows the existing
     policies to be routed through the original interface.

     (note: only implemented on x86 at this time, though it can be
      hacked into other architectures somewhat trivially)

  3) The implementation of a sample mempolicy ("partial-interleave")
     which was not possible on the old interface.

  x) next planned patches: selftest/ltp test/example programs/etc.
     I wanted to start discussion before I went too deep.


Besides the obvious proposal of extending the mempolicy subsystem for
new policies, the core proposal is the addition of the new uapi type
"struct mempolicy_args". In this proposal, the get and set interfaces
use the same structure, and some fields may be ignored depending on
the requested operation.

This sample implementation of get_mempolicy2 allows for the retrieval
of all information that would previously have required multiple calls
to get_mempolicy, and implements an area for per-policy information.

The multiple err fields would allow for continuation of information
retrieval should one or more failures occur (though notably this is
probably not defensible, and should probably just error out - mostly
a debugging interface for now).

This allows for future extensibility, and would avoid the need for
additional syscalls in the future, so long as the args structure
is versioned or checked based on size.

struct mempolicy_args {
  int err;
  unsigned short mode;
  unsigned long *nodemask;
  unsigned long maxnode;
  unsigned short flags;
  struct {
    /* Memory allowed */
    struct {
      int err;
      unsigned long maxnode;
      unsigned long *nodemask;
    } allowed;
    /* Address information */
    struct {
      int err;
      unsigned long addr;
      unsigned long node;
      unsigned short mode;
      unsigned short flags;
    } addr;
  } get;
  union {
    /* Interleave */
    struct {
      unsigned long next_node; /* get only */
    } interleave;
    /* Partial Interleave */
    struct {
      unsigned long interval;  /* get and set */
      unsigned long next_node; /* get only */
    } part_int;
  };
};
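
For illustration, below is a rough userspace sketch of how the proposed
calls might be invoked.  This is a sketch only: the syscall numbers
(454/455), the size argument, struct mempolicy_args and
MPOL_PARTIAL_INTERLEAVE are all taken from this RFC (patches 2 and 3)
and are subject to change; a kernel and uapi headers patched with this
series are assumed, and error handling is minimal.

#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>  /* patched uapi header from this series */

/* Syscall numbers proposed in patch 2; not allocated upstream */
#define __NR_set_mempolicy2 454
#define __NR_get_mempolicy2 455

int main(void)
{
  struct mempolicy_args args;
  unsigned long nodemask[2] = { 0x7 };  /* nodes 0-2 */

  /* Set a partial-interleave policy (mode added in patch 3) */
  memset(&args, 0, sizeof(args));
  args.mode = MPOL_PARTIAL_INTERLEAVE;
  args.nodemask = nodemask;
  args.maxnode = 128;
  args.part_int.interval = 3;
  if (syscall(__NR_set_mempolicy2, &args, sizeof(args)))
    perror("set_mempolicy2");

  /* Read it back, including the mode-specific information */
  memset(&args, 0, sizeof(args));
  args.nodemask = nodemask;
  args.maxnode = 128;
  if (syscall(__NR_get_mempolicy2, &args, sizeof(args)) == 0)
    printf("mode=%u interval=%lu\n",
           (unsigned int)args.mode, args.part_int.interval);

  return 0;
}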

In the third patch, we implement a sample Partial-Interleave
mempolicy that cannot be implemented with the existing mempolicy
interface, and would otherwise require the exposure of new
interfaces to set the values described below.

We extend the internal mempolicy structure to include a new union
area which can be used to host complex policy data.

Example:
union {
  /* Partial Interleave: Allocate local count, then interleave */
  struct {
    int interval; /* allocation interval at which to interleave */
    int count; /* the current allocation count */
  } part_int;
};


Summary of Partial Interleave:
=============================
nodeset=0,1,2
interval=3
cpunode=0

By default the preferred node (cpunode) is the local node; [interval]
allocations are made on it before each interleave pass occurs.

Over 10 consecutive allocations, the following nodes will be selected:
[0,0,0,1,2,0,0,0,1,2]

In this example, there is a 60%/20%/20% distribution of memory across
the node set.
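
To make the distribution math concrete, here is a small standalone
user-space sketch (plain C, not kernel code) that reproduces the
selection pattern above; it assumes node 0 is the local node as in
the example.

#include <stdio.h>

int main(void)
{
  int nodes[] = { 0, 1, 2 };      /* nodeset=0,1,2; node 0 is local */
  int nr_nodes = 3, interval = 3; /* interval=3, as in the example */
  int counts[3] = { 0 };
  int cycle = interval + (nr_nodes - 1);
  int i;

  for (i = 0; i < 1000; i++) {
    int pos = i % cycle;
    /* first 'interval' slots go to the local node, the rest interleave */
    int nid = (pos < interval) ? nodes[0] : nodes[pos - interval + 1];
    counts[nid]++;
  }
  for (i = 0; i < nr_nodes; i++)
    printf("node %d: %.1f%%\n", i, counts[i] * 100.0 / 1000);
  return 0;
}

Over 1000 allocations this prints 60.0%/20.0%/20.0%, matching the
distribution above.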


Some notes for discussion
=========================
0) Why?

  In the coming age of CXL and many-NUMA-node systems with memory
  hosted on the PCIe bus, it is likely to be beneficial to experiment
  with, and ultimately implement, new allocation-time placement
  policies.

  Presently, much focus is placed on memory-usage monitoring and data
  migration, but these methods steal performance to accomplish what
  could be optimized for up-front.  For example, if maximum memory
  bandwidth is required for an operation, then a statistical
  distribution of memory can be calculated fairly easily based on
  approximate expected memory usage.

  Getting a fair approximation of distribution at allocation time can
  help reduce the migration load required after the fact.  This is the
  intent of the included partial-interleave example, which allows for
  an approximate distribution of memory, where the local node is still
  the preferred location for the majority of memory.

1) Maybe this should be a set of procfs interfaces?

  This would involve adding a /proc/pid/mempolicy interface that
  allows external processes to interrogate and change the
  mempolicy of running processes. This would be a fundamental
  change to the mempolicy subsystem, as (so far as I can tell)
  this is not possible at present.

  Additionally, the policy is per-thread, not per-pid, so this would
  have to work per-thread, i.e. /proc/pid/task/tid/mempolicy.

  I avoided that for this RFC as it seemed more radical than simply
  proposing a set/get_mempolicy2 interface, though technically it
  could be done.

2) Do we need this level of extensibility?

   Presently the ability to dictate allocation-time placement is
   limited to a few primitive mechanisms:
     1) existing mempolicies, and those that can be implemented using
        the existing interface.
     2) numa-aware applications, requiring code changes.
     3) LD_PRELOAD methods, which have compatibility issues.

   For the sake of compatibility, being able to extend numactl to
   include newer, more complex policies would be beneficial.

   While partial-interleave passes a simple interval as an integer,
   more complex policies may want to pass multiple, complex pieces of
   data. For example, a 'statistical-interleave' policy may pass a
   list of integers that dictates exactly how many allocations should
   happen per-node during interleave.  Another policy may take in one
   or more nodemasks and do more complex distributions.
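
As one illustration (not part of this series), a hypothetical
weighted-interleave extension could add another member to the
mode-specific union in struct mempolicy_args without changing the
syscall signature; the 'weighted' member and its fields below are
purely illustrative:

union {
  /* Interleave */
  struct {
    unsigned long next_node; /* get only */
  } interleave;
  /* Partial Interleave */
  struct {
    unsigned long interval;  /* get and set */
    unsigned long next_node; /* get only */
  } part_int;
  /* Hypothetical: per-node allocation weights */
  struct {
    unsigned char *weights;  /* array of maxnode weight entries */
    unsigned long next_node; /* get only */
  } weighted;
};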


Gregory Price (3):
  mm/mempolicy: refactor do_set_mempolicy for code re-use
  mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls
  mm/mempolicy: implement a partial-interleave mempolicy

 arch/x86/entry/syscalls/syscall_32.tbl |   2 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/mempolicy.h              |   8 +
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |  10 +-
 include/uapi/linux/mempolicy.h         |  37 +++
 mm/mempolicy.c                         | 420 +++++++++++++++++++++++--
 7 files changed, 456 insertions(+), 25 deletions(-)

-- 
2.39.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use
  2023-09-14 23:54 [RFC PATCH 0/3] mm/mempolicy: set/get_mempolicy2 Gregory Price
@ 2023-09-14 23:54 ` Gregory Price
  2023-10-02 11:03   ` Jonathan Cameron
  2023-09-14 23:54 ` [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls Gregory Price
  2023-09-14 23:54 ` [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy Gregory Price
  2 siblings, 1 reply; 10+ messages in thread
From: Gregory Price @ 2023-09-14 23:54 UTC (permalink / raw
  To: linux-mm
  Cc: linux-kernel, linux-arch, linux-api, linux-cxl, luto, tglx, mingo,
	bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

Refactors do_set_mempolicy into swap_mempolicy and do_set_mempolicy
so that swap_mempolicy can be re-used with set_mempolicy2.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..f49337f6f300 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -855,28 +855,21 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return vma_replace_policy(vma, new_pol);
 }
 
-/* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
-			     nodemask_t *nodes)
+/* Swap in a new mempolicy, release the old one if successful */
+static long swap_mempolicy(struct mempolicy *new,
+			   nodemask_t *nodes)
 {
-	struct mempolicy *new, *old;
-	NODEMASK_SCRATCH(scratch);
+	struct mempolicy *old = NULL;
 	int ret;
+	NODEMASK_SCRATCH(scratch);
 
 	if (!scratch)
 		return -ENOMEM;
 
-	new = mpol_new(mode, flags, nodes);
-	if (IS_ERR(new)) {
-		ret = PTR_ERR(new);
-		goto out;
-	}
-
 	task_lock(current);
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		task_unlock(current);
-		mpol_put(new);
 		goto out;
 	}
 
@@ -884,14 +877,35 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	current->mempolicy = new;
 	if (new && new->mode == MPOL_INTERLEAVE)
 		current->il_prev = MAX_NUMNODES-1;
-	task_unlock(current);
-	mpol_put(old);
-	ret = 0;
 out:
+	task_unlock(current);
+	if (old)
+		mpol_put(old);
+
 	NODEMASK_SCRATCH_FREE(scratch);
 	return ret;
 }
 
+/* Set the process memory policy */
+static long do_set_mempolicy(unsigned short mode, unsigned short flags,
+			     nodemask_t *nodes)
+{
+	struct mempolicy *new;
+	int ret;
+
+	new = mpol_new(mode, flags, nodes);
+	if (IS_ERR(new)) {
+		ret = PTR_ERR(new);
+		goto out;
+	}
+
+	ret = swap_mempolicy(new, nodes);
+	if (ret)
+		mpol_put(new);
+out:
+	return ret;
+}
+
 /*
  * Return nodemask for policy for get_mempolicy() query
  *
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls
  2023-09-14 23:54 [RFC PATCH 0/3] mm/mempolicy: set/get_mempolicy2 Gregory Price
  2023-09-14 23:54 ` [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
@ 2023-09-14 23:54 ` Gregory Price
  2023-10-02 13:30   ` Jonathan Cameron
  2023-09-14 23:54 ` [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy Gregory Price
  2 siblings, 1 reply; 10+ messages in thread
From: Gregory Price @ 2023-09-14 23:54 UTC (permalink / raw
  To: linux-mm
  Cc: linux-kernel, linux-arch, linux-api, linux-cxl, luto, tglx, mingo,
	bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

sys_set_mempolicy is limited by its current argument structure
(mode, nodes, flags) to implementing policies that can be described
in that manner.

Implement set/get_mempolicy2 with a new mempolicy_args structure
which encapsulates the old behavior, and allows for new mempolicies
which may require additional information.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   2 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |  10 +-
 include/uapi/linux/mempolicy.h         |  32 ++++
 mm/mempolicy.c                         | 215 ++++++++++++++++++++++++-
 6 files changed, 261 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..a72ef588a704 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,5 @@
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	i386	cachestat		sys_cachestat
 452	i386	fchmodat2		sys_fchmodat2
+454	i386	set_mempolicy2		sys_set_mempolicy2
+455	i386	get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..ec54064de8b3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,8 @@
 451	common	cachestat		sys_cachestat
 452	common	fchmodat2		sys_fchmodat2
 453	64	map_shadow_stack	sys_map_shadow_stack
+454	common	set_mempolicy2		sys_set_mempolicy2
+455	common	get_mempolicy2		sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..d50a452954ae 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -813,6 +813,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
+asmlinkage long sys_get_mempolicy2(struct mempolicy_args __user *args);
+asmlinkage long sys_set_mempolicy2(struct mempolicy_args __user *args);
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *from,
 				const unsigned long __user *to);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..397dcf804941 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,16 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
 
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_set_mempolicy2 454
+__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 455
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 456
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 046d0ccba4cd..53650f69db2b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,9 +23,41 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_LEGACY,	/* set_mempolicy limited to above modes */
 	MPOL_MAX,	/* always last member of enum */
 };
 
+struct mempolicy_args {
+	int err;
+	unsigned short mode;
+	unsigned long *nodemask;
+	unsigned long maxnode;
+	unsigned short flags;
+	struct {
+		/* Memory allowed */
+		struct {
+			int err;
+			unsigned long maxnode;
+			unsigned long *nodemask;
+		} allowed;
+		/* Address information */
+		struct {
+			int err;
+			unsigned long addr;
+			unsigned long node;
+			unsigned short mode;
+			unsigned short flags;
+		} addr;
+		/* Interleave */
+	} get;
+	/* Mode specific settings */
+	union {
+		struct {
+			unsigned long next_node; /* get only */
+		} interleave;
+	};
+};
+
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f49337f6f300..1cf7709400f1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1483,7 +1483,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
 
-	if ((unsigned int)(*mode) >=  MPOL_MAX)
+	if ((unsigned int)(*mode) >= MPOL_LEGACY)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
@@ -1614,6 +1614,219 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(mode, nmask, maxnode);
 }
 
+static long do_set_mempolicy2(struct mempolicy_args *args)
+{
+	struct mempolicy *new = NULL;
+	nodemask_t nodes;
+	int err;
+
+	if (args->mode <= MPOL_LEGACY)
+		return -EINVAL;
+
+	if (args->mode >= MPOL_MAX)
+		return -EINVAL;
+
+	err = get_nodes(&nodes, args->nodemask, args->maxnode);
+	if (err)
+		return err;
+
+	new = mpol_new(args->mode, args->flags, &nodes);
+	if (IS_ERR(new)) {
+		err = PTR_ERR(new);
+		goto out;
+	}
+
+	switch (args->mode) {
+	default:
+		BUG();
+	}
+
+	if (err)
+		goto out;
+
+	err = swap_mempolicy(new, &nodes);
+out:
+	if (err && new)
+		mpol_put(new);
+	return err;
+};
+
+static bool mempolicy2_args_valid(struct mempolicy_args *kargs)
+{
+	/* Legacy modes are routed through the legacy interface */
+	if (kargs->mode <= MPOL_LEGACY)
+		return false;
+
+	if (kargs->mode >= MPOL_MAX)
+		return false;
+
+	return true;
+}
+
+static long kernel_set_mempolicy2(const struct mempolicy_args __user *uargs,
+				  size_t usize)
+{
+	struct mempolicy_args kargs;
+	int err;
+
+	if (usize != sizeof(kargs))
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+	/* If the mode is legacy, use the legacy path */
+	if (kargs.mode < MPOL_LEGACY) {
+		int legacy_mode = kargs.mode | kargs.flags;
+		const unsigned long __user *lnmask = kargs.nodemask;
+		unsigned long maxnode = kargs.maxnode;
+
+		return kernel_set_mempolicy(legacy_mode, lnmask, maxnode);
+	}
+
+	if (!mempolicy2_args_valid(&kargs))
+		return -EINVAL;
+
+	return do_set_mempolicy2(&kargs);
+}
+
+SYSCALL_DEFINE2(set_mempolicy2, const struct mempolicy_args __user *, args,
+		size_t, size)
+{
+	return kernel_set_mempolicy2(args, size);
+}
+
+/* Gets extended mempolicy information */
+static long do_get_mempolicy2(struct mempolicy_args *kargs)
+{
+	struct mempolicy *pol = current->mempolicy;
+	nodemask_t knodes;
+	int err = 0;
+
+	kargs->err = 0;
+	kargs->mode = pol->mode;
+	/* Mask off internal flags */
+	kargs->flags = (pol->flags & MPOL_MODE_FLAGS);
+
+	if (kargs->nodemask) {
+		if (mpol_store_user_nodemask(pol)) {
+			knodes = pol->w.user_nodemask;
+		} else {
+			task_lock(current);
+			get_policy_nodemask(pol, &knodes);
+			task_unlock(current);
+		}
+		err = copy_nodes_to_user(kargs->nodemask,
+					 kargs->maxnode,
+					 &knodes);
+		if (err)
+			return -EINVAL;
+	}
+
+
+	if (kargs->get.allowed.nodemask) {
+		kargs->get.allowed.err = 0;
+		task_lock(current);
+		knodes = cpuset_current_mems_allowed;
+		task_unlock(current);
+		err = copy_nodes_to_user(kargs->get.allowed.nodemask,
+					 kargs->get.allowed.maxnode,
+					 &knodes);
+		kargs->get.allowed.err = err ? err : 0;
+		kargs->err |= err ? err : 1;
+	}
+
+	if (kargs->get.addr.addr) {
+		struct mempolicy *addr_pol = NULL;
+		struct vm_area_struct *vma = NULL;
+		struct mm_struct *mm = current->mm;
+		unsigned long addr = kargs->get.addr.addr;
+
+		kargs->get.addr.err = 0;
+
+		/*
+		 * Do NOT fall back to task policy if the
+		 * vma/shared policy at addr is NULL.  We
+		 * want to return MPOL_DEFAULT in this case.
+		 */
+		mmap_read_lock(mm);
+		vma = vma_lookup(mm, addr);
+		if (!vma) {
+			mmap_read_unlock(mm);
+			kargs->get.addr.err = -EFAULT;
+			kargs->err |= err ? err : 2;
+			goto mode_info;
+		}
+		if (vma->vm_ops && vma->vm_ops->get_policy)
+			addr_pol = vma->vm_ops->get_policy(vma, addr);
+		else
+			addr_pol = vma->vm_policy;
+
+		kargs->get.addr.mode = addr_pol->mode;
+		/* Mask off internal flags */
+		kargs->get.addr.flags = (pol->flags & MPOL_MODE_FLAGS);
+
+		/*
+		 * Take a refcount on the mpol, because we are about to
+		 * drop the mmap_lock, after which only "pol" remains
+		 * valid, "vma" is stale.
+		 */
+		vma = NULL;
+		mpol_get(addr_pol);
+		mmap_read_unlock(mm);
+		err = lookup_node(mm, addr);
+		mpol_put(addr_pol);
+		if (err < 0) {
+			kargs->get.addr.err = err;
+			kargs->err |= err ? err : 4;
+			goto mode_info;
+		}
+		kargs->get.addr.node = err;
+	}
+
+mode_info:
+	switch (kargs->mode) {
+	case MPOL_INTERLEAVE:
+		kargs->interleave.next_node = next_node_in(current->il_prev,
+							   pol->nodes);
+		break;
+	default:
+		break;
+	}
+
+	return err;
+}
+
+static long kernel_get_mempolicy2(struct mempolicy_args __user *uargs,
+				  size_t usize)
+{
+	struct mempolicy_args kargs;
+	int err;
+
+	if (usize != sizeof(struct mempolicy_args))
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+	/* Get the extended memory policy information (kargs.ext) */
+	err = do_get_mempolicy2(&kargs);
+	if (err)
+		return err;
+
+	err = copy_to_user(uargs, &kargs, sizeof(struct mempolicy_args));
+
+	return err;
+}
+
+SYSCALL_DEFINE2(get_mempolicy2, struct mempolicy_args __user *, policy,
+		size_t, size)
+{
+	return kernel_get_mempolicy2(policy, size);
+}
+
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *old_nodes,
 				const unsigned long __user *new_nodes)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy
  2023-09-14 23:54 [RFC PATCH 0/3] mm/mempolicy: set/get_mempolicy2 Gregory Price
  2023-09-14 23:54 ` [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
  2023-09-14 23:54 ` [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls Gregory Price
@ 2023-09-14 23:54 ` Gregory Price
  2023-10-02 13:40   ` Jonathan Cameron
  2 siblings, 1 reply; 10+ messages in thread
From: Gregory Price @ 2023-09-14 23:54 UTC (permalink / raw
  To: linux-mm
  Cc: linux-kernel, linux-arch, linux-api, linux-cxl, luto, tglx, mingo,
	bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

The partial-interleave mempolicy implements interleave on an
allocation interval. The default node is the local node, for
which N pages will be allocated before an interleave pass occurs.

For example:
  nodes=0,1,2
  interval=3
  cpunode=0

Over 10 consecutive allocations, the following nodes will be selected:
[0,0,0,1,2,0,0,0,1,2]

In this example, there is a 60%/20%/20% distribution of memory.

Using this mechanism, it becomes possible to define an approximate
distribution percentage of memory across a set of nodes:

local_node% : interval/(interval + (nr_nodes-1))
other_node% : (1 - local_node%)/(nr_nodes-1)
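For the example above (interval=3, nr_nodes=3) this gives
3/(3+2) = 60% on the local node and 20% on each of the other nodes.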

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 include/linux/mempolicy.h      |   8 ++
 include/uapi/linux/mempolicy.h |   5 +
 mm/mempolicy.c                 | 161 +++++++++++++++++++++++++++++++--
 3 files changed, 166 insertions(+), 8 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..41a6de9ff556 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,6 +48,14 @@ struct mempolicy {
 	nodemask_t nodes;	/* interleave/bind/perfer */
 	int home_node;		/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
+	union {
+		/* Partial Interleave: Allocate local count, then interleave */
+		struct {
+			int interval;
+			int count;
+		} part_int;
+	};
+
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 53650f69db2b..1af344344459 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -24,6 +24,7 @@ enum {
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
 	MPOL_LEGACY,	/* set_mempolicy limited to above modes */
+	MPOL_PARTIAL_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
@@ -55,6 +56,10 @@ struct mempolicy_args {
 		struct {
 			unsigned long next_node; /* get only */
 		} interleave;
+		struct {
+			unsigned long interval;  /* get and set */
+			unsigned long next_node; /* get only */
+		} part_int;
 	};
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1cf7709400f1..a2ee45ac2ab6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -399,6 +399,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_nodemask,
 	},
+	[MPOL_PARTIAL_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 	[MPOL_PREFERRED] = {
 		.create = mpol_new_preferred,
 		.rebind = mpol_rebind_preferred,
@@ -875,7 +879,8 @@ static long swap_mempolicy(struct mempolicy *new,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_PARTIAL_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 out:
 	task_unlock(current);
@@ -920,6 +925,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	switch (p->mode) {
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_PARTIAL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
 		*nodes = p->nodes;
@@ -1614,6 +1620,23 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(mode, nmask, maxnode);
 }
 
+static long do_set_partial_interleave(struct mempolicy_args *args,
+				      struct mempolicy *new,
+				      nodemask_t *nodes)
+{
+	/* Preferred interleave cannot be done with no nodemask */
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+
+	/* Preferred interleave interval cannot be <= 0 */
+	if (args->part_int.interval <= 0)
+		return -EINVAL;
+
+	new->part_int.interval = args->part_int.interval;
+	new->part_int.count = 0;
+	return 0;
+}
+
 static long do_set_mempolicy2(struct mempolicy_args *args)
 {
 	struct mempolicy *new = NULL;
@@ -1637,6 +1660,9 @@ static long do_set_mempolicy2(struct mempolicy_args *args)
 	}
 
 	switch (args->mode) {
+	case MPOL_PARTIAL_INTERLEAVE:
+		err = do_set_partial_interleave(args, new, &nodes);
+		break;
 	default:
 		BUG();
 	}
@@ -1791,6 +1817,11 @@ static long do_get_mempolicy2(struct mempolicy_args *kargs)
 		kargs->interleave.next_node = next_node_in(current->il_prev,
 							   pol->nodes);
 		break;
+	case MPOL_PARTIAL_INTERLEAVE:
+		kargs->part_int.next_node = next_node_in(current->il_prev,
+							 pol->nodes);
+		kargs->part_int.interval = pol->part_int.interval;
+		break;
 	default:
 		break;
 	}
@@ -2133,8 +2164,19 @@ static unsigned interleave_nodes(struct mempolicy *policy)
 	struct task_struct *me = current;
 
 	next = next_node_in(me->il_prev, policy->nodes);
-	if (next < MAX_NUMNODES)
+
+	if (policy->mode == MPOL_PARTIAL_INTERLEAVE) {
+		if (next == numa_node_id()) {
+			if (++policy->part_int.count >= policy->part_int.interval) {
+				policy->part_int.count = 0;
+				me->il_prev = next;
+			}
+		} else if (next < MAX_NUMNODES) {
+			me->il_prev = next;
+		}
+	} else if (next < MAX_NUMNODES)
 		me->il_prev = next;
+
 	return next;
 }
 
@@ -2159,6 +2201,7 @@ unsigned int mempolicy_slab_node(void)
 		return first_node(policy->nodes);
 
 	case MPOL_INTERLEAVE:
+	case MPOL_PARTIAL_INTERLEAVE:
 		return interleave_nodes(policy);
 
 	case MPOL_BIND:
@@ -2195,7 +2238,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 	nodemask_t nodemask = pol->nodes;
 	unsigned int target, nnodes;
 	int i;
-	int nid;
+	int nid = MAX_NUMNODES;
 	/*
 	 * The barrier will stabilize the nodemask in a register or on
 	 * the stack so that it will stop changing under the code.
@@ -2208,8 +2251,35 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
 		return numa_node_id();
-	target = (unsigned int)n % nnodes;
-	nid = first_node(nodemask);
+
+	if (pol->mode == MPOL_PARTIAL_INTERLEAVE) {
+		int interval = pol->part_int.interval;
+		/*
+		 * Mode or interval can change so default to basic interleave
+		 * if the interval has become invalid.  Basic interleave is
+		 * equivalent to interval=1. Don't double-count the base node
+		 */
+		if (interval == 0)
+			interval = 1;
+		interval -= 1;
+
+		/* If target <= the interval, no need to call next_node */
+		target = ((unsigned int)n % (nnodes + interval));
+		target -= (target > interval) ? interval : target;
+		target %= MAX_NUMNODES;
+
+		/* If the local node ID is no longer set, do interleave */
+		nid = numa_node_id();
+		if (!node_isset(nid, nodemask))
+			nid = MAX_NUMNODES;
+	}
+
+	/* If partial interleave generated an invalid nid, do interleave */
+	if (nid == MAX_NUMNODES) {
+		target = (unsigned int)n % nnodes;
+		nid = first_node(nodemask);
+	}
+
 	for (i = 0; i < target; i++)
 		nid = next_node(nid, nodemask);
 	return nid;
@@ -2263,7 +2333,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 	*nodemask = NULL;
 	mode = (*mpol)->mode;
 
-	if (unlikely(mode == MPOL_INTERLEAVE)) {
+	if (unlikely(mode == MPOL_INTERLEAVE) ||
+	    unlikely(mode == MPOL_PARTIAL_INTERLEAVE)) {
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
@@ -2304,6 +2375,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_PARTIAL_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2414,7 +2486,8 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 	pol = get_vma_policy(vma, addr);
 
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_PARTIAL_INTERLEAVE) {
 		struct page *page;
 		unsigned nid;
 
@@ -2516,7 +2589,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
 	 */
-	if (pol->mode == MPOL_INTERLEAVE)
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_PARTIAL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	else if (pol->mode == MPOL_PREFERRED_MANY)
 		page = alloc_pages_preferred_many(gfp, order,
@@ -2576,6 +2650,68 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_partial_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	nodemask_t nodemask = pol->nodes;
+	unsigned long nr_pages_main;
+	unsigned long nr_pages_other;
+	unsigned long total_cycle;
+	unsigned long delta;
+	unsigned long interval;
+	int allocated = 0;
+	int start_nid;
+	int nnodes;
+	int prev, next;
+	int i;
+
+	/* This stabilizes nodes on the stack in case pol->nodes changes */
+	barrier();
+
+	nnodes = nodes_weight(nodemask);
+	start_nid = numa_node_id();
+
+	if (!node_isset(start_nid, nodemask))
+		start_nid = first_node(nodemask);
+
+	if (nnodes == 1) {
+		allocated = __alloc_pages_bulk(gfp, start_nid,
+					       NULL, nr_pages,
+					       NULL, page_array);
+		return allocated;
+	}
+	/* We don't want to double-count the main node in calculations */
+	nnodes--;
+
+	interval = pol->part_int.interval;
+	total_cycle = (interval + nnodes);
+	/* Number of pages on main node: (cycles*interval + up to interval) */
+	nr_pages_main = ((nr_pages / total_cycle) * interval);
+	nr_pages_main += (nr_pages % total_cycle % (interval + 1));
+	/* Number of pages on others: (remaining/nodes) + 1 page if delta  */
+	nr_pages_other = (nr_pages - nr_pages_main) / nnodes;
+	nr_pages_other /= nnodes;
+	/* Delta is number of pages beyond interval up to full cycle */
+	delta = nr_pages - (nr_pages_main + (nr_pages_other * nnodes));
+
+	/* start by allocating for the main node, then interleave rest */
+	prev = start_nid;
+	allocated = __alloc_pages_bulk(gfp, start_nid, NULL, nr_pages_main,
+				       NULL, page_array);
+	for (i = 0; i < nnodes; i++) {
+		int pages = nr_pages_other + (delta-- ? 1 : 0);
+
+		next = next_node_in(prev, nodemask);
+		if (next < MAX_NUMNODES)
+			prev = next;
+		allocated += __alloc_pages_bulk(gfp, next, NULL, pages,
+						NULL, page_array);
+	}
+
+	return allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2614,6 +2750,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_PARTIAL_INTERLEAVE)
+		return alloc_pages_bulk_array_partial_interleave(gfp, pol,
+								 nr_pages,
+								 page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2686,6 +2827,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	switch (a->mode) {
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_PARTIAL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
 		return !!nodes_equal(a->nodes, b->nodes);
@@ -2822,6 +2964,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
+	case MPOL_PARTIAL_INTERLEAVE:
 		pgoff = vma->vm_pgoff;
 		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
 		polnid = offset_il_node(pol, pgoff);
@@ -3209,6 +3352,7 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_PARTIAL_INTERLEAVE] = "partial interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -3379,6 +3523,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_PARTIAL_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use
  2023-09-14 23:54 ` [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
@ 2023-10-02 11:03   ` Jonathan Cameron
  0 siblings, 0 replies; 10+ messages in thread
From: Jonathan Cameron @ 2023-10-02 11:03 UTC (permalink / raw
  To: Gregory Price
  Cc: linux-mm, linux-kernel, linux-arch, linux-api, linux-cxl, luto,
	tglx, mingo, bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

On Thu, 14 Sep 2023 19:54:55 -0400
Gregory Price <gourry.memverge@gmail.com> wrote:

> Refactors do_set_mempolicy into swap_mempolicy and do_set_mempolicy
> so that swap_mempolicy can be re-used with set_mempolicy2.
> 
> Signed-off-by: Gregory Price <gregory.price@memverge.com>

Obviously this is an RFC, so you probably didn't give it the polish
a finished patch might have.  Still I was curious and reading it and
I can't resist pointing out trivial stuff.. So....

> ---
>  mm/mempolicy.c | 44 +++++++++++++++++++++++++++++---------------
>  1 file changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 42b5567e3773..f49337f6f300 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -855,28 +855,21 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	return vma_replace_policy(vma, new_pol);
>  }
>  
> -/* Set the process memory policy */
> -static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> -			     nodemask_t *nodes)
> +/* Swap in a new mempolicy, release the old one if successful */

Not really swapping. More replacing given we don't get the
old one back to do something else with it.

> +static long swap_mempolicy(struct mempolicy *new,
> +			   nodemask_t *nodes)

Excessive wrapping.

>  {
> -	struct mempolicy *new, *old;
> -	NODEMASK_SCRATCH(scratch);
> +	struct mempolicy *old = NULL;
>  	int ret;
> +	NODEMASK_SCRATCH(scratch);

I'd avoid the reordering as makes it look like slightly more is happening
in this change than is actually the case.

>  
>  	if (!scratch)
>  		return -ENOMEM;
>  
> -	new = mpol_new(mode, flags, nodes);
> -	if (IS_ERR(new)) {
> -		ret = PTR_ERR(new);
> -		goto out;
> -	}
> -
>  	task_lock(current);
>  	ret = mpol_set_nodemask(new, nodes, scratch);
>  	if (ret) {
>  		task_unlock(current);
> -		mpol_put(new);
>  		goto out;
>  	}
>  
> @@ -884,14 +877,35 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>  	current->mempolicy = new;
>  	if (new && new->mode == MPOL_INTERLEAVE)
>  		current->il_prev = MAX_NUMNODES-1;
> -	task_unlock(current);
> -	mpol_put(old);
> -	ret = 0;
>  out:
> +	task_unlock(current);
> +	if (old)
> +		mpol_put(old);
It's protected against NULL parameter internally, so
	mpol_put(old);

which has advantage that a block of diff will hopefully disappear making
this patch easier to read.

> +
>  	NODEMASK_SCRATCH_FREE(scratch);
>  	return ret;
>  }
>  
> +/* Set the process memory policy */
> +static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> +			     nodemask_t *nodes)
> +{
> +	struct mempolicy *new;
> +	int ret;
> +
> +	new = mpol_new(mode, flags, nodes);
> +	if (IS_ERR(new)) {
> +		ret = PTR_ERR(new);
> +		goto out;

Given nothing to do at the out label, in keeping with at least some local
style, you could do direct returns on errors.

	if (IS_ERR(new))
		return PTR_ERR(new)

	ret = swap_mempolicy(new, nodes);
	if (ret) {
		mpol_put(new);
		return ret;
	}

	return 0;

> +	}
> +
> +	ret = swap_mempolicy(new, nodes);
> +	if (ret)
> +		mpol_put(new);
> +out:
> +	return ret;
> +}
> +
>  /*
>   * Return nodemask for policy for get_mempolicy() query
>   *


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls
  2023-09-14 23:54 ` [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls Gregory Price
@ 2023-10-02 13:30   ` Jonathan Cameron
  2023-10-02 15:30     ` Gregory Price
  2023-10-02 18:03     ` Gregory Price
  0 siblings, 2 replies; 10+ messages in thread
From: Jonathan Cameron @ 2023-10-02 13:30 UTC (permalink / raw
  To: Gregory Price
  Cc: linux-mm, linux-kernel, linux-arch, linux-api, linux-cxl, luto,
	tglx, mingo, bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

On Thu, 14 Sep 2023 19:54:56 -0400
Gregory Price <gourry.memverge@gmail.com> wrote:

> sys_set_mempolicy is limited by its current argument structure
> (mode, nodes, flags) to implementing policies that can be described
> in that manner.
> 
> Implement set/get_mempolicy2 with a new mempolicy_args structure
> which encapsulates the old behavior, and allows for new mempolicies
> which may require additional information.
> 
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
Some random comments inline.

Jonathan


> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   2 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/unistd.h      |  10 +-
>  include/uapi/linux/mempolicy.h         |  32 ++++
>  mm/mempolicy.c                         | 215 ++++++++++++++++++++++++-
>  6 files changed, 261 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 2d0b1bd866ea..a72ef588a704 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -457,3 +457,5 @@
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
>  451	i386	cachestat		sys_cachestat
>  452	i386	fchmodat2		sys_fchmodat2
> +454	i386	set_mempolicy2		sys_set_mempolicy2
> +455	i386	get_mempolicy2		sys_get_mempolicy2
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 1d6eee30eceb..ec54064de8b3 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -375,6 +375,8 @@
>  451	common	cachestat		sys_cachestat
>  452	common	fchmodat2		sys_fchmodat2
>  453	64	map_shadow_stack	sys_map_shadow_stack
> +454	common	set_mempolicy2		sys_set_mempolicy2
> +455	common	get_mempolicy2		sys_get_mempolicy2
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 22bc6bc147f8..d50a452954ae 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -813,6 +813,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
>  				unsigned long addr, unsigned long flags);
>  asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
>  				unsigned long maxnode);
> +asmlinkage long sys_get_mempolicy2(struct mempolicy_args __user *args);
> +asmlinkage long sys_set_mempolicy2(struct mempolicy_args __user *args);
>  asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
>  				const unsigned long __user *from,
>  				const unsigned long __user *to);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index abe087c53b4b..397dcf804941 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -823,8 +823,16 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
>  #define __NR_fchmodat2 452
>  __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
>  
> +/* CONFIG_MMU only */
> +#ifndef __ARCH_NOMMU
> +#define __NR_set_mempolicy2 454
> +__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
> +#define __NR_get_mempolicy2 455
> +__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
> +#endif
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 453
> +#define __NR_syscalls 456
+3 for 2 additions?

>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 046d0ccba4cd..53650f69db2b 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,9 +23,41 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_LEGACY,	/* set_mempolicy limited to above modes */
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> +struct mempolicy_args {
> +	int err;
> +	unsigned short mode;
> +	unsigned long *nodemask;
> +	unsigned long maxnode;
> +	unsigned short flags;
> +	struct {
> +		/* Memory allowed */
> +		struct {
> +			int err;
> +			unsigned long maxnode;
> +			unsigned long *nodemask;
> +		} allowed;
> +		/* Address information */
> +		struct {
> +			int err;
> +			unsigned long addr;
> +			unsigned long node;
> +			unsigned short mode;
> +			unsigned short flags;
> +		} addr;
> +		/* Interleave */
> +	} get;
> +	/* Mode specific settings */
> +	union {
> +		struct {
> +			unsigned long next_node; /* get only */
> +		} interleave;
> +	};
> +};
> +
>  /* Flags for set_mempolicy */
>  #define MPOL_F_STATIC_NODES	(1 << 15)
>  #define MPOL_F_RELATIVE_NODES	(1 << 14)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f49337f6f300..1cf7709400f1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1483,7 +1483,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
>  	*flags = *mode & MPOL_MODE_FLAGS;
>  	*mode &= ~MPOL_MODE_FLAGS;
>  
> -	if ((unsigned int)(*mode) >=  MPOL_MAX)
> +	if ((unsigned int)(*mode) >= MPOL_LEGACY)
>  		return -EINVAL;
>  	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
>  		return -EINVAL;
> @@ -1614,6 +1614,219 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
>  	return kernel_set_mempolicy(mode, nmask, maxnode);
>  }
>  
> +static long do_set_mempolicy2(struct mempolicy_args *args)
> +{
> +	struct mempolicy *new = NULL;
> +	nodemask_t nodes;
> +	int err;
> +
> +	if (args->mode <= MPOL_LEGACY)
> +		return -EINVAL;
> +
> +	if (args->mode >= MPOL_MAX)
> +		return -EINVAL;
> +
> +	err = get_nodes(&nodes, args->nodemask, args->maxnode);
> +	if (err)
> +		return err;
> +
> +	new = mpol_new(args->mode, args->flags, &nodes);
> +	if (IS_ERR(new)) {
> +		err = PTR_ERR(new);
> +		goto out;

I'd expect mpol_new() to be side effect free on error,
so
		return PTR_ERR(new);
should be fine?

> +	}
> +
> +	switch (args->mode) {
> +	default:
> +		BUG();
> +	}
> +
> +	if (err)
> +		goto out;
> +
> +	err = swap_mempolicy(new, &nodes);
> +out:
> +	if (err && new)

as IS_ERR(new) is true, I think this puts 'new' even if mpol_new
returned an error pointer.  That seems unwise.

I'd push this block below a return 0 anyway, so as to avoid
error handling in the good path.

> +		mpol_put(new);
> +	return err;
> +};
> +
> +static bool mempolicy2_args_valid(struct mempolicy_args *kargs)
> +{
> +	/* Legacy modes are routed through the legacy interface */
> +	if (kargs->mode <= MPOL_LEGACY)
> +		return false;
> +
> +	if (kargs->mode >= MPOL_MAX)
> +		return false;
> +
> +	return true;

This is a range check, so I think equally clear (and shorter) as..
	/* Legacy modes are routed through the legacy interface */
	return kargs->mode > MPOL_LEGACY && kargs->mode < MPOL_MAX;

> +}
> +
> +static long kernel_set_mempolicy2(const struct mempolicy_args __user *uargs,
> +				  size_t usize)
> +{
> +	struct mempolicy_args kargs;
> +	int err;
> +
> +	if (usize != sizeof(kargs))

As below, maybe allow for bigger with assumption we'll ignore what is in the
extra space.

> +		return -EINVAL;
> +
> +	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
> +	if (err)
> +		return err;
> +
> +	/* If the mode is legacy, use the legacy path */
> +	if (kargs.mode < MPOL_LEGACY) {
> +		int legacy_mode = kargs.mode | kargs.flags;
> +		const unsigned long __user *lnmask = kargs.nodemask;
> +		unsigned long maxnode = kargs.maxnode;
> +
> +		return kernel_set_mempolicy(legacy_mode, lnmask, maxnode);
> +	}
> +
> +	if (!mempolicy2_args_valid(&kargs))
> +		return -EINVAL;
> +
> +	return do_set_mempolicy2(&kargs);
> +}
> +
> +SYSCALL_DEFINE2(set_mempolicy2, const struct mempolicy_args __user *, args,
> +		size_t, size)
> +{
> +	return kernel_set_mempolicy2(args, size);
> +}
> +
> +/* Gets extended mempolicy information */
> +static long do_get_mempolicy2(struct mempolicy_args *kargs)
> +{
> +	struct mempolicy *pol = current->mempolicy;
> +	nodemask_t knodes;
> +	int err = 0;
> +
> +	kargs->err = 0;
> +	kargs->mode = pol->mode;
> +	/* Mask off internal flags */
> +	kargs->flags = (pol->flags & MPOL_MODE_FLAGS);

Excessive brackets.

> +
> +	if (kargs->nodemask) {
> +		if (mpol_store_user_nodemask(pol)) {
> +			knodes = pol->w.user_nodemask;
> +		} else {
> +			task_lock(current);
> +			get_policy_nodemask(pol, &knodes);
> +			task_unlock(current);
> +		}
> +		err = copy_nodes_to_user(kargs->nodemask,
> +					 kargs->maxnode,
> +					 &knodes);
Can wrap this less.

> +		if (err)

return err ?

> +			return -EINVAL;
> +	}
> +
> +
> +	if (kargs->get.allowed.nodemask) {
> +		kargs->get.allowed.err = 0;
> +		task_lock(current);
> +		knodes = cpuset_current_mems_allowed;
> +		task_unlock(current);
> +		err = copy_nodes_to_user(kargs->get.allowed.nodemask,
> +					 kargs->get.allowed.maxnode,
> +					 &knodes);
> +		kargs->get.allowed.err = err ? err : 0;
> +		kargs->err |= err ? err : 1;
		if (err) {
			kargs->get.allowed.err = err;
			kargs->err |= err;
		} else {
			kargs->get.allowed.err = 0;
			kargs->err = 1;
		}
	Not particularly obvious why 1, and if you get an error later it's
	going to be messy, as will 1 |= err_code.
> +	}
> +
> +	if (kargs->get.addr.addr) {
> +		struct mempolicy *addr_pol = NULL;

Why init here - I think it's always set before use.

> +		struct vm_area_struct *vma = NULL;

Why init here?

> +		struct mm_struct *mm = current->mm;
> +		unsigned long addr = kargs->get.addr.addr;
> +
> +		kargs->get.addr.err = 0;

I'd set this only in the good path. You overwrite it
in the bad paths anyway, so just move it down below the error
checks.

> +
> +		/*
> +		 * Do NOT fall back to task policy if the
> +		 * vma/shared policy at addr is NULL.  We
> +		 * want to return MPOL_DEFAULT in this case.
> +		 */
> +		mmap_read_lock(mm);
> +		vma = vma_lookup(mm, addr);
> +		if (!vma) {
> +			mmap_read_unlock(mm);
> +			kargs->get.addr.err = -EFAULT;
> +			kargs->err |= err ? err : 2;
> +			goto mode_info;
> +		}
> +		if (vma->vm_ops && vma->vm_ops->get_policy)
> +			addr_pol = vma->vm_ops->get_policy(vma, addr);
> +		else
> +			addr_pol = vma->vm_policy;
> +
> +		kargs->get.addr.mode = addr_pol->mode;
> +		/* Mask off internal flags */
> +		kargs->get.addr.flags = (pol->flags & MPOL_MODE_FLAGS);
> +
> +		/*
> +		 * Take a refcount on the mpol, because we are about to
> +		 * drop the mmap_lock, after which only "pol" remains
> +		 * valid, "vma" is stale.
> +		 */
> +		vma = NULL;
> +		mpol_get(addr_pol);
> +		mmap_read_unlock(mm);
> +		err = lookup_node(mm, addr);
> +		mpol_put(addr_pol);
> +		if (err < 0) {
> +			kargs->get.addr.err = err;
> +			kargs->err |= err ? err : 4;
> +			goto mode_info;
> +		}
> +		kargs->get.addr.node = err;

Confusing to call something that isn't an error, err. I'd use a different
local variable for this and set err = rc in error path only.

Could set the get.addr.err = 0; down here as this is only way it remains 0
if you set it earlier.


> +	}
> +
> +mode_info:
> +	switch (kargs->mode) {
> +	case MPOL_INTERLEAVE:
> +		kargs->interleave.next_node = next_node_in(current->il_prev,
> +							   pol->nodes);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return err;
> +}
> +
> +static long kernel_get_mempolicy2(struct mempolicy_args __user *uargs,
> +				  size_t usize)
> +{
> +	struct mempolicy_args kargs;
> +	int err;
> +
> +	if (usize != sizeof(struct mempolicy_args))

sizeof(kargs) for same reason as below.  I'm not sure on convention here
but is it wise to leave option for a newer userspace to send a larger
struct, knowing that fields in it might be ignored by an older kernel?


> +		return -EINVAL;
> +
> +	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
> +	if (err)
> +		return err;
> +
> +	/* Get the extended memory policy information (kargs.ext) */
> +	err = do_get_mempolicy2(&kargs);
> +	if (err)
> +		return err;
> +
> +	err = copy_to_user(uargs, &kargs, sizeof(struct mempolicy_args));
> +
> +	return err;

return copy_to_user(uargs, &kargs, sizeof(kargs));
You are inconsistent on the sizeof.  Better to pick one style, and
given both are used, I'd go with using the sizeof(thing) rather
than sizeof(type) option + shorter lines ;)

> +}
> +
> +SYSCALL_DEFINE2(get_mempolicy2, struct mempolicy_args __user *, policy,
> +		size_t, size)
> +{
> +	return kernel_get_mempolicy2(policy, size);
> +}
> +
>  static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>  				const unsigned long __user *old_nodes,
>  				const unsigned long __user *new_nodes)


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy
  2023-09-14 23:54 ` [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy Gregory Price
@ 2023-10-02 13:40   ` Jonathan Cameron
  2023-10-02 16:10     ` Gregory Price
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Cameron @ 2023-10-02 13:40 UTC (permalink / raw
  To: Gregory Price
  Cc: linux-mm, linux-kernel, linux-arch, linux-api, linux-cxl, luto,
	tglx, mingo, bp, dave.hansen, hpa, arnd, akpm, x86, Gregory Price

On Thu, 14 Sep 2023 19:54:57 -0400
Gregory Price <gourry.memverge@gmail.com> wrote:

> The partial-interleave mempolicy implements interleave on an

I'm not sure 'partial' really conveys what is going on here.
Weighted, or uneven-interleave maybe?

> allocation interval. The default node is the local node, for
> which N pages will be allocated before an interleave pass occurs.
> 
> For example:
>   nodes=0,1,2
>   interval=3
>   cpunode=0
> 
> Over 10 consecutive allocations, the following nodes will be selected:
> [0,0,0,1,2,0,0,0,1,2]
> 
> In this example, there is a 60%/20%/20% distribution of memory.
> 
> Using this mechanism, it becomes possible to define an approximate
> distribution percentage of memory across a set of nodes:
> 
> local_node% : interval/(interval + (nr_nodes-1))
> other_node% : (1 - local_node%)/(nr_nodes-1)

I'd like to see more discussion here of why you would do this...


A few trivial bits inline,

Jonathan

...

> +static unsigned long alloc_pages_bulk_array_partial_interleave(gfp_t gfp,
> +		struct mempolicy *pol, unsigned long nr_pages,
> +		struct page **page_array)
> +{
> +	nodemask_t nodemask = pol->nodes;
> +	unsigned long nr_pages_main;
> +	unsigned long nr_pages_other;
> +	unsigned long total_cycle;
> +	unsigned long delta;
> +	unsigned long interval;
> +	int allocated = 0;
> +	int start_nid;
> +	int nnodes;
> +	int prev, next;
> +	int i;
> +
> +	/* This stabilizes nodes on the stack in case pol->nodes changes */
> +	barrier();
> +
> +	nnodes = nodes_weight(nodemask);
> +	start_nid = numa_node_id();
> +
> +	if (!node_isset(start_nid, nodemask))
> +		start_nid = first_node(nodemask);
> +
> +	if (nnodes == 1) {
> +		allocated = __alloc_pages_bulk(gfp, start_nid,
> +					       NULL, nr_pages,
> +					       NULL, page_array);
> +		return allocated;
		return __alloc_pages_bulk(...)

> +	}
> +	/* We don't want to double-count the main node in calculations */
> +	nnodes--;
> +
> +	interval = pol->part_int.interval;
> +	total_cycle = (interval + nnodes);

excess brackets. Same in various other places.


> +	/* Number of pages on main node: (cycles*interval + up to interval) */
> +	nr_pages_main = ((nr_pages / total_cycle) * interval);
> +	nr_pages_main += (nr_pages % total_cycle % (interval + 1));


> +	/* Number of pages on others: (remaining/nodes) + 1 page if delta  */
> +	nr_pages_other = (nr_pages - nr_pages_main) / nnodes;
> +	nr_pages_other /= nnodes;
> +	/* Delta is number of pages beyond interval up to full cycle */
> +	delta = nr_pages - (nr_pages_main + (nr_pages_other * nnodes));
> +
> +	/* start by allocating for the main node, then interleave rest */
> +	prev = start_nid;
> +	allocated = __alloc_pages_bulk(gfp, start_nid, NULL, nr_pages_main,
> +				       NULL, page_array);
> +	for (i = 0; i < nnodes; i++) {
> +		int pages = nr_pages_other + (delta-- ? 1 : 0);
> +
> +		next = next_node_in(prev, nodemask);
> +		if (next < MAX_NUMNODES)
> +			prev = next;
> +		allocated += __alloc_pages_bulk(gfp, next, NULL, pages,
> +						NULL, page_array);
> +	}
> +
> +	return allocated;
> +}
> +



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls
  2023-10-02 13:30   ` Jonathan Cameron
@ 2023-10-02 15:30     ` Gregory Price
  2023-10-02 18:03     ` Gregory Price
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Price @ 2023-10-02 15:30 UTC (permalink / raw
  To: Jonathan Cameron
  Cc: Gregory Price, linux-mm, linux-kernel, linux-arch, linux-api,
	linux-cxl, luto, tglx, mingo, bp, dave.hansen, hpa, arnd, akpm,
	x86

On Mon, Oct 02, 2023 at 02:30:08PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Sep 2023 19:54:56 -0400
> Gregory Price <gourry.memverge@gmail.com> wrote:
> 
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index abe087c53b4b..397dcf804941 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > ...
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 453
> > +#define __NR_syscalls 456
> +3 for 2 additions?
> 

When I'd originally written this, there was a partially merged syscall
colliding with 453, and this hadn't been incremented yet.  Did a quick
grep and it seems like that might have been reverted, so yeah this would
drop down to 453/454 & __NR=455.

> > +	/* Legacy modes are routed through the legacy interface */
> > +	if (kargs->mode <= MPOL_LEGACY)
> > +		return false;
> > +
> > +	if (kargs->mode >= MPOL_MAX)
> > +		return false;
> > +
> > +	return true;
> 
> This is a range check, so I think equally clear (and shorter) as..
> 	/* Legacy modes are routed through the legacy interface */
> 	return kargs->mode > MPOL_LEGACY && kargs->mode < MPOL_MAX;
>

I'll combine the range, but I left the two true/false conditions
separate because it's intended that follow-on patches will add logic
before true is returned.

> > +		kargs->get.allowed.err = err ? err : 0;
> > +		kargs->err |= err ? err : 1;
> 		if (err) {
> 			kargs->get.allowed.err = err;
> 			kargs->err |= err;
> 		} else {
> 			kargs->get.allowed.err = 0;
> 			kargs->err = 1;
> 	Not particularly obvious why 1 and if you get an error later it's going to be messy
>         as will 1 |= err_code

My original intent was to just allow each section to error separately,
but honestly this seems overly complicated and somewhat against the
design of almost every other syscall, so I'm going to rip all these
error code spaces out and instead just have everything return on error.

Thanks!
Gregory

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy
  2023-10-02 13:40   ` Jonathan Cameron
@ 2023-10-02 16:10     ` Gregory Price
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2023-10-02 16:10 UTC (permalink / raw
  To: Jonathan Cameron
  Cc: Gregory Price, linux-mm, linux-kernel, linux-arch, linux-api,
	linux-cxl, luto, tglx, mingo, bp, dave.hansen, hpa, arnd, akpm,
	x86

On Mon, Oct 02, 2023 at 02:40:35PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Sep 2023 19:54:57 -0400
> Gregory Price <gourry.memverge@gmail.com> wrote:
> 
> > The partial-interleave mempolicy implements interleave on an
> 
> I'm not sure 'partial' really conveys what is going on here.
> Weighted, or uneven-interleave maybe?
>
> > local_node% : interval/((nr_nodes-1)+interval-1)
> > other_node% : (1-local_node%)/(nr_nodes-1)
> 
> I'd like to see more discussion here of why you would do this...
> 

TL;DR: "Partial" in the sense that it's a simplified version of
weighted interleave.  I honestly struggled with the name, but I'm not
tied to it if there's something better.

I also considered "Preferred Interleave", where the local node is
preferred for some weight, and the remaining nodes are interleaved.
Maybe that's a more intuitive name.

For now I'll start calling it "preferred interleave" instead.



More generally:

This was a first pass at weighted interleave without adding the full
weights[MAX_NUMNODES] field to the mempolicy structure.

I've since added full weighted interleave and that'll be in v2 of the
RFC (hopefully pushing up today after addressing your notes).



I'll leave these notes here for discussion in the RFC v2.
---
I can see advantages of both full-weighted and preferred-interleave.

Something to consider: task migration and cpuset/memcg.

With "full-weighted" interleave, consider the scenario where the user
initially runs on Node 0/Socket 0, and sets the following weights

[0:10,1:3,2:5,3:1]
Where the nodes are as follows:
0 - socket 0 DRAM
1 - socket 1 DRAM
2 - socket 0 CXL
3 - socket 1 CXL

If that task gets migrated to socket 1... that's not going to be a good
weighting plan.  This is the same reason a single set of weighted tiers
that abstract nodes is not a good idea - because Nodes 1 and 3 in this
scenario have "similar attributes" but only relative to their local
sockets (nodes 0/2 sit on socket 0, nodes 1/3 on socket 1).
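To make that concrete, here is a purely illustrative view (assuming node
3 really is socket 1's CXL device, which the next paragraph questions):
the weight array that preserves the original intent depends on which
socket the task is running on.

	/* Illustrative only, not a proposed interface: per-socket views
	 * of "local DRAM heavy, local CXL medium, remote nodes light".
	 */
	static const unsigned char weights_task_on_socket0[4] = { 10, 3, 5, 1 };
	static const unsigned char weights_task_on_socket1[4] = {  3, 10, 1, 5 };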

Worse - if Nodes 2 and 3 *don't* have similar attributes and we
implement an "auto-rebalance" mechanism, a lot of assumptions would have
to be made, and any time a migration between nodes is detected you would
have to re-run this auto-rebalance.

Even worse - I attempted to expose the weights per-task via procfs, and
realized the entire mempolicy subsystem is very unfriendly to outside
tasks twiddling bits (i.e. mempolicy is very 'current'-centric).

There are *tons* of race conditions that have to be handled, and it's
really rather nasty in my opinion.

Consider this code:

static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
{
	... snip ...
	nodemask = pol->nodes;

	/*
	 * The barrier will stabilize the nodemask in a register or on
	 * the stack so that it will stop changing under the code.
	 *
	 * Between first_node() and next_node(), pol->nodes could be changed
	 * by other threads. So we put pol->nodes in a local stack.
	 */
	barrier();

Big oof: you wouldn't be able to depend on this for weights, so you need
an algorithm that can tolerate some slop while weights are being replaced.
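A hypothetical sketch of what that forces on a weighted implementation
(pol->weights does not exist in this series; it is shown only to make
the point): like the nodemask above, any per-node weights would have to
be snapshotted locally before use, so that a concurrent update only
produces a briefly stale split rather than a torn mix of old and new
values.

	/* Hypothetical: pol->weights is not a real field in this series */
	static unsigned int snapshot_weights(struct mempolicy *pol,
					     unsigned char *weights)
	{
		nodemask_t nodemask = pol->nodes;
		unsigned int total = 0;
		int nid;

		barrier();	/* stabilize nodemask, as offset_il_node() does */
		for_each_node_mask(nid, nodemask) {
			weights[nid] = READ_ONCE(pol->weights[nid]);
			total += weights[nid];
		}
		return total;
	}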

So unless we rewrite mempolicy.c to be more robust in this sense, I
would argue a fully-weighted scenario is most useful if you are very
confident that your task is not going to be migrated.  Otherwise there
will be very high costs associated with recalculating weights.



With preferred-interleave, if a task migrates, the rebalance happens
automatically based on the nodemask:  The new local node becomes the
heavily weighted node, and the rest interleave evenly.

(If the local node is for some reason not in the nodemask, use the first
node in the mask, though this could possibly be changed to use a
manually defined node.)
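Something along these lines (illustrative only; not the actual code in
the patch):

	/* Illustrative sketch: the preferred node follows the running task */
	static int preferred_il_node(struct mempolicy *pol)
	{
		int nid = numa_node_id();

		if (!node_isset(nid, pol->nodes))
			nid = first_node(pol->nodes);

		return nid;
	}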

So basically, if you expect your task to be migratable, something like
"preferred interleave" gets you post-migration behavior more closely
aligned with what you originally wanted.  Similarly, if your interleave
ratios are
simple, then this strategy is the simplest way to get to the desired
outcome.


Is it the *best* strategy? TBD. The behavior is more predictable, though.

I will have a weighted interleave patch added to my next RFC.  I need to
test it first.


Thanks
Gregory

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls
  2023-10-02 13:30   ` Jonathan Cameron
  2023-10-02 15:30     ` Gregory Price
@ 2023-10-02 18:03     ` Gregory Price
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Price @ 2023-10-02 18:03 UTC (permalink / raw
  To: Jonathan Cameron
  Cc: Gregory Price, owner-linux-mm, linux-kernel, linux-arch,
	linux-api, linux-cxl, luto, tglx, mingo, bp, dave.hansen, hpa,
	arnd, akpm, x86

On Mon, Oct 02, 2023 at 02:30:08PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Sep 2023 19:54:56 -0400
> Gregory Price <gourry.memverge@gmail.com> wrote:
> 
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 1d6eee30eceb..ec54064de8b3 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -375,6 +375,8 @@
> >  451	common	cachestat		sys_cachestat
> >  452	common	fchmodat2		sys_fchmodat2
> >  453	64	map_shadow_stack	sys_map_shadow_stack
> > +454	common	set_mempolicy2		sys_set_mempolicy2
> > +455	common	get_mempolicy2		sys_get_mempolicy2
> >  

^^ This is the discrepancy.  map_shadow_stack is at 453, so __NR_syscalls
should already be 454, but map_shadow_stack has not been plumbed through
the rest of the kernel.

This needs to be addressed, but not in this RFC.

> >  #undef __NR_syscalls
> > -#define __NR_syscalls 453
> > +#define __NR_syscalls 456
> +3 for 2 additions?
> 

see above

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-10-02 18:03 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-14 23:54 [RFC PATCH 0/3] mm/mempolicy: set/get_mempolicy2 Gregory Price
2023-09-14 23:54 ` [RFC PATCH 1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
2023-10-02 11:03   ` Jonathan Cameron
2023-09-14 23:54 ` [RFC PATCH 2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls Gregory Price
2023-10-02 13:30   ` Jonathan Cameron
2023-10-02 15:30     ` Gregory Price
2023-10-02 18:03     ` Gregory Price
2023-09-14 23:54 ` [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy Gregory Price
2023-10-02 13:40   ` Jonathan Cameron
2023-10-02 16:10     ` Gregory Price
