From: Vinicius Petrucci <vpetrucci@gmail.com>
To: akpm@linux-foundation.org
Cc: linux-mm@vger.kernel.org, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-api@vger.kernel.org, minchan@kernel.org,
dave.hansen@linux.intel.com, x86@kernel.org,
Jonathan.Cameron@huawei.com, aneesh.kumar@linux.ibm.com,
gregory.price@memverge.com, ying.huang@intel.com,
dan.j.williams@intel.com, hezhongkun.hzk@bytedance.com,
fvdl@google.com, surenb@google.com, rientjes@google.com,
hannes@cmpxchg.org, mhocko@suse.com, Hasan.Maruf@amd.com,
jgroves@micron.com, ravis.opensrc@micron.com,
sthanneeru@micron.com, emirakhur@micron.com,
vtavarespetr@micron.com
Subject: [RFC PATCH] mm/mbind: Introduce process_mbind() syscall for external memory binding
Date: Wed, 22 Nov 2023 15:31:05 -0600 [thread overview]
Message-ID: <ZV5zGROLefrsEcHJ@r13-u19.micron.com> (raw)
From: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
This patch introduces `process_mbind()` to enable a userspace orchestrator that
understands another process's memory layout to alter that process's memory policy.
As system memory configurations grow increasingly complex (e.g., DDR+HBM+CXL memories),
such an orchestrator can apply more advanced techniques to guide memory placement
onto individual NUMA nodes across memory tiers. This allows memory resources to be
allocated more efficiently, improving application performance.
Existing workarounds such as LD_PRELOAD (https://pmem.io/memkind/) or
syscall_intercept (https://github.com/pmem/syscall_intercept) can achieve similar control,
but beyond their lack of portability and universality, these techniques can cause
system incompatibility, inconsistent application behavior, risks from global
system-wide settings, and added implementation complexity.
The concept of an external entity that understands the layout of another process's VM
is already present with `process_madvise()`. Thus, it seems reasonable to introduce
the `process_mbind` variant of `mbind`. The implementation framework of `process_mbind()`
is akin to `process_madvise()`. It uses pidfd of an external process to direct the memory
policy and supports a vector of memory address ranges.
The general use case here is similar to the prior RFC `pidfd_set_mempolicy()`
(https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/),
but offers a more fine-grained external control by binding specific memory regions
(say, heap data structures) to specific NUMA nodes. Another concrete use case was described
by a prior work showing up to 2X runtime improvement (compared to AutoNUMA tiering) using
memory object/region-based memory placement for workloads with irregular access patterns
such as graph analytics: https://iiswc.org/iiswc2022/IISWC2022_42.pdf
The proposed API is as follows:
    long process_mbind(int pidfd,
                       const struct iovec *iovec,
                       unsigned long vlen,
                       unsigned long mode,
                       const unsigned long *nmask,
                       unsigned int flags);
The `pidfd` argument selects the process identified by the PID file descriptor
provided in pidfd. (See pidfd_open(2) for more information.)
The pointer `iovec` points to an array of iovec structures (as described in <sys/uio.h>):
    struct iovec {
        void   *iov_base;   /* starting address of region */
        size_t  iov_len;    /* size of region (in bytes) */
    };
Each `iovec` entry describes a memory region starting at the address iov_base
and spanning iov_len bytes.
The `vlen` argument specifies the number of iovec structures in the array.
Please note that the `maxnode` parameter from `mbind` was omitted so that the
API doesn't exceed 6 arguments; the constant MAX_NUMNODES is used instead.
Please see the mbind(2) man page for more details about the remaining arguments.
Additionally, it is worth noting the following:
- Using a vector of address ranges as an argument gives `process_mbind` more
flexibility than the original `mbind` system call, even when it is applied to
the calling process itself.
- In contrast to `move_pages`, which requires an array of individual fixed-size pages,
`process_mbind` (with the MPOL_MF_MOVE* flags) offers a more convenient and flexible
page migration capability on a per-object or per-region basis.
- Similar to `process_madvise`, manipulating the memory binding of external processes
necessitates `CAP_SYS_NICE` and `PTRACE_MODE_READ_FSCREDS` checks (refer to ptrace(2)).
Suggested-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 4 ++
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
mm/mempolicy.c | 86 +++++++++++++++++++++++++-
5 files changed, 92 insertions(+), 4 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..9d9db49a3242 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
454 common futex_wake sys_futex_wake
455 common futex_wait sys_futex_wait
456 common futex_requeue sys_futex_requeue
+457 common process_mbind sys_process_mbind
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..def5250ed625 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -816,6 +816,10 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				const unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned flags);
+asmlinkage long sys_process_mbind(int pidfd, const struct iovec __user *vec,
+				size_t vlen, unsigned long mode,
+				const unsigned long __user *nmask,
+				unsigned flags);
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..9ed2c91940d6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
__SYSCALL(__NR_futex_wait, sys_futex_wait)
#define __NR_futex_requeue 456
__SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_process_mbind 457
+__SYSCALL(__NR_process_mbind, sys_process_mbind)
#undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e1a6e3c675c0..cc5cb5ae3ae7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
COND_SYSCALL(process_mrelease);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
+COND_SYSCALL(process_mbind);
COND_SYSCALL(get_mempolicy);
COND_SYSCALL(set_mempolicy);
COND_SYSCALL(migrate_pages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..91ee300fa728 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
}
#endif
-static long do_mbind(unsigned long start, unsigned long len,
+static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
unsigned short mode, unsigned short mode_flags,
nodemask_t *nmask, unsigned long flags)
{
- struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
struct vma_iterator vmi;
struct migration_mpol mmpol;
@@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
return 0;
}
+static long kernel_mbind_process(int pidfd, const struct iovec __user *vec,
+				 size_t vlen, unsigned long mode,
+				 const unsigned long __user *nmask, unsigned int flags)
+{
+	ssize_t ret;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned int f_flags;
+	unsigned short mode_flags;
+	int lmode = mode;
+	unsigned long maxnode = MAX_NUMNODES;
+	int err;
+	nodemask_t nodes;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
+
+	err = get_nodes(&nodes, nmask, maxnode);
+	if (err)
+		return err;
+
+	ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack),
+			   &iov, &iter);
+	if (ret < 0)
+		goto out;
+
+	task = pidfd_get_task(pidfd, &f_flags);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_iov;
+	}
+
+	/* From process_madvise: require PTRACE_MODE_READ
+	 * to avoid leaking ASLR metadata. */
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR_OR_NULL(mm)) {
+		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		goto release_task;
+	}
+
+	/* From process_madvise: require CAP_SYS_NICE for
+	 * influencing process performance. */
+	if (!capable(CAP_SYS_NICE)) {
+		ret = -EPERM;
+		goto release_mm;
+	}
+
+	while (iov_iter_count(&iter)) {
+		unsigned long start = untagged_addr(
+				(unsigned long)iter_iov_addr(&iter));
+		unsigned long len = iter_iov_len(&iter);
+
+		ret = do_mbind(mm, start, len, lmode, mode_flags,
+			       &nodes, flags);
+		if (ret < 0)
+			break;
+		iov_iter_advance(&iter, iter_iov_len(&iter));
+	}
+
+release_mm:
+	mmput(mm);
+release_task:
+	put_task_struct(task);
+free_iov:
+	kfree(iov);
+out:
+	return ret;
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1483,7 +1556,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
-	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+	return do_mbind(mm, start, len, lmode, mode_flags, &nodes, flags);
 }
SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
@@ -1553,6 +1626,13 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
return err;
}
+SYSCALL_DEFINE6(process_mbind, int, pidfd, const struct iovec __user *, vec,
+		size_t, vlen, unsigned long, mode,
+		const unsigned long __user *, nmask, unsigned int, flags)
+{
+	return kernel_mbind_process(pidfd, vec, vlen, mode, nmask, flags);
+}
+
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 		unsigned long, mode, const unsigned long __user *, nmask,
 		unsigned long, maxnode, unsigned int, flags)
--
2.41.0