* [patch 00/12] rseq: Implement time slice extension mechanism
@ 2025-09-08 22:59 Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
` (14 more replies)
0 siblings, 15 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
To: LKML
Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
This is the proper implementation of the PoC code, which I posted in reply
to the latest iteration of Prakash's time slice extension patches:
https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
Time slice extensions are an attempt to provide opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.
The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. That
obviously prevents those threads from making progress, in the worst case for
at least a full time slice. This is especially problematic for user space
spinlocks, which are a patently bad idea to begin with, but that's also
true for other mechanisms.
Attempts to solve this go back at least a decade, but so far they
went nowhere. The recent attempts, which started to integrate with the
already existing RSEQ mechanism, have at least been going in the right
direction. The full history is partially in the above mentioned mail thread
and its ancestors, but also in various threads in the LKML archives, which
require archaeological efforts to retrieve.
When trying to morph the PoC into actual mergeable code, I stumbled over
various shortcomings in the RSEQ code, which have been addressed in a
separate effort. The latest iteration can be found here:
https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
That is a prerequisite for this series as it allows a tight integration
into the RSEQ code without inflicting a lot of extra overhead on the hot
paths.
The main change vs. the PoC and the previous attempts is that it utilizes a
new field in the user space ABI rseq struct, which allows the atomic
operations in user space to be reduced to a bare minimum. If the architecture
supports CPU local atomics, which protect against the obvious RMW race
vs. an interrupt, then no actual overhead (e.g. a LOCK prefix on
x86) is required.
The kernel/user space ABI consists of only two bits in this new field:
REQUEST and GRANTED
User space sets REQUEST at the beginning of the critical section. If it
finishes the critical section without interruption then it can clear the
bit and move on.
If it is interrupted and the interrupt return path in the kernel observes a
rescheduling request, then the kernel can grant a time slice extension. The
kernel clears the REQUEST bit and sets the GRANTED bit with a simple
non-atomic store operation. If it does not grant the extension, only the
REQUEST bit is cleared.
If user space observes the REQUEST bit cleared when it finishes the
critical section, then it has to check the GRANTED bit. If that is set,
then it has to invoke the rseq_slice_yield() syscall to terminate the
extension and yield the CPU.
The code flow in user space is:

    // Simple store as there is no concurrency vs. the GRANTED bit
    rseq->slice_ctrl = REQUEST;

    critical_section();

    // CPU local atomic required here:
    if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
        // Non-atomic check is sufficient as this can race
        // against an interrupt, which revokes the grant
        //
        // If not set, then the request was either cleared by the kernel
        // without grant or the grant was revoked.
        //
        // If set, tell the kernel that the critical section is done
        // so it can reschedule
        if (rseq->slice_ctrl & GRANTED)
            rseq_slice_yield();
    }
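For illustration only (not part of the posted patches), the flow above can be
spelled out as a self-contained user space sketch. The RSEQ_SLICE_EXT_*
constants and the slice_ctrl field are the ones introduced by this series; the
BTRL based helper is an x86-64 assumption, the fallback syscall number define
mirrors the value proposed in this series, and critical_section() is a
placeholder:

    #include <stdbool.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Constants as added to uapi/linux/rseq.h by this series */
    #define RSEQ_SLICE_EXT_REQUEST  (1U << 0)
    #define RSEQ_SLICE_EXT_GRANTED  (1U << 1)

    #ifndef __NR_rseq_slice_yield
    # define __NR_rseq_slice_yield  470     /* number proposed in this series */
    #endif

    extern void critical_section(void);     /* placeholder for the protected work */

    /* CPU local test-and-clear of REQUEST: BTRL without LOCK prefix (x86 only) */
    static inline bool local_test_and_clear_request(uint32_t *ctrl)
    {
            bool was_set;

            asm volatile("btrl %2, %0"
                         : "+m" (*ctrl), "=@ccc" (was_set)
                         : "Ir" (0)                     /* bit 0 == REQUEST */
                         : "memory");
            return was_set;
    }

    /* slice_ctrl points to the registered struct rseq::slice_ctrl of this thread */
    static void run_protected(uint32_t *slice_ctrl)
    {
            /* Plain store: no concurrency vs. the GRANTED bit at this point */
            *slice_ctrl = RSEQ_SLICE_EXT_REQUEST;

            critical_section();

            /* If the kernel cleared REQUEST, it might have granted an extension */
            if (!local_test_and_clear_request(slice_ctrl)) {
                    if (*slice_ctrl & RSEQ_SLICE_EXT_GRANTED)
                            syscall(__NR_rseq_slice_yield);
            }
    }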
The other details, which differ from earlier attempts and the PoC, are:
- A separate syscall for terminating the extension to avoid side
effects and overloading of the already ill-defined sched_yield(2)
- A separate per CPU timer, which again does not inflict side effects
on the scheduler internal hrtick timer. The hrtick timer can be
disabled at run-time and an expiry can cause interesting problems in
the scheduler code when it is unexpectedly invoked.
- Tight integration into the rseq exit to user mode code. It utilizes
the path when TIF_RSEQ is not set at the end of exit_to_user_mode()
to arm the timer if an extension was granted. TIF_RSEQ indicates that
the task was scheduled and therefore would revoke the grant anyway.
- A futile attempt to make this "work" on the PREEMPT_LAZY preemption
model which is utilized by PREEMPT_RT.
It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
but not TIF_PREEMPT.
Pretending that this can be made to work for TIF_PREEMPT on a fully
preemptible kernel is just wishful thinking as the chance that
TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
obvious reasons.
This only "works" by some definition of works, i.e. on a best effort
basis, for the PREEMPT_NONE model and nothing else. Though given the
problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
running code sections, the days of these models are hopefully
numbered and everything will be consolidated on the LAZY model.
That would make this distinction moot and restrict everything to
TIF_PREEMPT_LAZY, unless someone is crazy enough to inflict the slice
extension mechanism on the scheduler hotpath. I'm sure there will
be attempts to do that as there is no lack of crazy folks out
there...
- Actual documentation of the user space ABI and an initial self test.
The RSEQ modifications on which this series is based can be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
For your convenience all of it is also available as a conglomerate from
git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Thanks,
tglx
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 129 ++++++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/tools/syscall_32.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/s390/mm/pfault.c | 3
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/entry-common.h | 2
include/linux/rseq.h | 11 +
include/linux/rseq_entry.h | 176 ++++++++++++++++
include/linux/rseq_types.h | 28 ++
include/linux/sched.h | 7
include/linux/syscalls.h | 1
include/linux/thread_info.h | 16 -
include/uapi/asm-generic/unistd.h | 5
include/uapi/linux/prctl.h | 10
include/uapi/linux/rseq.h | 28 ++
init/Kconfig | 12 +
kernel/entry/common.c | 14 +
kernel/entry/syscall-common.c | 11 -
kernel/rcu/tiny.c | 8
kernel/rcu/tree.c | 14 -
kernel/rcu/tree_exp.h | 3
kernel/rcu/tree_plugin.h | 9
kernel/rcu/tree_stall.h | 3
kernel/rseq.c | 293 ++++++++++++++++++++++++++++
kernel/sys.c | 6
kernel/sys_ni.c | 1
scripts/syscall.tbl | 1
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 2
tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++
45 files changed, 991 insertions(+), 42 deletions(-)
* [patch 01/12] sched: Provide and use set_need_resched_current()
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
` (13 subsequent siblings)
14 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
To: LKML
Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
set_tsk_need_resched(current) requires set_preempt_need_resched() to
work correctly outside of the scheduler.
Provide set_need_resched_current() which wraps this correctly and replace
all the open coded instances.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/s390/mm/pfault.c | 3 +--
include/linux/sched.h | 7 +++++++
kernel/rcu/tiny.c | 8 +++-----
kernel/rcu/tree.c | 14 +++++---------
kernel/rcu/tree_exp.h | 3 +--
kernel/rcu/tree_plugin.h | 9 +++------
kernel/rcu/tree_stall.h | 3 +--
7 files changed, 21 insertions(+), 26 deletions(-)
--- a/arch/s390/mm/pfault.c
+++ b/arch/s390/mm/pfault.c
@@ -199,8 +199,7 @@ static void pfault_interrupt(struct ext_
* return to userspace schedule() to block.
*/
__set_current_state(TASK_UNINTERRUPTIBLE);
- set_tsk_need_resched(tsk);
- set_preempt_need_resched();
+ set_need_resched_current();
}
}
out:
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2034,6 +2034,13 @@ static inline int test_tsk_need_resched(
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
+static inline void set_need_resched_current(void)
+{
+ lockdep_assert_irqs_disabled();
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+}
+
/*
* cond_resched() and cond_resched_lock(): latency reduction via
* explicit rescheduling in places that are safe. The return
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -70,12 +70,10 @@ void rcu_qs(void)
*/
void rcu_sched_clock_irq(int user)
{
- if (user) {
+ if (user)
rcu_qs();
- } else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail) {
- set_tsk_need_resched(current);
- set_preempt_need_resched();
- }
+ else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail)
+ set_need_resched_current();
}
/*
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2696,10 +2696,8 @@ void rcu_sched_clock_irq(int user)
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
/* Idle and userspace execution already are quiescent states. */
- if (!rcu_is_cpu_rrupt_from_idle() && !user) {
- set_tsk_need_resched(current);
- set_preempt_need_resched();
- }
+ if (!rcu_is_cpu_rrupt_from_idle() && !user)
+ set_need_resched_current();
__this_cpu_write(rcu_data.rcu_urgent_qs, false);
}
rcu_flavor_sched_clock_irq(user);
@@ -2824,7 +2822,6 @@ static void strict_work_handler(struct w
/* Perform RCU core processing work for the current CPU. */
static __latent_entropy void rcu_core(void)
{
- unsigned long flags;
struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
struct rcu_node *rnp = rdp->mynode;
@@ -2837,8 +2834,8 @@ static __latent_entropy void rcu_core(vo
if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK))) {
rcu_preempt_deferred_qs(current);
} else if (rcu_preempt_need_deferred_qs(current)) {
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ guard(irqsave)();
+ set_need_resched_current();
}
/* Update RCU state based on any recent quiescent states. */
@@ -2847,10 +2844,9 @@ static __latent_entropy void rcu_core(vo
/* No grace period and unregistered callbacks? */
if (!rcu_gp_in_progress() &&
rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp)) {
- local_irq_save(flags);
+ guard(irqsave)();
if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
rcu_accelerate_cbs_unlocked(rnp, rdp);
- local_irq_restore(flags);
}
rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -729,8 +729,7 @@ static void rcu_exp_need_qs(void)
__this_cpu_write(rcu_data.cpu_no_qs.b.exp, true);
/* Store .exp before .rcu_urgent_qs. */
smp_store_release(this_cpu_ptr(&rcu_data.rcu_urgent_qs), true);
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ set_need_resched_current();
}
#ifdef CONFIG_PREEMPT_RCU
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -756,8 +756,7 @@ static void rcu_read_unlock_special(stru
// Also if no expediting and no possible deboosting,
// slow is OK. Plus nohz_full CPUs eventually get
// tick enabled.
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ set_need_resched_current();
if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
cpu_online(rdp->cpu)) {
@@ -818,10 +817,8 @@ static void rcu_flavor_sched_clock_irq(i
if (rcu_preempt_depth() > 0 ||
(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
/* No QS, force context switch if deferred. */
- if (rcu_preempt_need_deferred_qs(t)) {
- set_tsk_need_resched(t);
- set_preempt_need_resched();
- }
+ if (rcu_preempt_need_deferred_qs(t))
+ set_need_resched_current();
} else if (rcu_preempt_need_deferred_qs(t)) {
rcu_preempt_deferred_qs(t); /* Report deferred QS. */
return;
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -763,8 +763,7 @@ static void print_cpu_stall(unsigned lon
* progress and it could be we're stuck in kernel space without context
* switches for an entirely unreasonable amount of time.
*/
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ set_need_resched_current();
}
static bool csd_lock_suppress_rcu_stall;
* [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
2025-09-09 0:04 ` Randy Dunlap
` (2 more replies)
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
` (12 subsequent siblings)
14 siblings, 3 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Peter Zilstra, Arnd Bergmann, linux-arch
Aside of a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member in the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 26 ++++++
include/uapi/linux/rseq.h | 28 +++++++
init/Kconfig | 12 +++
kernel/rseq.c | 8 ++
6 files changed, 204 insertions(+)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,129 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and user-space for three purposes:
+
+ * user-space restartable sequences
+
+ * quick access to read the current CPU number, node ID from user-space
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow user-space to perform update operations on
+per-cpu data without requiring heavy-weight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq user space pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success, otherwise one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
+rseq slice_ctrl field. If the thread is interrupted and the interrupt
+results in a reschedule request in the kernel, then the kernel can grant a
+time slice extension and return to user space instead of scheduling
+out.
+
+The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
+and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
+field. If there is a reschedule of the thread after granting the extension,
+the kernel clears the granted bit to indicate that to user space.
+
+If the request bit is still set when leaving the critical section, user
+space can clear it and continue.
+
+If the granted bit is set, then user space has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving user space from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by user space.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl = REQUEST;
+ critical_section();
+ if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
+ if (rseq->slice_ctrl & GRANTED)
+ rseq_slice_yield();
+ }
+
+local_test_and_clear_bit() has to be local CPU atomic to prevent the
+obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
+without LOCK prefix. On architectures which do not provide lightweight CPU
+local atomics, this needs to be implemented with regular atomic operations.
+
+Setting REQUEST has no atomicity requirements as there is no concurrency
+vs. the GRANTED bit.
+
+Checking the GRANTED bit has no atomicity requirements as there is obviously a
+race which cannot be avoided at all::
+
+ if (rseq->slice_ctrl & GRANTED)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -71,12 +71,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -84,6 +107,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,22 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
+};
+
+enum rseq_slice_bits {
+ /* Time slice extension ABI bits */
+ RSEQ_SLICE_EXT_REQUEST_BIT = 0,
+ RSEQ_SLICE_EXT_GRANTED_BIT = 1,
+};
+
+enum rseq_slice_masks {
+ RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
+ RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
};
/*
@@ -142,6 +164,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control word. CPU local atomic updates from
+ * kernel and user space.
+ */
+ __u32 slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
If unsure, say N.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, that allows the task to complete a critical
+ section, so that other threads are not stuck on a contended resource
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (put_user_masked_u64(0UL, &rseq->rseq_cs))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
+ if (put_user_masked_u32(rseqfl, &rseq->flags))
+ return -EFAULT;
+
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
* [patch 03/12] rseq: Provide static branch for time slice extensions
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
2025-09-09 3:10 ` K Prateek Nayak
2025-09-11 15:42 ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
` (11 subsequent siblings)
14 siblings, 2 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 11 +++++++++++
kernel/rseq.c | 17 +++++++++++++++++
2 files changed, 28 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -77,6 +77,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
#define rseq_inline __always_inline
#endif
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+ return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
bool rseq_debug_validate_ids(struct task_struct *t);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -474,3 +474,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
return 0;
}
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return -EINVAL;
+
+ if (!on)
+ static_branch_disable(&rseq_slice_extension_key);
+ return 0;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [patch 04/12] rseq: Add statistics for time slice extensions
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (2 preceding siblings ...)
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
2025-09-11 15:43 ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
` (10 subsequent siblings)
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
To: LKML
Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Extend the quick statistics with time slice specific fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/rseq_entry.h | 4 ++++
kernel/rseq.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,10 @@ struct rseq_stats {
unsigned long cs;
unsigned long clear;
unsigned long fixup;
+ unsigned long s_granted;
+ unsigned long s_expired;
+ unsigned long s_revoked;
+ unsigned long s_yielded;
};
DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu));
+ stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu));
+ stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu));
+ stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu));
+ }
}
seq_printf(m, "exit: %16lu\n", stats.exit);
@@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+ seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+ seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+ seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+ }
return 0;
}
* [patch 05/12] rseq: Add prctl() to enable time slice extensions
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (3 preceding siblings ...)
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
2025-09-11 15:50 ` Mathieu Desnoyers
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
` (9 subsequent siblings)
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails when time slice extensions are disabled at compile
time or on the kernel command line, or when no rseq pointer is registered
in the kernel.
That allows implementing a single trivial check in the exit to user mode
hotpath to decide whether the whole mechanism needs to be invoked.
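As an illustration (not part of this patch), user space would enable and
query the mechanism roughly as sketched below. The constants match the ones
added to uapi/linux/prctl.h by this patch; the fallback defines are only
needed as long as the updated headers are not installed:

    #include <sys/prctl.h>
    #include <errno.h>

    #ifndef PR_RSEQ_SLICE_EXTENSION
    # define PR_RSEQ_SLICE_EXTENSION        79
    # define PR_RSEQ_SLICE_EXTENSION_GET    1
    # define PR_RSEQ_SLICE_EXTENSION_SET    2
    # define PR_RSEQ_SLICE_EXT_ENABLE       0x01
    #endif

    /* Enable after rseq registration, otherwise the prctl() fails with ENXIO */
    static int slice_extension_enable(void)
    {
            if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
                    return -errno;

            /* Returns PR_RSEQ_SLICE_EXT_ENABLE (1) when enabled, 0 otherwise */
            return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
    }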
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq.h | 9 +++++++
include/uapi/linux/prctl.h | 10 ++++++++
kernel/rseq.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 6 +++++
4 files changed, 77 insertions(+)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -190,4 +190,13 @@ void rseq_syscall(struct pt_regs *regs);
static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -EINVAL;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
#endif /* _LINUX_RSEQ_H */
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -376,4 +376,14 @@ struct prctl_mm_map {
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+
#endif /* _LINUX_PRCTL_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
#include <linux/sched.h>
@@ -490,6 +491,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ switch (arg2) {
+ case PR_RSEQ_SLICE_EXTENSION_GET:
+ if (arg3)
+ return -EINVAL;
+ return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+ case PR_RSEQ_SLICE_EXTENSION_SET: {
+ u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+ if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+ return -EINVAL;
+ if (!rseq_slice_extension_enabled())
+ return -ENOTSUPP;
+ if (!current->rseq.usrptr)
+ return -ENXIO;
+
+ /* No change? */
+ if (enable == !!current->rseq.slice.state.enabled)
+ return 0;
+
+ if (get_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ if (current->rseq.slice.state.enabled)
+ valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if ((rflags & valid) != valid)
+ goto die;
+
+ rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (enable)
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if (put_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ current->rseq.slice.state.enabled = enable;
+ return 0;
+ }
+ default:
+ return -EINVAL;
+ }
+die:
+ force_sig(SIGSEGV);
+ return -EFAULT;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
#include <linux/futex.h>
+#include <linux/rseq.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
case PR_FUTEX_HASH:
error = futex_hash_prctl(arg2, arg3, arg4);
break;
+ case PR_RSEQ_SLICE_EXTENSION:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = rseq_slice_extension_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
* [patch 06/12] rseq: Implement sys_rseq_slice_yield()
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (4 preceding siblings ...)
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-09 9:52 ` K Prateek Nayak
2025-09-10 11:15 ` K Prateek Nayak
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
` (8 subsequent siblings)
14 siblings, 2 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Peter Zijlstra,
Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior
Provide a new syscall whose only purpose is to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows a strict
termination check to catch user space invoking random syscalls,
including sched_yield(), from a time slice extension region.
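As there is no libc wrapper, user space would invoke the new syscall
directly. A minimal sketch (not part of this patch), assuming the syscall
number proposed by this series:

    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_rseq_slice_yield
    # define __NR_rseq_slice_yield  470     /* number proposed in this series */
    #endif

    /*
     * Relinquish the CPU after a granted extension. Returns 1 when the
     * kernel actually scheduled, 0 when no reschedule was pending anymore.
     */
    static inline long rseq_slice_yield(void)
    {
            return syscall(__NR_rseq_slice_yield);
    }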
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/tools/syscall_32.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 ++++-
kernel/rseq.c | 9 +++++++++
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
21 files changed, 32 insertions(+), 1 deletion(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
577 common open_tree_attr sys_open_tree_attr
578 common file_getattr sys_file_getattr
579 common file_setattr sys_file_setattr
+580 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -484,3 +484,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -481,3 +481,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -408,3 +408,4 @@
467 n32 open_tree_attr sys_open_tree_attr
468 n32 file_getattr sys_file_getattr
469 n32 file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -384,3 +384,4 @@
467 n64 open_tree_attr sys_open_tree_attr
468 n64 file_getattr sys_file_getattr
469 n64 file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -457,3 +457,4 @@
467 o32 open_tree_attr sys_open_tree_attr
468 o32 file_getattr sys_file_getattr
469 o32 file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -468,3 +468,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -560,3 +560,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 nospu rseq_slice_yield sys_rseq_slice_yield
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -472,3 +472,4 @@
467 common open_tree_attr sys_open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -475,3 +475,4 @@
467 i386 open_tree_attr sys_open_tree_attr
468 i386 file_getattr sys_file_getattr
469 i386 file_setattr sys_file_setattr
+470 i386 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -393,6 +393,7 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
#
# Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -957,6 +957,7 @@ asmlinkage long sys_statx(int dfd, const
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
unsigned flags,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -858,8 +858,11 @@
#define __NR_file_setattr 469
__SYSCALL(__NR_file_setattr, sys_file_setattr)
+#define __NR_rseq_slice_yield 470
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
#undef __NR_syscalls
-#define __NR_syscalls 470
+#define __NR_syscalls 471
/*
* 32 bit systems traditionally used different
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned
return -EFAULT;
}
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+ if (need_resched()) {
+ schedule();
+ return 1;
+ }
+ return 0;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,5 +390,6 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -410,3 +410,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
* [patch 07/12] rseq: Implement syscall entry work for time slice extensions
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (5 preceding siblings ...)
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-10 5:22 ` K Prateek Nayak
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (7 subsequent siblings)
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows handling the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.
Doing it in syscall entry work allows catching misbehaving user space
which issues a syscall from the critical section. A wrong syscall or
inconsistent user space state results in SIGSEGV.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/entry-common.h | 2 -
include/linux/rseq.h | 2 +
include/linux/thread_info.h | 16 ++++----
kernel/entry/syscall-common.c | 11 ++++-
kernel/rseq.c | 80 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 101 insertions(+), 10 deletions(-)
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
+ SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \
ARCH_SYSCALL_WORK_ENTER)
-
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -191,8 +191,10 @@ static inline void rseq_syscall(struct p
#endif /* !CONFIG_DEBUG_RSEQ */
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
return -EINVAL;
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+ SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
};
-#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
#endif
#include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
}
}
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
{
long ret = 0;
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
return -1L;
}
+ /*
+ * User space got a time slice extension granted and relinquishes
+ * the CPU. The work stops the slice timer to avoid an extra round
+ * through hrtimer_interrupt().
+ */
+ if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+ rseq_syscall_enter_work(syscall);
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -491,6 +491,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+ /*
+ * The interrupt guard is required to prevent inconsistent state in
+ * this case:
+ *
+ * set_tsk_need_resched()
+ * --> Interrupt
+ * wakeup()
+ * set_tsk_need_resched()
+ * set_preempt_need_resched()
+ * schedule_on_return()
+ * clear_tsk_need_resched()
+ * clear_preempt_need_resched()
+ * set_preempt_need_resched() <- Inconsistent state
+ *
+ * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+ * only sets the already set bit and does not create inconsistent
+ * state.
+ */
+ scoped_guard(irq)
+ set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+ u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl;
+ u32 uval;
+
+ if (get_user_masked_u32(&uval, sctrl) || uval != expected)
+ force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+ struct task_struct *curr = current;
+ bool granted = curr->rseq.slice.state.granted;
+
+ clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ rseq_slice_validate_ctrl(granted ? RSEQ_SLICE_EXT_GRANTED : 0);
+
+ /*
+ * The kernel might have raced, revoked the grant and updated
+ * userspace, but kept the SLICE work set.
+ */
+ if (!granted)
+ return;
+
+ rseq_stat_inc(rseq_stats.s_yielded);
+
+ /*
+ * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+ * kernels.
+ */
+ scoped_guard(preempt) {
+ /*
+ * Now that preemption is disabled, quickly check whether
+ * the task was already rescheduled before arriving here.
+ */
+ if (!curr->rseq.event.sched_switch)
+ rseq_slice_set_need_resched(curr);
+ }
+
+ curr->rseq.slice.state.granted = false;
+ /*
+ * Clear the grant in user space and check whether this was the
+ * correct syscall to yield. If the user access fails or the task
+ * used an arbitrary syscall, terminate it.
+ */
+ if (put_user_masked_u32(0U, &curr->rseq.usrptr->slice_ctrl) ||
+ syscall != __NR_rseq_slice_yield)
+ force_sig(SIGSEGV);
+}
+
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
switch (arg2) {
* [patch 08/12] rseq: Implement time slice extension enforcement timer
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (6 preceding siblings ...)
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-10 11:20 ` K Prateek Nayak
2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
` (6 subsequent siblings)
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted, right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 22 +++++++-
include/linux/rseq_types.h | 2
kernel/rseq.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 140 insertions(+), 3 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -88,8 +88,24 @@ static __always_inline bool rseq_slice_e
{
return static_branch_likely(&rseq_slice_extension_key);
}
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ if (likely(!current->rseq.slice.state.granted))
+ return false;
+
+ return __rseq_arm_slice_extension_timer();
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -560,8 +576,12 @@ static __always_inline void clear_tif_rs
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
+ /*
+ * Arm the slice extension timer if nothing to do anymore and the
+ * task really goes out to user space.
+ */
if (likely(!test_tif_rseq(ti_work)))
- return false;
+ return rseq_arm_slice_extension_timer();
if (unlikely(__rseq_exit_to_user_mode_restart(regs)))
return true;
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -87,9 +87,11 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @expires: The time when a grant expires
*/
struct rseq_slice {
union rseq_slice_state state;
+ u64 expires;
};
/**
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
@@ -489,8 +491,82 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
}
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+ struct hrtimer timer;
+ void *cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+ struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+ if (st->cookie == current && current->rseq.slice.state.granted) {
+ rseq_stat_inc(rseq_stats.s_expired);
+ set_need_resched_current();
+ }
+ return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+ struct task_struct *curr = current;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+	 * This check prevents a granted time slice extension from exceeding
+	 * the maximum scheduling latency when the grant expired before
+	 * going out to user space. Don't bother to clear the grant here,
+ * it will be cleaned up automatically before going out to user
+ * space.
+ */
+ if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
+ set_need_resched_current();
+ return true;
+ }
+
+ /*
+ * Store the task pointer as a cookie for comparison in the timer
+ * function. This is safe as the timer is CPU local and cannot be
+ * in the expiry function at this point.
+ */
+ st->cookie = curr;
+ hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /* Arm the syscall entry work */
+ set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+ return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+ /*
+ * st->cookie can be safely read as preemption is disabled and the
+ * timer is CPU local. The active check can obviously race with the
+ * hrtimer interrupt, but that's better than disabling interrupts
+ * unconditionaly right away.
+ *
+ * As this is most probably the first expiring timer, the cancel is
+ * expensive as it has to reprogram the hardware, but that's less
+ * expensive than going through a full hrtimer_interrupt() cycle
+ * for nothing.
+ *
+ * hrtimer_try_to_cancel() is sufficient here as with interrupts
+ * disabled the timer callback cannot be running and the timer base
+ * is well determined as the timer is pinned on the local CPU.
+ */
+ if (st->cookie == current && hrtimer_active(&st->timer)) {
+ scoped_guard(irq)
+ hrtimer_try_to_cancel(&st->timer);
+ }
+}
+
static inline void rseq_slice_set_need_resched(struct task_struct *curr)
{
/*
@@ -548,10 +624,11 @@ void rseq_syscall_enter_work(long syscal
rseq_stat_inc(rseq_stats.s_yielded);
/*
- * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
- * kernels.
+ * Required to stabilize the per CPU timer pointer and to make
+ * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
*/
scoped_guard(preempt) {
+ rseq_cancel_slice_extension_timer();
/*
* Now that preemption is disabled, quickly check whether
* the task was already rescheduled before arriving here.
@@ -631,6 +708,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
return 0;
}
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+ {
+ .procname = "rseq_slice_extension_nsec",
+ .data = &rseq_slice_ext_nsecs,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
+ .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
+ },
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+ if (rseq_slice_extension_enabled())
+ register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif /* !CONFIG_SYSCTL */
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
@@ -643,4 +745,17 @@ static int __init rseq_slice_cmdline(cha
return 0;
}
__setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ }
+ rseq_slice_sysctl_init();
+ return 0;
+}
+device_initcall(rseq_slice_init);
#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [patch 09/12] rseq: Reset slice extension when scheduled
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (7 preceding siblings ...)
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
` (5 subsequent siblings)
14 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out by one of the other
pending work items. When it gets scheduled back in and need_resched() is
not set, the stale grant would be preserved, which is just wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just one
more unconditional store in that path.
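From the user space perspective this means that after such a reschedule
neither the request nor the granted bit is set anymore when the critical
section ends, so no yield is required. A minimal sketch of the check on the
user side (purely illustrative; the bit names and the CPU local
test_and_clear_request() helper are those used in the selftest later in this
series):

	rseq->slice_ctrl = 1U << RSEQ_SLICE_EXT_REQUEST_BIT;
	critical_section();
	if (!test_and_clear_request(&rseq->slice_ctrl)) {
		if (rseq->slice_ctrl & (1U << RSEQ_SLICE_EXT_GRANTED_BIT))
			syscall(__NR_rseq_slice_yield);	/* grant still active */
		/* otherwise the grant was revoked. Nothing to do */
	}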
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -103,9 +103,17 @@ static __always_inline bool rseq_arm_sli
return __rseq_arm_slice_extension_timer();
}
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+ if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+ rseq_stat_inc(rseq_stats.s_revoked);
+ t->rseq.slice.state.granted = false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -404,6 +412,13 @@ bool rseq_set_ids_get_csaddr(struct task
unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
if (csaddr)
unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl, efault);
+ rseq_slice_clear_grant(t);
+ }
user_access_end();
/* Cache the new values */
@@ -518,10 +533,19 @@ static __always_inline bool __rseq_exit_
* If IDs have not changed rseq_event::user_irq must be true
* See rseq_sched_switch_event().
*/
+ struct rseq __user *rseq = t->rseq.usrptr;
u64 csaddr;
- if (unlikely(get_user_masked_u64(&csaddr, &t->rseq.usrptr->rseq_cs)))
+ if (!user_rw_masked_begin(rseq))
goto fail;
+ unsafe_get_user(csaddr, &rseq->rseq_cs, fault);
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl, fault);
+ rseq_slice_clear_grant(t);
+ }
+ user_access_end();
if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -545,6 +569,8 @@ static __always_inline bool __rseq_exit_
t->rseq.event.events = 0;
return false;
+fault:
+ user_access_end();
fail:
pagefault_enable();
/* Force it into the slow path. Don't clear the state! */
* [patch 10/12] rseq: Implement rseq_grant_slice_extension()
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (8 preceding siblings ...)
2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-09 8:14 ` K Prateek Nayak
2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
` (4 subsequent siblings)
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First, an inline quick check avoids
going into the actual decision function unnecessarily. It checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit which causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
If the user space access faults or invalid state is detected, the task is
terminated with SIGSEGV.
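In condensed pseudo code the decision looks roughly like this. The helper
names are made up for illustration only; the real implementation is
rseq_grant_slice_extension() in the diff below:

	/* Inline quick check */
	if (!slice_ext_enabled() || !return_from_user_irq())
		return false;				/* normal exit path */

	/* Slow path, implements the user space ABI */
	if (other_work_pending() || grant_active()) {	/* case B */
		clear_user_slice_ctrl();		/* revoke a previous grant */
		return false;				/* schedule() as usual */
	}
	if (user_requested_extension()) {		/* case A */
		set_user_granted_bit();			/* REQUEST -> GRANTED */
		store_expiry_time();
		return true;				/* skip schedule() this time */
	}
	return false;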
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 111 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 111 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -41,6 +41,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <uapi/linux/rseq.h>
@@ -110,10 +111,120 @@ static __always_inline void rseq_slice_c
t->rseq.slice.state.granted = false;
}
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+ struct task_struct *curr = current;
+ union rseq_slice_state state;
+ struct rseq __user *rseq;
+ u32 usr_ctrl;
+
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ /* If not enabled or not a return from interrupt, nothing to do. */
+ state = curr->rseq.slice.state;
+ state.enabled &= curr->rseq.event.user_irq;
+ if (likely(!state.state))
+ return false;
+
+ rseq = curr->rseq.usrptr;
+ if (!user_rw_masked_begin(rseq))
+ goto die;
+
+ /*
+ * Quick check conditions where a grant is not possible or
+ * needs to be revoked.
+ *
+ * 1) Any TIF bit which needs to do extra work aside of
+ * rescheduling prevents a grant.
+ *
+ * 2) A previous rescheduling request resulted in a slice
+ * extension grant.
+ */
+ if (unlikely(work_pending || state.granted)) {
+ /* Clear user control unconditionally. No point for checking */
+ unsafe_put_user(0U, &rseq->slice_ctrl, fail);
+ user_access_end();
+ rseq_slice_clear_grant(curr);
+ return false;
+ }
+
+ unsafe_get_user(usr_ctrl, &rseq->slice_ctrl, fail);
+ if (likely(!(usr_ctrl & RSEQ_SLICE_EXT_REQUEST))) {
+ user_access_end();
+ return false;
+ }
+
+	/* Grant the slice extension */
+ unsafe_put_user(RSEQ_SLICE_EXT_GRANTED, &rseq->slice_ctrl, fail);
+ user_access_end();
+
+ rseq_stat_inc(rseq_stats.s_granted);
+
+ curr->rseq.slice.state.granted = true;
+ /* Store expiry time for arming the timer on the way out */
+ curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+ /*
+ * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+ * several ways:
+ *
+ * 1)
+ * CPU0 CPU1
+ * clear_tsk()
+ * set_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() -> Folds correctly
+ * 2)
+ * CPU0 CPU1
+ * set_tsk()
+ * clear_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+ *
+ * #1 is not any different from a regular remote reschedule as it
+ * sets the previously not set bit and then raises the IPI which
+ * folds it into the preempt counter
+ *
+ * #2 is obviously incorrect from a scheduler POV, but it's not
+ * differently incorrect than the code below clearing the
+ * reschedule request with the safety net of the timer.
+ *
+ * The important part is that the clearing is protected against the
+ * scheduler IPI and also against any other interrupt which might
+ * end up waking up a task and setting the bits in the middle of
+ * the operation:
+ *
+ * clear_tsk()
+ * ---> Interrupt
+ * wakeup_on_this_cpu()
+ * set_tsk()
+ * set_preempt()
+ * clear_preempt()
+ *
+ * which would be inconsistent state.
+ */
+ scoped_guard(irq) {
+ clear_tsk_need_resched(curr);
+ clear_preempt_need_resched();
+ }
+ return true;
+
+fail:
+ user_access_end();
+die:
+ force_sig(SIGSEGV);
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
* [patch 11/12] entry: Hook up rseq time slice extension
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (9 preceding siblings ...)
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
` (3 subsequent siblings)
14 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Wire the grant decision function up in exit_to_user_mode_loop()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
kernel/entry/common.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
- schedule();
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+ if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ schedule();
+ }
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
* [patch 12/12] selftests/rseq: Implement time slice extension test
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (10 preceding siblings ...)
2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
2025-09-10 11:23 ` K Prateek Nayak
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (2 subsequent siblings)
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
To: LKML
Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and ariving in rseq_slice_yield().
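The latter is roughly the following sequence, which the test below accounts
in the "Raced" counter (conceptual, not actual code):

	user space				kernel
	----------				------
	finds REQUEST already cleared,
	observes GRANTED set
						interrupt, reschedule, grant
						revoked and slice_ctrl cleared
	rseq_slice_yield()
	  need_resched() is false
	  -> returns 0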
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 2
tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++++++++++++
4 files changed, 224 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
param_test_mm_cid_benchmark
param_test_mm_cid_compare_twice
syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test slice_test
TEST_GEN_PROGS_EXTENDED = librseq.so
@@ -59,3 +59,6 @@ include ../lib.mk
$(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
rseq.h rseq-*.h
$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+ $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -164,6 +164,8 @@ struct rseq_abi {
*/
__u32 mm_cid;
+ __u32 slice_ctrl;
+
/*
* Flexible array member at end of structure, after last feature field.
*/
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield 470
+#endif
+
+#define BITS_PER_INT 32
+#define BITS_PER_BYTE 8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT 0
+# define RSEQ_SLICE_EXT_GRANTED_BIT 1
+#endif
+
+#ifndef asm_inline
+# define asm_inline asm __inline
+#endif
+
+#if defined(__x86_64__) || defined(__i386__)
+static __always_inline bool test_and_clear_request(unsigned int *addr)
+{
+ const unsigned int bit = RSEQ_SLICE_EXT_REQUEST_BIT;
+ bool res;
+
+	asm_inline volatile("btrl %[__bit], %[__addr]\n"
+ : [__addr] "+m" (*addr), "=@cc" "c" (res)
+ : [__bit] "Ir" (bit)
+ : "memory");
+ return res;
+}
+#else
+static __always_inline bool test_and_clear_request(unsigned int *addr)
+{
+ const unsigned int mask = (1U << RSEQ_SLICE_EXT_REQUEST_BIT);
+
+ return __atomic_fetch_and(addr, ~mask, __ATOMIC_RELAXED) & mask;
+}
+#endif
+
+static __always_inline void set_request(unsigned int *addr)
+{
+ *addr = 1U << RSEQ_SLICE_EXT_REQUEST_BIT;
+}
+
+static __always_inline bool test_granted(unsigned int *addr)
+{
+ return !!(*addr & (1U << RSEQ_SLICE_EXT_GRANTED_BIT));
+}
+
+#define NSEC_PER_SEC 1000000000L
+#define NSEC_PER_USEC 1000L
+
+struct noise_params {
+ int noise_nsecs;
+ int sleep_nsecs;
+ int run;
+};
+
+FIXTURE(slice_ext)
+{
+ pthread_t noise_thread;
+ struct noise_params noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+ int64_t total_nsecs;
+ int slice_nsecs;
+ int noise_nsecs;
+ int sleep_nsecs;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 2 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 50 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+ int64_t span)
+{
+ int64_t delta = now->tv_sec - start->tv_sec;
+
+ delta *= NSEC_PER_SEC;
+ delta += now->tv_nsec - start->tv_nsec;
+ return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+ struct noise_params *p = arg;
+
+ while (RSEQ_READ_ONCE(p->run)) {
+ struct timespec ts_start, ts_now;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+ ts_start.tv_sec = 0;
+ ts_start.tv_nsec = p->sleep_nsecs;
+ clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+ }
+ return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+ cpu_set_t affinity;
+
+ ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+ /* Pin it on a single CPU. Avoid CPU 0 */
+ for (int i = 1; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+
+ CPU_ZERO(&affinity);
+ CPU_SET(i, &affinity);
+ ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+ break;
+ }
+
+ ASSERT_EQ(rseq_register_current_thread(), 0);
+
+ ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+ self->noise_params.noise_nsecs = variant->noise_nsecs;
+ self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+ self->noise_params.run = 1;
+
+ ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+ self->noise_params.run = 0;
+ pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+ unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+ struct rseq_abi *rs = rseq_get_abi();
+ struct timespec ts_start, ts_now;
+
+ ASSERT_NE(rs, NULL);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ struct timespec ts_cs;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+ set_request(&rs->slice_ctrl);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+ if (!test_and_clear_request(&rs->slice_ctrl)) {
+ if (test_granted(&rs->slice_ctrl)) {
+ yielded++;
+ if (!syscall(__NR_rseq_slice_yield))
+ raced++;
+ } else {
+ scheduled++;
+ }
+ } else {
+ success++;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+ printf("# Success %12ld\n", success);
+ printf("# Yielded %12ld\n", yielded);
+ printf("# Scheduled %12ld\n", scheduled);
+ printf("# Raced %12ld\n", raced);
+}
+
+TEST_HARNESS_MAIN
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-09-09 0:04 ` Randy Dunlap
2025-09-11 15:41 ` Mathieu Desnoyers
2025-09-22 5:28 ` Prakash Sangappa
2 siblings, 0 replies; 54+ messages in thread
From: Randy Dunlap @ 2025-09-09 0:04 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Hi Thomas,
On 9/8/25 3:59 PM, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
>
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
> include/linux/rseq_types.h | 26 ++++++
> include/uapi/linux/rseq.h | 28 +++++++
> init/Kconfig | 12 +++
> kernel/rseq.c | 8 ++
> 6 files changed, 204 insertions(+)
>
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:
userspace or user-space or user space -- be consistent, please.
(above 2 times, and more below)
FWIW, "userspace" overwhelmingly wins in the kernel source tree.
On the $internet it looks like "user space" wins (quick look).
> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartable sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
just heavyweight
> +ABI is unfortunately only available in the code and selftests.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> + * Enabled in Kconfig
> +
> + * Enabled at boot time (default is enabled)
> +
> + * A rseq user space pointer has been registered for the thread
^^^^^^^^^^
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg4 and arg5 must be zero
> +ENOTSUPP Functionality was disabled on the kernel command line
> +ENXIO Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
read-only for
> +space and only for informational purposes.
userspace ?
> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
^^^^^^^^^^
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.
?
> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.
?
> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()
?
> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from
OK, I think that you like "user space". :)
> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> + rseq->slice_ctrl = REQUEST;
> + critical_section();
> + if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> + if (rseq->slice_ctrl & GRANTED)
> + rseq_slice_yield();
> + }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
no comma ^
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> + if (rseq->slice_ctrl & GRANTED)
> + -> Interrupt results in schedule and grant revocation
> + rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.
> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a seperate byte */
separate
("There is a rat in separate." -- old clue)
'arat'
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
> enum rseq_cs_flags {
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>
> If unsure, say N.
>
> +config RSEQ_SLICE_EXTENSION
> + bool "Enable rseq based time slice extension mechanism"
rseq-based
> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> + help
> + Allows userspace to request a limited time slice extension when
Use tab + 2 spaces above instead of N spaces.
> + returning from an interrupt to user space via the RSEQ shared
> + data ABI. If granted, that allows to complete a critical section,
> + so that other threads are not stuck on a conflicted resource,
> + while the task is scheduled out.
--
~Randy
* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-09-09 3:10 ` K Prateek Nayak
2025-09-09 4:11 ` Randy Dunlap
2025-09-11 15:42 ` Mathieu Desnoyers
1 sibling, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-09 3:10 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static int __init rseq_slice_cmdline(char *str)
> +{
> + bool on;
> +
> + if (kstrtobool(str, &on))
> + return -EINVAL;
> +
> + if (!on)
> + static_branch_disable(&rseq_slice_extension_key);
> + return 0;
I believe this should return "1" signalling that the cmdline was handled
correctly to avoid an "Unknown kernel command line parameters" message.
> +}
> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
>
--
Thanks and Regards,
Prateek
* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
2025-09-09 3:10 ` K Prateek Nayak
@ 2025-09-09 4:11 ` Randy Dunlap
2025-09-09 12:12 ` Thomas Gleixner
0 siblings, 1 reply; 54+ messages in thread
From: Randy Dunlap @ 2025-09-09 4:11 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 9/8/25 8:10 PM, K Prateek Nayak wrote:
> Hello Thomas,
>
> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>> +
>> +static int __init rseq_slice_cmdline(char *str)
>> +{
>> + bool on;
>> +
>> + if (kstrtobool(str, &on))
>> + return -EINVAL;
>> +
>> + if (!on)
>> + static_branch_disable(&rseq_slice_extension_key);
>> + return 0;
>
> I believe this should return "1" signalling that the cmdline was handled
> correctly to avoid an "Unknown kernel command line parameters" message.
Good catch. I agree.
Thanks.
>> +}
>> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
>> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
>>
>
--
~Randy
* Re: [patch 10/12] rseq: Implement rseq_grant_slice_extension()
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-09-09 8:14 ` K Prateek Nayak
2025-09-09 12:16 ` Thomas Gleixner
0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-09 8:14 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> #else /* CONFIG_RSEQ_SLICE_EXTENSION */
> static inline bool rseq_slice_extension_enabled(void) { return false; }
> static inline bool rseq_arm_slice_extension_timer(void) { return false; }
> static inline void rseq_slice_clear_grant(struct task_struct *t) { }
> +static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
This is still under the CONFIG_RSEQ block and when building with
CONFIG_RSEQ disabled gives the following error with changes from
Patch 11:
kernel/entry/common.c:40:30: error: implicit declaration of function ‘rseq_grant_slice_extension’ [-Werror=implicit-function-declaration]
40 | if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
Putting the rseq_grant_slice_extension() definition from above in
a separate "ifndef CONFIG_RSEQ_SLICE_EXTENSION" block at the end
keeps the build happy.
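I.e. moving the stub out of the CONFIG_RSEQ block, along the lines of
(illustrative only):

	/* At the very end of include/linux/rseq_entry.h */
	#ifndef CONFIG_RSEQ_SLICE_EXTENSION
	static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
	#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */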
> #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
>
> bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
>
--
Thanks and Regards,
Prateek
* Re: [patch 06/12] rseq: Implement sys_rseq_slice_yield()
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-09-09 9:52 ` K Prateek Nayak
2025-09-09 12:23 ` Thomas Gleixner
2025-09-10 11:15 ` K Prateek Nayak
1 sibling, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-09 9:52 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior
Hello Thomas,
On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned
> return -EFAULT;
> }
>
> +SYSCALL_DEFINE0(rseq_slice_yield)
> +{
> + if (need_resched()) {
> + schedule();
> + return 1;
> + }
> + return 0;
> +}
> +
> static int __init rseq_slice_cmdline(char *str)
> {
> bool on;
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -390,5 +390,6 @@ COND_SYSCALL(setuid16);
>
> /* restartable sequence */
> COND_SYSCALL(rseq);
> +COND_SYSCALL(rseq_sched_yield);
I'm not sure if it is my toolchain but when I try to build a version
with CONFIG_RSEQ_SLICE_EXTENSION disabled, I see:
ld: vmlinux.o: in function `x64_sys_call':
arch/x86/include/generated/asm/syscalls_64.h:471: undefined reference to `__x64_sys_rseq_slice_yield'
ld: vmlinux.o: in function `ia32_sys_call':
arch/x86/include/generated/asm/syscalls_32.h:471: undefined reference to `__ia32_sys_rseq_slice_yield'
ld: vmlinux.o:(.rodata+0x12d0): undefined reference to `__x64_sys_rseq_slice_yield'
I would have assumed the COND_SYSCALL() above would have stubbed this
but that doesn't seem to be the case. Am I missing something?
P.S. I'm running with:
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04.2)
GNU ld (GNU Binutils for Ubuntu) 2.38
--
Thanks and Regards,
Prateek
* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
2025-09-09 4:11 ` Randy Dunlap
@ 2025-09-09 12:12 ` Thomas Gleixner
2025-09-09 16:01 ` Randy Dunlap
0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:12 UTC (permalink / raw)
To: Randy Dunlap, K Prateek Nayak, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Mon, Sep 08 2025 at 21:11, Randy Dunlap wrote:
> On 9/8/25 8:10 PM, K Prateek Nayak wrote:
>> Hello Thomas,
>>
>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>>> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>>> +
>>> +static int __init rseq_slice_cmdline(char *str)
>>> +{
>>> + bool on;
>>> +
>>> + if (kstrtobool(str, &on))
>>> + return -EINVAL;
>>> +
>>> + if (!on)
>>> + static_branch_disable(&rseq_slice_extension_key);
>>> + return 0;
>>
>> I believe this should return "1" signalling that the cmdline was handled
>> correctly to avoid an "Unknown kernel command line parameters" message.
>
> Good catch. I agree.
> Thanks.
It seems I can't get that right ever ....
* Re: [patch 10/12] rseq: Implement rseq_grant_slice_extension()
2025-09-09 8:14 ` K Prateek Nayak
@ 2025-09-09 12:16 ` Thomas Gleixner
0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:16 UTC (permalink / raw)
To: K Prateek Nayak, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Tue, Sep 09 2025 at 13:44, K. Prateek Nayak wrote:
> Hello Thomas,
>
> On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>> #else /* CONFIG_RSEQ_SLICE_EXTENSION */
>> static inline bool rseq_slice_extension_enabled(void) { return false; }
>> static inline bool rseq_arm_slice_extension_timer(void) { return false; }
>> static inline void rseq_slice_clear_grant(struct task_struct *t) { }
>> +static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
>
> This is still under the CONFIG_RSEQ block and when building with
> CONFIG_RSEQ disabled gives the following error with changes from
> Patch 11:
>
> kernel/entry/common.c:40:30: error: implicit declaration of function ‘rseq_grant_slice_extension’ [-Werror=implicit-function-declaration]
> 40 | if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
>
> Putting the rseq_grant_slice_extension() definition from above in
> a separate "ifndef CONFIG_RSEQ_SLICE_EXTENSION" block at the end
> keeps the build happy.
Duh, yes.
* Re: [patch 06/12] rseq: Implement sys_rseq_slice_yield()
2025-09-09 9:52 ` K Prateek Nayak
@ 2025-09-09 12:23 ` Thomas Gleixner
0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:23 UTC (permalink / raw)
To: K Prateek Nayak, LKML
Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior
On Tue, Sep 09 2025 at 15:22, K. Prateek Nayak wrote:
> On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>> /* restartable sequence */
>> COND_SYSCALL(rseq);
>> +COND_SYSCALL(rseq_sched_yield);
>
> I'm not sure if it is my toolchain but when I try to build a version
> with CONFIG_RSEQ_SLICE_EXTENSION disabled, I see:
>
> ld: vmlinux.o: in function `x64_sys_call':
> arch/x86/include/generated/asm/syscalls_64.h:471: undefined reference to `__x64_sys_rseq_slice_yield'
> ld: vmlinux.o: in function `ia32_sys_call':
> arch/x86/include/generated/asm/syscalls_32.h:471: undefined reference to `__ia32_sys_rseq_slice_yield'
> ld: vmlinux.o:(.rodata+0x12d0): undefined reference to `__x64_sys_rseq_slice_yield'
>
> I would have assumed the COND_SYSCALL() above would have stubbed this
> but that doesn't seem to be the case. Am I missing something?
Yes.
>> +COND_SYSCALL(rseq_sched_yield);
does not create a stub for rseq_slice_yield() obviously :)
/me looks for a brown paperbag.
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (11 preceding siblings ...)
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-09-09 12:37 ` Thomas Gleixner
2025-09-10 4:42 ` K Prateek Nayak
2025-09-10 11:28 ` K Prateek Nayak
2025-09-11 15:27 ` Mathieu Desnoyers
14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:37 UTC (permalink / raw)
To: LKML
Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap
On Tue, Sep 09 2025 at 00:59, Thomas Gleixner wrote:
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Force pushed a new version into the branch, which addresses the initial
feedback and fallout.
Thanks,
tglx
* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
2025-09-09 12:12 ` Thomas Gleixner
@ 2025-09-09 16:01 ` Randy Dunlap
0 siblings, 0 replies; 54+ messages in thread
From: Randy Dunlap @ 2025-09-09 16:01 UTC (permalink / raw)
To: Thomas Gleixner, K Prateek Nayak, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 9/9/25 5:12 AM, Thomas Gleixner wrote:
> On Mon, Sep 08 2025 at 21:11, Randy Dunlap wrote:
>> On 9/8/25 8:10 PM, K Prateek Nayak wrote:
>>> Hello Thomas,
>>>
>>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>>>> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>>>> +
>>>> +static int __init rseq_slice_cmdline(char *str)
>>>> +{
>>>> + bool on;
>>>> +
>>>> + if (kstrtobool(str, &on))
>>>> + return -EINVAL;
>>>> +
>>>> + if (!on)
>>>> + static_branch_disable(&rseq_slice_extension_key);
>>>> + return 0;
>>>
>>> I believe this should return "1" signalling that the cmdline was handled
>>> correctly to avoid an "Unknown kernel command line parameters" message.
>>
>> Good catch. I agree.
>> Thanks.
>
> It seems I can't get that right ever ....
Yeah, it's bass-ackwards.
I guess that's partly why we have early_param() and friends.
--
~Randy
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-09-10 4:42 ` K Prateek Nayak
0 siblings, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 4:42 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap
Hello Thomas,
On 9/9/2025 6:07 PM, Thomas Gleixner wrote:
> On Tue, Sep 09 2025 at 00:59, Thomas Gleixner wrote:
>> For your convenience all of it is also available as a conglomerate from
>> git:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Force pushed a new version into the branch, which addresses the initial
> feedback and fallout.
Everything build fine now and the rseq selftests are happy too. Feel
free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
* Re: [patch 07/12] rseq: Implement syscall entry work for time slice extensions
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-09-10 5:22 ` K Prateek Nayak
2025-09-10 7:49 ` Thomas Gleixner
0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 5:22 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> +static inline void rseq_slice_set_need_resched(struct task_struct *curr)
> +{
> + /*
> + * The interrupt guard is required to prevent inconsistent state in
> + * this case:
> + *
> + * set_tsk_need_resched()
> + * --> Interrupt
> + * wakeup()
> + * set_tsk_need_resched()
> + * set_preempt_need_resched()
> + * schedule_on_return()
> + * clear_tsk_need_resched()
> + * clear_preempt_need_resched()
> + * set_preempt_need_resched() <- Inconsistent state
> + *
> + * This is safe vs. a remote set of TIF_NEED_RESCHED because that
> + * only sets the already set bit and does not create inconsistent
> + * state.
> + */
> + scoped_guard(irq)
> + set_need_resched_current();
nit. any specific reason for using a scoped_guard() instead of just a
guard() here (and in rseq_cancel_slice_extension_timer()) other than to
prominently highlight what is being guarded?
> +}
--
Thanks and Regards,
Prateek
* Re: [patch 07/12] rseq: Implement syscall entry work for time slice extensions
2025-09-10 5:22 ` K Prateek Nayak
@ 2025-09-10 7:49 ` Thomas Gleixner
0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-10 7:49 UTC (permalink / raw)
To: K Prateek Nayak, LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Wed, Sep 10 2025 at 10:52, K. Prateek Nayak wrote:
> On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>> +static inline void rseq_slice_set_need_resched(struct task_struct *curr)
>> +{
>> + /*
>> + * The interrupt guard is required to prevent inconsistent state in
>> + * this case:
>> + *
>> + * set_tsk_need_resched()
>> + * --> Interrupt
>> + * wakeup()
>> + * set_tsk_need_resched()
>> + * set_preempt_need_resched()
>> + * schedule_on_return()
>> + * clear_tsk_need_resched()
>> + * clear_preempt_need_resched()
>> + * set_preempt_need_resched() <- Inconsistent state
>> + *
>> + * This is safe vs. a remote set of TIF_NEED_RESCHED because that
>> + * only sets the already set bit and does not create inconsistent
>> + * state.
>> + */
>> + scoped_guard(irq)
>> + set_need_resched_current();
>
> nit. any specific reason for using a scoped_guard() instead of just a
> guard() here (and in rseq_cancel_slice_extension_timer()) other than to
> prominently highlight what is being guarded?
Yes, the intention was to highlight it and scoped_guard() really
does. From a code generation perspective it's the same outcome.
Thanks,
tglx
* Re: [patch 06/12] rseq: Implement sys_rseq_slice_yield()
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-09-09 9:52 ` K Prateek Nayak
@ 2025-09-10 11:15 ` K Prateek Nayak
1 sibling, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:15 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior
Hello Thomas,
On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned
> return -EFAULT;
> }
>
nit.
Perhaps a small note here to highlight how need_resched() is true
for tasks who had the slice extension granted. Something like:
/**
* sys_rseq_slice_yield - yield the current processor if a task granted with
* slice extension is done with the critical work before being forced out.
*
* This syscall entry work ensures NEED_RESCHED is set if the task was granted
* a slice extension before arriving here.
*
* Return: 1 if the task successfully yielded the CPU within the granted slice.
* 0 if the slice extension was either never granted or was revoked by
* going over the granted extension.
*/
> +SYSCALL_DEFINE0(rseq_slice_yield)
> +{
> + if (need_resched()) {
> + schedule();
> + return 1;
> + }
> + return 0;
> +}
--
Thanks and Regards,
Prateek
* Re: [patch 08/12] rseq: Implement time slice extension enforcement timer
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-09-10 11:20 ` K Prateek Nayak
0 siblings, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:20 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> The timer is armed when an extenstion was granted right before actually
nit. s/extenstion/extension/
> returning to user mode in rseq_exit_to_user_mode_restart().
[..snip..]
> +static void rseq_cancel_slice_extension_timer(void)
> +{
> + struct slice_timer *st = this_cpu_ptr(&slice_timer);
> +
> + /*
> + * st->cookie can be safely read as preemption is disabled and the
> + * timer is CPU local. The active check can obviously race with the
> + * hrtimer interrupt, but that's better than disabling interrupts
> + * unconditionaly right away.
nit. s/unconditionaly/unconditionally/
--
Thanks and Regards,
Prateek
* Re: [patch 12/12] selftests/rseq: Implement time slice extension test
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-09-10 11:23 ` K Prateek Nayak
0 siblings, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:23 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> Provide an initial test case to evaluate the functionality. This needs to be
> extended to cover the ABI violations and expose the race condition between
> observing granted and ariving in rseq_slice_yield().
nit. s/ariving/arriving/
I finally managed to trigger that cheeky race condition too :)
# Starting 2 tests from 2 test cases.
# RUN slice_ext.n2_2_50.slice_test ...
# Success 2088616
# Yielded 45097
# Scheduled 174
# Raced 2
# OK slice_ext.n2_2_50.slice_test
--
Thanks and Regards,
Prateek
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (12 preceding siblings ...)
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-09-10 11:28 ` K Prateek Nayak
2025-09-10 14:50 ` Thomas Gleixner
2025-09-11 15:27 ` Mathieu Desnoyers
14 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:28 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Apart from a couple of nit picks, I couldn't spot anything out of place
and the overall approach looks solid. Please feel free to include:
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-10 11:28 ` K Prateek Nayak
@ 2025-09-10 14:50 ` Thomas Gleixner
2025-09-11 3:03 ` K Prateek Nayak
0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-10 14:50 UTC (permalink / raw)
To: K Prateek Nayak, LKML
Cc: Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>> For your convenience all of it is also available as a conglomerate from
>> git:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Apart from a couple of nit picks, I couldn't spot anything out of place
> and the overall approach looks solid. Please feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Thanks a lot for going through it and testing.
Do you have a real workload or a mockup at hand, which benefits
from that slice extension functionality?
It would be really nice to have more than a pretty lame selftest.
thanks,
tglx
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-10 14:50 ` Thomas Gleixner
@ 2025-09-11 3:03 ` K Prateek Nayak
2025-09-11 7:36 ` Prakash Sangappa
0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-11 3:03 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
Hello Thomas,
On 9/10/2025 8:20 PM, Thomas Gleixner wrote:
> On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>> For your convenience all of it is also available as a conglomerate from
>>> git:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>
>> Apart from a couple of nit picks, I couldn't spot anything out of place
>> and the overall approach looks solid. Please feel free to include:
>>
>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Thanks a lot for going through it and testing.
>
> Do you have a real workload or a mockup at hand, which benefits
> from that slice extension functionality?
Not at the moment but we did have some interest in this feature
internally. Give me a week and I'll let you know if they have found a
use-case / have a prototype to test this.
In the meantime, Prakash should have a test bench that he used to
test his early RFC
https://lore.kernel.org/lkml/20241113000126.967713-1-prakash.sangappa@oracle.com/
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-11 3:03 ` K Prateek Nayak
@ 2025-09-11 7:36 ` Prakash Sangappa
0 siblings, 0 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-11 7:36 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Thomas Gleixner, LKML, Peter Zilstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Sep 11, 2025, at 5:03 AM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Thomas,
>
> On 9/10/2025 8:20 PM, Thomas Gleixner wrote:
>> On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
>>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>>> For your convenience all of it is also available as a conglomerate from
>>>> git:
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>>
>>> Apart from a couple of nit picks, I couldn't spot anything out of place
>>> and the overall approach looks solid. Please feel free to include:
>>>
>>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>
>> Thanks a lot for going through it and testing.
>>
>> Do you have a real workload or a mockup at hand, which benefits
>> from that slice extension functionality?
>
> Not at the moment but we did have some interest in this feature
> internally. Give me a week and I'll let you know if they have found a
> use-case / have a prototype to test this.
>
> In the meantime, Prakash should have a test bench that he used to
> test his early RFC
> https://lore.kernel.org/lkml/20241113000126.967713-1-prakash.sangappa@oracle.com/
>
(Have been AFK, and will be for a few more days)
The above was with a database workload. Will coordinate with our database team to get it tested
with the updated API from this patch series.
Thanks,
-Prakash
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (13 preceding siblings ...)
2025-09-10 11:28 ` K Prateek Nayak
@ 2025-09-11 15:27 ` Mathieu Desnoyers
2025-09-11 20:18 ` Thomas Gleixner
14 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:27 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-09-08 18:59, Thomas Gleixner wrote:
> This is the proper implementation of the PoC code, which I posted in reply
> to the latest iteration of Prakash's time slice extension patches:
>
> https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
>
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
>
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out, while holding a resource on which
> the preempting thread or other threads in the system might block on. That
> obviously prevents those threads from making progress in the worst case for
> at least a full time slice. Especially in the context of user space
> spinlocks, which are a patently bad idea to begin with, but that's also
> true for other mechanisms.
>
> This has been attempted to solve at least for a decade, but so far this
> went nowhere. The recent attempts, which started to integrate with the
> already existing RSEQ mechanism, have been at least going into the right
> direction. The full history is partially in the above mentioned mail thread
> and it's ancestors, but also in various threads in the LKML archives, which
it's -> its
> require archaeological efforts to retrieve.
>
> When trying to morph the PoC into actual mergeable code, I stumbled over
> various shortcomings in the RSEQ code, which have been addressed in a
> separate effort. The latest iteration can be found here:
>
> https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
>
> That is a prerequisite for this series as it allows a tight integration
> into the RSEQ code without inflicting a lot of extra overhead into the hot
> paths.
>
> The main change vs. the PoC and the previous attempts is that it utilizes a
> new field in the user space ABI rseq struct, which allows to reduce the
> atomic operations in user space to a bare minimum. If the architecture
> supports CPU local atomics, which protect against the obvious RMW race
> vs. an interrupt, then there is no actual overhead, e.g. LOCK prefix on
> x86, required.
Good!
>
> The kernel user space ABI consists only of two bits in this new field:
>
> REQUEST and GRANTED
>
> User space sets REQUEST at the begin of the critical section. If it
beginning
> finishes the critical section without interruption then it can clear the
> bit and move on.
>
> If it is interrupted and the interrupt return path in the kernel observes a
> rescheduling request, then the kernel can grant a time slice extension. The
> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
> non-atomic store operation. If it does not grant the extension only the
> REQUEST bit is cleared.
>
> If user space observes the REQUEST bit cleared, when it finished the
> critical section, then it has to check the GRANTED bit. If that is set,
> then it has to invoke the rseq_slice_yield() syscall to terminate the
Does it "have" to ? What is the consequence of misbehaving ?
> extension and yield the CPU.
>
> The code flow in user space is:
>
> // Simple store as there is no concurrency vs. the GRANTED bit
> rseq->slice_ctrl = REQUEST;
>
> critical_section();
>
> // CPU local atomic required here:
> if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> // Non-atomic check is sufficient as this can race
> // against an interrupt, which revokes the grant
> //
> // If not set, then the request was either cleared by the kernel
> // without grant or the grant was revoked.
> //
> // If set, tell the kernel that the critical section is done
> // so it can reschedule
> if (rseq->slice_ctrl & GRANTED)
> rseq_slice_yield();
I wonder if we could achieve this without the cpu-local atomic, and
just rely on simple relaxed-atomic or volatile loads/stores and compiler
barriers in userspace. Let's say we have:
union {
u16 slice_ctrl;
struct {
u8 rseq->slice_request;
u8 rseq->slice_grant;
};
};
With userspace doing:
rseq->slice_request = true; /* WRITE_ONCE() */
barrier();
critical_section();
barrier();
rseq->slice_request = false; /* WRITE_ONCE() */
if (rseq->slice_grant) /* READ_ONCE() */
rseq_slice_yield();
In the kernel interrupt return path, if the kernel observes
"rseq->slice_request" set and "rseq->slice_grant" cleared,
it grants the extension and sets "rseq->slice_grant".
rseq_slice_yield() clears rseq->slice_grant.
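A (roughly) compilable sketch of the scheme above, purely for illustration.
The union layout, the helper names and keeping the state outside of struct
rseq are assumptions here, not the ABI proposed in the series; the volatile
accesses stand in for WRITE_ONCE()/READ_ONCE():

	union slice_ctrl {
		unsigned short ctrl;
		struct {
			unsigned char slice_request;	/* written by user space only */
			unsigned char slice_grant;	/* written by the kernel only */
		};
	};

	extern void critical_section(void);	/* application code */
	extern void rseq_slice_yield(void);	/* wrapper around the new syscall */

	#define barrier()	asm volatile("" ::: "memory")

	static void run_with_slice_extension(union slice_ctrl *sc)
	{
		*(volatile unsigned char *)&sc->slice_request = 1;
		barrier();
		critical_section();
		barrier();
		*(volatile unsigned char *)&sc->slice_request = 0;
		/* If the kernel granted an extension, hand the CPU back */
		if (*(volatile unsigned char *)&sc->slice_grant)
			rseq_slice_yield();
	}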
> }
>
> The other details, which differ from earlier attempts and the PoC, are:
>
> - A separate syscall for terminating the extension to avoid side
> effects and overloading of the already ill defined sched_yield(2)
>
> - A separate per CPU timer, which again does not inflict side effects
> on the scheduler internal hrtick timer. The hrtick timer can be
> disabled at run-time and an expiry can cause interesting problems in
> the scheduler code when it is unexpectedly invoked.
>
> - Tight integration into the rseq exit to user mode code. It utilizes
> the path when TIF_RESQ is not set at the end of exit_to_user_mode()
TIF_RSEQ
> to arm the timer if an extension was granted. TIF_RSEQ indicates that
> the task was scheduled and therefore would revoke the grant anyway.
>
> - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
> model which is utilized by PREEMPT_RT.
Can you clarify why this attempt is "futile" ?
Thanks,
Mathieu
>
> It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
> but not TIF_PREEMPT.
>
> Pretending that this can be made work for TIF_PREEMPT on a fully
> preemptible kernel is just wishful thinking as the chance that
> TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
> obvious reasons.
>
> This only "works" by some definition of works, i.e. on a best effort
> basis, for the PREEMPT_NONE model and nothing else. Though given the
> problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
> running code sections, the days of these models should be hopefully
> numbered and everything consolidated on the LAZY model.
>
> That makes this distinction moot and everything restricted to
> TIF_PREEMPT_LAZY unless someone is crazy enough to inflict the slice
> extension mechanism into the scheduler hotpath. I'm sure there will
> be attempts to do that as there is no lack of crazy folks out
> there...
>
> - Actual documentation of the user space ABI and a initial self test.
>
> The RSEQ modifications on which this series is based can be found here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Thanks,
>
> tglx
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++
> arch/alpha/kernel/syscalls/syscall.tbl | 1
> arch/arm/tools/syscall.tbl | 1
> arch/arm64/tools/syscall_32.tbl | 1
> arch/m68k/kernel/syscalls/syscall.tbl | 1
> arch/microblaze/kernel/syscalls/syscall.tbl | 1
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1
> arch/parisc/kernel/syscalls/syscall.tbl | 1
> arch/powerpc/kernel/syscalls/syscall.tbl | 1
> arch/s390/kernel/syscalls/syscall.tbl | 1
> arch/s390/mm/pfault.c | 3
> arch/sh/kernel/syscalls/syscall.tbl | 1
> arch/sparc/kernel/syscalls/syscall.tbl | 1
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> arch/xtensa/kernel/syscalls/syscall.tbl | 1
> include/linux/entry-common.h | 2
> include/linux/rseq.h | 11 +
> include/linux/rseq_entry.h | 176 ++++++++++++++++
> include/linux/rseq_types.h | 28 ++
> include/linux/sched.h | 7
> include/linux/syscalls.h | 1
> include/linux/thread_info.h | 16 -
> include/uapi/asm-generic/unistd.h | 5
> include/uapi/linux/prctl.h | 10
> include/uapi/linux/rseq.h | 28 ++
> init/Kconfig | 12 +
> kernel/entry/common.c | 14 +
> kernel/entry/syscall-common.c | 11 -
> kernel/rcu/tiny.c | 8
> kernel/rcu/tree.c | 14 -
> kernel/rcu/tree_exp.h | 3
> kernel/rcu/tree_plugin.h | 9
> kernel/rcu/tree_stall.h | 3
> kernel/rseq.c | 293 ++++++++++++++++++++++++++++
> kernel/sys.c | 6
> kernel/sys_ni.c | 1
> scripts/syscall.tbl | 1
> tools/testing/selftests/rseq/.gitignore | 1
> tools/testing/selftests/rseq/Makefile | 5
> tools/testing/selftests/rseq/rseq-abi.h | 2
> tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++
> 45 files changed, 991 insertions(+), 42 deletions(-)
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-09-09 0:04 ` Randy Dunlap
@ 2025-09-11 15:41 ` Mathieu Desnoyers
2025-09-11 15:49 ` Mathieu Desnoyers
2025-09-22 5:28 ` Prakash Sangappa
2 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:41 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Michael Jeanson
On 2025-09-08 18:59, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
>
> - Two flag bits for the rseq user space ABI, which allow user space to
> query the availability and enablement without a syscall.
>
> - A new member to the user space ABI struct rseq, which is going to be
> used to communicate request and grant between kernel and user space.
>
> - A rseq state struct to hold the kernel state of this
>
> - Documentation of the new mechanism
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++++++++++++++++++++++++
> include/linux/rseq_types.h | 26 ++++++
> include/uapi/linux/rseq.h | 28 +++++++
> init/Kconfig | 12 +++
> kernel/rseq.c | 8 ++
> 6 files changed, 204 insertions(+)
>
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -21,6 +21,7 @@ System calls
> ebpf/index
> ioctl/index
> mseal
> + rseq
>
> Security-related interfaces
> ===========================
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:
> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space
Also reading the "concurrency ID" (mm_cid).
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
> +ABI is unfortunately only available in the code and selftests.
Note that I've made a man page available here:
https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2
which describes the ABI.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
At what level should we document this here ? Would it be OK to show examples
that rely on librseq helpers ?
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
Note: I suspect we'll also want to add this section to the rseq(2) man page.
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> + * Enabled in Kconfig
> +
> + * Enabled at boot time (default is enabled)
> +
> + * A rseq user space pointer has been registered for the thread
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg4 and arg5 must be zero
> +ENOTSUPP Functionality was disabled on the kernel command line
> +ENXIO Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL Functionality not available or invalid function arguments.
> + Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
> +space and only for informational purposes.
Do those flags have a meaning within the struct rseq_cs @flags field as
well, or just within the struct rseq flags field ?
> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.
> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.
> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()
> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from
> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> + rseq->slice_ctrl = REQUEST;
> + critical_section();
> + if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> + if (rseq->slice_ctrl & GRANTED)
> + rseq_slice_yield();
> + }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> + if (rseq->slice_ctrl & GRANTED)
> + -> Interrupt results in schedule and grant revocation
> + rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.
See my cover letter comments about the algorithm above.
Thanks,
Mathieu
> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -71,12 +71,35 @@ struct rseq_ids {
> };
>
> /**
> + * union rseq_slice_state - Status information for rseq time slice extension
> + * @state: Compound to access the overall state
> + * @enabled: Time slice extension is enabled for the task
> + * @granted: Time slice extension was granted to the task
> + */
> +union rseq_slice_state {
> + u16 state;
> + struct {
> + u8 enabled;
> + u8 granted;
> + };
> +};
> +
> +/**
> + * struct rseq_slice - Status information for rseq time slice extension
> + * @state: Time slice extension state
> + */
> +struct rseq_slice {
> + union rseq_slice_state state;
> +};
> +
> +/**
> * struct rseq_data - Storage for all rseq related data
> * @usrptr: Pointer to the registered user space RSEQ memory
> * @len: Length of the RSEQ region
> * @sig: Signature of critial section abort IPs
> * @event: Storage for event management
> * @ids: Storage for cached CPU ID and MM CID
> + * @slice: Storage for time slice extension data
> */
> struct rseq_data {
> struct rseq __user *usrptr;
> @@ -84,6 +107,9 @@ struct rseq_data {
> u32 sig;
> struct rseq_event event;
> struct rseq_ids ids;
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> + struct rseq_slice slice;
> +#endif
> };
>
> #else /* CONFIG_RSEQ */
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a seperate byte */
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
> enum rseq_cs_flags {
> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
> +
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
> + (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
> + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
> +};
> +
> +enum rseq_slice_bits {
> + /* Time slice extension ABI bits */
> + RSEQ_SLICE_EXT_REQUEST_BIT = 0,
> + RSEQ_SLICE_EXT_GRANTED_BIT = 1,
> +};
> +
> +enum rseq_slice_masks {
> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
> };
>
> /*
> @@ -142,6 +164,12 @@ struct rseq {
> __u32 mm_cid;
>
> /*
> + * Time slice extension control word. CPU local atomic updates from
> + * kernel and user space.
> + */
> + __u32 slice_ctrl;
> +
> + /*
> * Flexible array member at end of structure, after last feature field.
> */
> char end[];
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>
> If unsure, say N.
>
> +config RSEQ_SLICE_EXTENSION
> + bool "Enable rseq based time slice extension mechanism"
> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> + help
> + Allows userspace to request a limited time slice extension when
> + returning from an interrupt to user space via the RSEQ shared
> + data ABI. If granted, that allows to complete a critical section,
> + so that other threads are not stuck on a conflicted resource,
> + while the task is scheduled out.
> +
> + If unsure, say N.
> +
> config DEBUG_RSEQ
> default n
> bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
> */
> SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
> {
> + u32 rseqfl = 0;
> +
> if (flags & RSEQ_FLAG_UNREGISTER) {
> if (flags & ~RSEQ_FLAG_UNREGISTER)
> return -EINVAL;
> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> if (put_user_masked_u64(0UL, &rseq->rseq_cs))
> return -EFAULT;
>
> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> +
> + if (put_user_masked_u32(rseqfl, &rseq->flags))
> + return -EFAULT;
> +
> /*
> * Activate the registration by setting the rseq area address, length
> * and signature in the task struct.
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-09-09 3:10 ` K Prateek Nayak
@ 2025-09-11 15:42 ` Mathieu Desnoyers
1 sibling, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:42 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-09-08 18:59, Thomas Gleixner wrote:
> Guard the time slice extension functionality with a static key, which can
> be disabled on the kernel command line.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
> include/linux/rseq_entry.h | 11 +++++++++++
> kernel/rseq.c | 17 +++++++++++++++++
> 2 files changed, 28 insertions(+)
>
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -77,6 +77,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
> #define rseq_inline __always_inline
> #endif
>
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static __always_inline bool rseq_slice_extension_enabled(void)
> +{
> + return static_branch_likely(&rseq_slice_extension_key);
> +}
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline bool rseq_slice_extension_enabled(void) { return false; }
> +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
> +
> bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
> bool rseq_debug_validate_ids(struct task_struct *t);
>
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -474,3 +474,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>
> return 0;
> }
> +
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static int __init rseq_slice_cmdline(char *str)
> +{
> + bool on;
> +
> + if (kstrtobool(str, &on))
> + return -EINVAL;
> +
> + if (!on)
> + static_branch_disable(&rseq_slice_extension_key);
> + return 0;
as pointed out elsewhere, this should be return 1.
Other than that:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +}
> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 04/12] rseq: Add statistics for time slice extensions
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-09-11 15:43 ` Mathieu Desnoyers
0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:43 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-09-08 18:59, Thomas Gleixner wrote:
> Extend the quick statistics with time slice specific fields.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
> include/linux/rseq_entry.h | 4 ++++
> kernel/rseq.c | 12 ++++++++++++
> 2 files changed, 16 insertions(+)
>
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -15,6 +15,10 @@ struct rseq_stats {
> unsigned long cs;
> unsigned long clear;
> unsigned long fixup;
> + unsigned long s_granted;
> + unsigned long s_expired;
> + unsigned long s_revoked;
> + unsigned long s_yielded;
> };
>
> DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi
> stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
> stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
> stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> + stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu));
> + stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu));
> + stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu));
> + stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu));
> + }
> }
>
> seq_printf(m, "exit: %16lu\n", stats.exit);
> @@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi
> seq_printf(m, "cs: %16lu\n", stats.cs);
> seq_printf(m, "clear: %16lu\n", stats.clear);
> seq_printf(m, "fixup: %16lu\n", stats.fixup);
> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> + seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
> + seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
> + seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
> + seq_printf(m, "syield: %16lu\n", stats.s_yielded);
> + }
> return 0;
> }
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-11 15:41 ` Mathieu Desnoyers
@ 2025-09-11 15:49 ` Mathieu Desnoyers
0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:49 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Michael Jeanson
On 2025-09-11 11:41, Mathieu Desnoyers wrote:
> On 2025-09-08 18:59, Thomas Gleixner wrote:
[...]
>
>> +
>> +The kernel enforces flag consistency and terminates the thread with
>> SIGSEGV
>> +if it detects a violation.
>> --- a/include/linux/rseq_types.h
>> +++ b/include/linux/rseq_types.h
>> @@ -71,12 +71,35 @@ struct rseq_ids {
>> };
>> /**
>> + * union rseq_slice_state - Status information for rseq time slice
>> extension
>> + * @state: Compound to access the overall state
>> + * @enabled: Time slice extension is enabled for the task
>> + * @granted: Time slice extension was granted to the task
>> + */
>> +union rseq_slice_state {
>> + u16 state;
>> + struct {
>> + u8 enabled;
>> + u8 granted;
>> + };
>> +};
>> +
>> +/**
>> + * struct rseq_slice - Status information for rseq time slice extension
>> + * @state: Time slice extension state
>> + */
>> +struct rseq_slice {
>> + union rseq_slice_state state;
>> +};
>> +
>> +/**
>> * struct rseq_data - Storage for all rseq related data
>> * @usrptr: Pointer to the registered user space RSEQ memory
>> * @len: Length of the RSEQ region
>> * @sig: Signature of critial section abort IPs
>> * @event: Storage for event management
>> * @ids: Storage for cached CPU ID and MM CID
>> + * @slice: Storage for time slice extension data
>> */
>> struct rseq_data {
>> struct rseq __user *usrptr;
>> @@ -84,6 +107,9 @@ struct rseq_data {
>> u32 sig;
>> struct rseq_event event;
>> struct rseq_ids ids;
>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>> + struct rseq_slice slice;
>> +#endif
Note: we could move this #ifdef to surround the definition
of both union rseq_slice_state and struct rseq_slice,
and emit an empty structure in the #else case rather than
do the ifdef here.
Thanks,
Mathieu
>> };
>> #else /* CONFIG_RSEQ */
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -23,9 +23,15 @@ enum rseq_flags {
>> };
>> enum rseq_cs_flags_bit {
>> + /* Historical and unsupported bits */
>> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
>> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
>> + /* (3) Intentional gap to put new bits into a seperate byte */
>> +
>> + /* User read only feature flags */
>> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
>> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
>> };
>> enum rseq_cs_flags {
>> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
>> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
>> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>> +
>> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
>> + (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
>> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
>> + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
>> +};
>> +
>> +enum rseq_slice_bits {
>> + /* Time slice extension ABI bits */
>> + RSEQ_SLICE_EXT_REQUEST_BIT = 0,
>> + RSEQ_SLICE_EXT_GRANTED_BIT = 1,
>> +};
>> +
>> +enum rseq_slice_masks {
>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>> };
>> /*
>> @@ -142,6 +164,12 @@ struct rseq {
>> __u32 mm_cid;
>> /*
>> + * Time slice extension control word. CPU local atomic updates from
>> + * kernel and user space.
>> + */
>> + __u32 slice_ctrl;
>> +
>> + /*
>> * Flexible array member at end of structure, after last feature
>> field.
>> */
>> char end[];
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>> If unsure, say N.
>> +config RSEQ_SLICE_EXTENSION
>> + bool "Enable rseq based time slice extension mechanism"
>> + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY &&
>> HAVE_GENERIC_TIF_BITS
>> + help
>> + Allows userspace to request a limited time slice extension
>> when
>> + returning from an interrupt to user space via the RSEQ shared
>> + data ABI. If granted, that allows to complete a critical section,
>> + so that other threads are not stuck on a conflicted resource,
>> + while the task is scheduled out.
>> +
>> + If unsure, say N.
>> +
>> config DEBUG_RSEQ
>> default n
>> bool "Enable debugging of rseq() system call" if EXPERT
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
>> */
>> SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> int, flags, u32, sig)
>> {
>> + u32 rseqfl = 0;
>> +
>> if (flags & RSEQ_FLAG_UNREGISTER) {
>> if (flags & ~RSEQ_FLAG_UNREGISTER)
>> return -EINVAL;
>> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>> if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>> return -EFAULT;
>> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
>> + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> +
>> + if (put_user_masked_u32(rseqfl, &rseq->flags))
>> + return -EFAULT;
>> +
>> /*
>> * Activate the registration by setting the rseq area address,
>> length
>> * and signature in the task struct.
>>
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 05/12] rseq: Add prctl() to enable time slice extensions
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-09-11 15:50 ` Mathieu Desnoyers
2025-09-11 16:52 ` K Prateek Nayak
0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:50 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-09-08 18:59, Thomas Gleixner wrote:
> Implement a prctl() so that tasks can enable the time slice extension
> mechanism. This fails, when time slice extensions are disabled at compile
> time or on the kernel command line and when no rseq pointer is registered
> in the kernel.
>
> That allows to implement a single trivial check in the exit to user mode
> hotpath, to decide whether the whole mechanism needs to be invoked.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
> include/linux/rseq.h | 9 +++++++
> include/uapi/linux/prctl.h | 10 ++++++++
> kernel/rseq.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
> kernel/sys.c | 6 +++++
> 4 files changed, 77 insertions(+)
>
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -190,4 +190,13 @@ void rseq_syscall(struct pt_regs *regs);
> static inline void rseq_syscall(struct pt_regs *regs) { }
> #endif /* !CONFIG_DEBUG_RSEQ */
>
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
> +{
> + return -EINVAL;
> +}
> +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
> +
> #endif /* _LINUX_RSEQ_H */
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -376,4 +376,14 @@ struct prctl_mm_map {
> # define PR_FUTEX_HASH_SET_SLOTS 1
> # define PR_FUTEX_HASH_GET_SLOTS 2
>
> +/* RSEQ time slice extensions */
> +#define PR_RSEQ_SLICE_EXTENSION 79
> +# define PR_RSEQ_SLICE_EXTENSION_GET 1
> +# define PR_RSEQ_SLICE_EXTENSION_SET 2
> +/*
> + * Bits for RSEQ_SLICE_EXTENSION_GET/SET
> + * PR_RSEQ_SLICE_EXT_ENABLE: Enable
> + */
> +# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
> +
> #endif /* _LINUX_PRCTL_H */
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -71,6 +71,7 @@
> #define RSEQ_BUILD_SLOW_PATH
>
> #include <linux/debugfs.h>
> +#include <linux/prctl.h>
> #include <linux/ratelimit.h>
> #include <linux/rseq_entry.h>
> #include <linux/sched.h>
> @@ -490,6 +491,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> #ifdef CONFIG_RSEQ_SLICE_EXTENSION
> DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>
> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
> +{
> + switch (arg2) {
> + case PR_RSEQ_SLICE_EXTENSION_GET:
> + if (arg3)
> + return -EINVAL;
> + return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
> +
> + case PR_RSEQ_SLICE_EXTENSION_SET: {
> + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
> +
> + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
> + return -EINVAL;
> + if (!rseq_slice_extension_enabled())
> + return -ENOTSUPP;
> + if (!current->rseq.usrptr)
> + return -ENXIO;
> +
> + /* No change? */
> + if (enable == !!current->rseq.slice.state.enabled)
> + return 0;
> +
> + if (get_user(rflags, ¤t->rseq.usrptr->flags))
> + goto die;
> +
> + if (current->rseq.slice.state.enabled)
> + valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
> +
> + if ((rflags & valid) != valid)
> + goto die;
> +
> + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
> + rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> + if (enable)
> + rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
> +
> + if (put_user(rflags, ¤t->rseq.usrptr->flags))
> + goto die;
> +
> + current->rseq.slice.state.enabled = enable;
What should happen to this enabled state if rseq is unregistered
after this prctl ?
Thanks,
Mathieu
> + return 0;
> + }
> + default:
> + return -EINVAL;
> + }
> +die:
> + force_sig(SIGSEGV);
> + return -EFAULT;
> +}
> +
> static int __init rseq_slice_cmdline(char *str)
> {
> bool on;
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -53,6 +53,7 @@
> #include <linux/time_namespace.h>
> #include <linux/binfmts.h>
> #include <linux/futex.h>
> +#include <linux/rseq.h>
>
> #include <linux/sched.h>
> #include <linux/sched/autogroup.h>
> @@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
> case PR_FUTEX_HASH:
> error = futex_hash_prctl(arg2, arg3, arg4);
> break;
> + case PR_RSEQ_SLICE_EXTENSION:
> + if (arg4 || arg5)
> + return -EINVAL;
> + error = rseq_slice_extension_prctl(arg2, arg3);
> + break;
> default:
> trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
> error = -EINVAL;
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 05/12] rseq: Add prctl() to enable time slice extensions
2025-09-11 15:50 ` Mathieu Desnoyers
@ 2025-09-11 16:52 ` K Prateek Nayak
2025-09-11 17:18 ` Mathieu Desnoyers
0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-11 16:52 UTC (permalink / raw)
To: Mathieu Desnoyers, Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
Hello Mathieu,
On 9/11/2025 9:20 PM, Mathieu Desnoyers wrote:
>> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
>> +{
>> + switch (arg2) {
>> + case PR_RSEQ_SLICE_EXTENSION_GET:
>> + if (arg3)
>> + return -EINVAL;
>> + return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
>> +
>> + case PR_RSEQ_SLICE_EXTENSION_SET: {
>> + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
>> +
>> + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
>> + return -EINVAL;
>> + if (!rseq_slice_extension_enabled())
>> + return -ENOTSUPP;
>> + if (!current->rseq.usrptr)
>> + return -ENXIO;
>> +
>> + /* No change? */
>> + if (enable == !!current->rseq.slice.state.enabled)
>> + return 0;
>> +
>> + if (get_user(rflags, ¤t->rseq.usrptr->flags))
>> + goto die;
>> +
>> + if (current->rseq.slice.state.enabled)
>> + valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>> +
>> + if ((rflags & valid) != valid)
>> + goto die;
>> +
>> + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>> + rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> + if (enable)
>> + rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>> +
>> + if (put_user(rflags, ¤t->rseq.usrptr->flags))
>> + goto die;
>> +
>> + current->rseq.slice.state.enabled = enable;
>
> What should happen to this enabled state if rseq is unregistered
> after this prctl ?
Wouldn't rseq_reset() deal with it since it does a:
memset(&t->rseq, 0, sizeof(t->rseq));
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 05/12] rseq: Add prctl() to enable time slice extensions
2025-09-11 16:52 ` K Prateek Nayak
@ 2025-09-11 17:18 ` Mathieu Desnoyers
0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 17:18 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch
On 2025-09-11 12:52, K Prateek Nayak wrote:
> Hello Mathieu,
>
> On 9/11/2025 9:20 PM, Mathieu Desnoyers wrote:
>>> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
>>> +{
>>> + switch (arg2) {
>>> + case PR_RSEQ_SLICE_EXTENSION_GET:
>>> + if (arg3)
>>> + return -EINVAL;
>>> + return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
>>> +
>>> + case PR_RSEQ_SLICE_EXTENSION_SET: {
>>> + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>>> + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
>>> +
>>> + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
>>> + return -EINVAL;
>>> + if (!rseq_slice_extension_enabled())
>>> + return -ENOTSUPP;
>>> + if (!current->rseq.usrptr)
>>> + return -ENXIO;
>>> +
>>> + /* No change? */
>>> + if (enable == !!current->rseq.slice.state.enabled)
>>> + return 0;
>>> +
>>> + if (get_user(rflags, ¤t->rseq.usrptr->flags))
>>> + goto die;
>>> +
>>> + if (current->rseq.slice.state.enabled)
>>> + valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>>> +
>>> + if ((rflags & valid) != valid)
>>> + goto die;
>>> +
>>> + rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>>> + rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>>> + if (enable)
>>> + rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>>> +
>>> + if (put_user(rflags, ¤t->rseq.usrptr->flags))
>>> + goto die;
>>> +
>>> + current->rseq.slice.state.enabled = enable;
>>
>> What should happen to this enabled state if rseq is unregistered
>> after this prctl ?
>
> Wouldn't rseq_reset() deal with it since it does a:
>
> memset(&t->rseq, 0, sizeof(t->rseq));
>
Good point, thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-11 15:27 ` Mathieu Desnoyers
@ 2025-09-11 20:18 ` Thomas Gleixner
2025-09-12 12:33 ` Mathieu Desnoyers
0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-11 20:18 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Thu, Sep 11 2025 at 11:27, Mathieu Desnoyers wrote:
> On 2025-09-08 18:59, Thomas Gleixner wrote:
>> If it is interrupted and the interrupt return path in the kernel observes a
>> rescheduling request, then the kernel can grant a time slice extension. The
>> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
>> non-atomic store operation. If it does not grant the extension only the
>> REQUEST bit is cleared.
>>
>> If user space observes the REQUEST bit cleared, when it finished the
>> critical section, then it has to check the GRANTED bit. If that is set,
>> then it has to invoke the rseq_slice_yield() syscall to terminate the
>
> Does it "have" to ? What is the consequence of misbehaving ?
It receives SIGSEGV because that means that it did not follow the rules
and stuck an arbitrary syscall into the critical section.
> I wonder if we could achieve this without the cpu-local atomic, and
> just rely on simple relaxed-atomic or volatile loads/stores and compiler
> barriers in userspace. Let's say we have:
>
> union {
> u16 slice_ctrl;
> struct {
> u8 rseq->slice_request;
> u8 rseq->slice_grant;
Interesting way to define a struct member :)
> };
> };
>
> With userspace doing:
>
> rseq->slice_request = true; /* WRITE_ONCE() */
> barrier();
> critical_section();
> barrier();
> rseq->slice_request = false; /* WRITE_ONCE() */
> if (rseq->slice_grant) /* READ_ONCE() */
> rseq_slice_yield();
That should work as it's strictly CPU local. Good point, now that you
said it it's obvious :)
Let me rework it accordingly.
> In the kernel interrupt return path, if the kernel observes
> "rseq->slice_request" set and "rseq->slice_grant" cleared,
> it grants the extension and sets "rseq->slice_grant".
They can't be both set. If they are then user space fiddled with the
bits.
>> - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
>> model which is utilized by PREEMPT_RT.
>
> Can you clarify why this attempt is "futile" ?
Because on RT, interrupts usually end up with TIF_PREEMPT set either due
to softirqs or interrupt threads. And no, we don't want to
overcomplicate things right now to make it "work" for real-time tasks in
the first place as that's just going to result in either endless
discussions or subtle latency problems or both. For now allowing it for
the 'LAZY' case is good enough.
With the non-RT LAZY model that's not really a good idea either, because
when TIF_PREEMPT is set, then either the preempting task is in an RT
class or the to-be-preempted task has already overrun the LAZY granted
computation time and the scheduler sets TIF_PREEMPT to whack it over the
head.
Thanks,
tglx
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-11 20:18 ` Thomas Gleixner
@ 2025-09-12 12:33 ` Mathieu Desnoyers
2025-09-12 16:31 ` Thomas Gleixner
0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-12 12:33 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-09-11 16:18, Thomas Gleixner wrote:
> On Thu, Sep 11 2025 at 11:27, Mathieu Desnoyers wrote:
>> On 2025-09-08 18:59, Thomas Gleixner wrote:
[...]
>> Does it "have" to ? What is the consequence of misbehaving ?
>
> It receives SIGSEGV because that means that it did not follow the rules
> and stuck an arbitrary syscall into the critical section.
Not following the rules could also be done by just looping for a long
time in userspace within or after the critical section, in which case
the timer should catch it.
>
>> I wonder if we could achieve this without the cpu-local atomic, and
>> just rely on simple relaxed-atomic or volatile loads/stores and compiler
>> barriers in userspace. Let's say we have:
>>
>> union {
>> u16 slice_ctrl;
>> struct {
>> u8 rseq->slice_request;
>> u8 rseq->slice_grant;
>
> Interesting way to define a struct member :)
This goes with the usual warning "this code has never even been
remotely close to a compiler, so handle with care" ;-)
>
>> };
>> };
>>
>> With userspace doing:
>>
>> rseq->slice_request = true; /* WRITE_ONCE() */
>> barrier();
>> critical_section();
>> barrier();
>> rseq->slice_request = false; /* WRITE_ONCE() */
>> if (rseq->slice_grant) /* READ_ONCE() */
>> rseq_slice_yield();
>
> That should work as it's strictly CPU local. Good point, now that you
> said it it's obvious :)
>
> Let me rework it accordingly.
I have two questions wrt ABI here:
1) Do we expect the slice requests to be done from C and higher level
languages or only from assembly ?
2) Slice requests are a good fit for locking. Locking typically
has nesting ability.
We should consider making the slice request ABI an 8-bit
or 16-bit nesting counter to allow nesting of its users.
3) Slice requests are also a good fit for rseq critical sections.
Of course someone could explicitly increment/decrement the
slice request counter before/after the rseq critical sections, but
I think we could do better there and integrate this directly within
the struct rseq_cs as a new critical section flag. Basically, a
critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
better name) set within its descriptor flags would behave as if
the slice request counter is non-zero when preempted without
requiring any extra instruction on the fast path. The only
added overhead would be a check of the rseq->slice_grant flag
when exiting the critical section to conditionally issue
rseq_slice_yield().
This point (3) is an optimization that could come as a future step
if the overhead of incrementing the slice_request proves to be a
bottleneck for rseq critical sections.
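For point (2), a rough user space sketch of how such a nesting counter could
be used; the struct and helper names are purely illustrative. A plain,
non-atomic increment is sufficient because the field is strictly per thread
and, in this scheme, never written by the kernel:

	struct slice_state {
		unsigned char request;	/* nesting counter, user space only */
		unsigned char grant;	/* written by the kernel only */
	};

	extern void rseq_slice_yield(void);

	#define barrier()	asm volatile("" ::: "memory")

	static void slice_enter(struct slice_state *s)
	{
		*(volatile unsigned char *)&s->request += 1;
		barrier();
	}

	static void slice_exit(struct slice_state *s)
	{
		barrier();
		*(volatile unsigned char *)&s->request -= 1;
		/* Only the outermost exit may yield a granted extension */
		if (!*(volatile unsigned char *)&s->request &&
		    *(volatile unsigned char *)&s->grant)
			rseq_slice_yield();
	}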
>
>> In the kernel interrupt return path, if the kernel observes
>> "rseq->slice_request" set and "rseq->slice_grant" cleared,
>> it grants the extension and sets "rseq->slice_grant".
>
> They can't be both set. If they are then user space fiddled with the
> bits.
Ah, yes, that's true if the kernel clears the slice_request when setting
the slice_grant.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-12 12:33 ` Mathieu Desnoyers
@ 2025-09-12 16:31 ` Thomas Gleixner
2025-09-12 19:26 ` Mathieu Desnoyers
0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-12 16:31 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
> On 2025-09-11 16:18, Thomas Gleixner wrote:
>> It receives SIGSEGV because that means that it did not follow the rules
>> and stuck an arbitrary syscall into the critical section.
>
> Not following the rules could also be done by just looping for a long
> time in userspace within or after the critical section, in which case
> the timer should catch it.
It's pretty much impossible for the kernel to tell, without more
overhead, whether that's actually a violation of the rules or not.
The operation after the grant can be interrupted (without resulting in
scheduling), which is out of control of the task which got the extension
granted.
The timer is there to ensure that there is an upper bound to the grant
independent of the actual reason.
Going through a different syscall is an obvious deviation from the rule.
As far as I understood the earlier discussions, scheduler folks want to
enforce that because of PREEMPT_NONE semantics, where a randomly chosen
syscall might not result in an immediate reschedule because the work
which needs to be done takes an arbitrary time to complete.
Though that's arguably not much different from
syscall()
-> tick -> NEED_RESCHED
do_tons_of_work();
exit_to_user()
schedule();
except that in the slice extension case, the latency increases by the
slice extension time.
If we allow arbitrary syscalls to terminate the grant, then we need to
stick an immediate schedule() into the syscall entry work function. We'd
still need the separate yield() syscall to provide a side effect free
way of termination.
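For illustration, such a syscall entry work function could look roughly like
this; it is not part of the series as posted and rseq_slice_revoke_grant() is
a hypothetical helper which would clear the GRANTED bit in the user space
struct (the kernel side state layout follows patch 02):

	static void rseq_slice_syscall_entry_work(void)
	{
		if (likely(!current->rseq.slice.state.granted))
			return;
		/* Hypothetical: clear GRANTED in the user space rseq struct */
		rseq_slice_revoke_grant(current);
		/* Give the CPU back right away instead of sending SIGSEGV */
		schedule();
	}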
I have no strong opinions either way. Peter?
>>> rseq->slice_request = true; /* WRITE_ONCE() */
>>> barrier();
>>> critical_section();
>>> barrier();
>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>> if (rseq->slice_grant) /* READ_ONCE() */
>>> rseq_slice_yield();
>>
>> That should work as it's strictly CPU local. Good point, now that you
>> said it it's obvious :)
>>
>> Let me rework it accordingly.
>
> I have two questions wrt ABI here:
>
> 1) Do we expect the slice requests to be done from C and higher level
> languages or only from assembly ?
It doesn't matter as long as the ordering is guaranteed.
> 2) Slice requests are a good fit for locking. Locking typically
> has nesting ability.
>
> We should consider making the slice request ABI an 8-bit
> or 16-bit nesting counter to allow nesting of its users.
Making request a counter requires keeping request set when the
extension is granted. So the states would be:
  request   granted
     0         0	Neutral
    >0         0	Requested
   >=0         1	Granted
That should work.
Though I'm not really convinced that unconditionally embedding it into
random locking primitives is the right thing to do.
The extension only makes sense when the actual critical section is
small and likely to complete within the extension time, which is usually
only true for highly optimized code and not for general usage, where the
lock-held section is arbitrarily long and might even result in syscalls
even if the critical section itself does not have an obvious explicit
syscall embedded:
lock(a)
lock(b) <- Contention results in syscall
The same applies to library functions within a critical section.
That then immediately conflicts with the yield mechanism rules, because
the extension could have been granted _before_ the syscall happens, so
we'd have to remove that restriction too.
That said, we can make the ABI a counter and split the slice control
word into two u16. So the decision function would be:
get_usr(ctrl);
if (!ctrl.request)
return;
....
ctrl.granted = 1;
put_usr(ctrl);
Along with documentation explaining why this should only be used nested
when you know what you are doing.
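For illustration, that decision function could be filled out along these
lines; the split control word layout and all names are assumptions, not what
the series currently implements:

	struct rseq_slice_word {
		u16 request;	/* nesting counter, written by user space */
		u16 granted;	/* written by the kernel */
	};

	static void rseq_slice_try_grant(struct rseq_slice_word __user *uctrl)
	{
		struct rseq_slice_word ctrl;

		if (copy_from_user(&ctrl, uctrl, sizeof(ctrl)))
			return;
		/* Nothing requested or a grant is already pending */
		if (!ctrl.request || ctrl.granted)
			return;
		ctrl.granted = 1;	/* request stays set for nested users */
		if (copy_to_user(uctrl, &ctrl, sizeof(ctrl)))
			return;
		/* ... arm the enforcement timer and skip the reschedule ... */
	}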
> 3) Slice requests are also a good fit for rseq critical sections.
> Of course someone could explicitly increment/decrement the
> slice request counter before/after the rseq critical sections, but
> I think we could do better there and integrate this directly within
> the struct rseq_cs as a new critical section flag. Basically, a
> critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
> better name) set within its descriptor flags would behave as if
> the slice request counter is non-zero when preempted without
> requiring any extra instruction on the fast path. The only
> added overhead would be a check of the rseq->slice_grant flag
> when exiting the critical section to conditionally issue
> rseq_slice_yield().
Plus checking first whether rseq->slice.request is actually zero,
i.e. whether the rseq critical section was the outermost one. If not,
you cannot invoke the yield even if granted is true, right?
But mixing state spaces is not really a good idea at all. Let's not go
there.
Also you'd make checking of rseq_cs unconditional, which means extra
work in the grant decision function as it would then have to do:
if (!usr->slice.ctrl.request) {
if (!usr->rseq_cs)
return;
if (!valid_ptr(usr->rseq_cs))
goto die;
if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
return;
if (!(usr->rseq_cs.flags & REQUEST))
return;
}
IOW, we'd copy half of the rseq cs handling into that code.
Can we please keep it independent and simple?
Thanks,
tglx
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-12 16:31 ` Thomas Gleixner
@ 2025-09-12 19:26 ` Mathieu Desnoyers
2025-09-13 13:02 ` Thomas Gleixner
0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-12 19:26 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Florian Weimer, carlos@redhat.com, libc-coord
[ For those just CC'd on this thread, the discussion is about time slice
extension for userspace critical sections. We are specifically
discussing the kernel ABI we plan to expose to userspace. ]
On 2025-09-12 12:31, Thomas Gleixner wrote:
> On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
>> On 2025-09-11 16:18, Thomas Gleixner wrote:
>>> It receives SIGSEGV because that means that it did not follow the rules
>>> and stuck an arbitrary syscall into the critical section.
>>
>> Not following the rules could also be done by just looping for a long
>> time in userspace within or after the critical section, in which case
>> the timer should catch it.
>
> It's pretty much impossible to tell for the kernel without more
> overhead, whether that's actually a violation of the rules or not.
>
> The operation after the grant can be interrupted (without resulting in
> scheduling), which is out of control of the task which got the extension
> granted.
>
> The timer is there to ensure that there is an upper bound to the grant
> independent of the actual reason.
If the worst side-effect of this feature is that the slice extension
is not granted when users misbehave, IMHO this would increase the
likelihood of adoption compared to failure modes that end up killing the
offending processes.
>
> Going through a different syscall is an obvious deviation from the rule.
AFAIU, the grant is cleared when a signal handler is delivered, which
makes it OK for signals to issue system calls even if they are nested
on top of a granted extension critical section.
>
> As far I understood the earlier discussions, scheduler folks want to
> enforce that because of PREEMPT_NONE semantics, where a randomly chosen
> syscall might not result in an immediate reschedule because the work,
> which needs to be done takes arbitrary time to complete.
>
> Though that's arguably not much different from
>
> syscall()
> -> tick -> NEED_RESCHED
> do_tons_of_work();
> exit_to_user()
> schedule();
>
> except that in the slice extension case, the latency increases by the
> slice extension time.
>
> If we allow arbitrary syscalls to terminate the grant, then we need to
> stick an immediate schedule() into the syscall entry work function. We'd
> still need the separate yield() syscall to provide a side effect free
> way of termination.
>
> I have no strong opinions either way. Peter?
If it happens to not be too bothersome to allow arbitrary system calls
to act as implicit rseq_slice_yield() rather than result in a
segmentation fault, I think it would make this feature more widely
adopted.
Another scenario I have in mind is a userspace critical section that
would typically benefit from slice extension, but seldom needs
to issue a system call. In C and higher level languages, that could be
very much outside of the user's control, such as accessing a
global-dynamic TLS variable located within a global-dynamic shared
object, which can trigger memory allocation under the hood on first
access.
Handling a syscall within a granted extension by killing the process
will likely reserve this feature for niche use-cases.
>
>>>> rseq->slice_request = true; /* WRITE_ONCE() */
>>>> barrier();
>>>> critical_section();
>>>> barrier();
>>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>>> if (rseq->slice_grant) /* READ_ONCE() */
>>>> rseq_slice_yield();
>>>
>>> That should work as it's strictly CPU local. Good point, now that you
>>> said it it's obvious :)
>>>
>>> Let me rework it accordingly.
>>
>> I have two questions wrt ABI here:
>>
>> 1) Do we expect the slice requests to be done from C and higher level
>> languages or only from assembly ?
>
> It doesn't matter as long as the ordering is guaranteed.
OK, so I understand that you intend to target higher level languages
as well, which makes my second point (nesting) relevant.
>
>> 2) Slice requests are a good fit for locking. Locking typically
>> has nesting ability.
>>
>> We should consider making the slice request ABI a 8-bit
>> or 16-bit nesting counter to allow nesting of its users.
>
> Making request a counter requires to keep request set when the
> extension is granted. So the states would be:
>
>    request   granted
>       0         0      Neutral
>      >0         0      Requested
>     >=0         1      Granted
Yes.
>
> That should work.
>
> Though I'm not really convinced that unconditionally embedding it into
> random locking primitives is the right thing to do.
Me neither. I wonder what would be a good approach to integrate this
with locking APIs. Here are a few ideas, some worse than others:
- Extend pthread_mutexattr_t to set whether the mutex should be
slice-extended. Downside: if a mutex has some long and some
short critical sections, it's really a one-size-fits-all decision
for all critical sections for that mutex.
- Extend the pthread_mutex_lock/trylock with new APIs to allow
specifying whether slice-extension is needed for the upcoming critical
section.
- Just let the pthread_mutex_lock caller explicitly request the
slice extension *after* grabbing the lock. Downside: this opens
a window of a few instructions where preemption can happen
and slice extension would have been useful. Should we care ?
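For the last option, a minimal sketch of such a wrapper; slice_request() and
slice_complete() stand in for whatever set/clear-and-maybe-yield sequence the
final ABI ends up with, so their names are hypothetical:

    #include <pthread.h>

    /* Hypothetical helpers around the rseq slice control word:
     * slice_request() sets REQUEST, slice_complete() clears it and calls
     * rseq_slice_yield() if GRANTED was set in the meantime. */
    extern void slice_request(void);
    extern void slice_complete(void);

    static inline int mutex_lock_short_cs(pthread_mutex_t *m)
    {
            int ret = pthread_mutex_lock(m);

            /* Request only once the lock is held; the few instructions in
             * between are the unprotected window mentioned above. */
            if (ret == 0)
                    slice_request();
            return ret;
    }

    static inline int mutex_unlock_short_cs(pthread_mutex_t *m)
    {
            /* Under the strict rules the grant has to be terminated before
             * the unlock, as pthread_mutex_unlock() may enter futex() itself. */
            slice_complete();
            return pthread_mutex_unlock(m);
    }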
>
> The extension makes only sense, when the actual critical section is
> small and likely to complete within the extension time, which is usually
> only true for highly optimized code and not for general usage, where the
> lock held section is arbitrary long and might even result in syscalls
> even if the critical section itself does not have an obvious explicit
> syscall embedded:
>
> lock(a)
> lock(b) <- Contention results in syscall
Nested locking is another scenario where _typically_ we'd want the
slice extension for the outer lock if it is expected to be a short
critical section; the inner lock may sometimes hit futex while the extension
is granted, and the grant should then be cleared without killing the process.
>
> Same applies for library functions within a critical section.
Yes.
>
> That then immediately conflicts with the yield mechanism rules, because
> the extension could have been granted _before_ the syscall happens, so
> we'd have to remove that restriction too.
Yes.
>
> That said, we can make the ABI a counter and split the slice control
> word into two u16. So the decision function would be:
>
> get_usr(ctrl);
> if (!ctrl.request)
> return;
> ....
> ctrl.granted = 1;
> put_usr(ctrl);
>
> Along with documentation why this should only be used nested when you
> know what you are doing.
Yes.
This would turn the end of the critical section into a
decrement-and-test-for-zero. It's only when the request counter
decrements back to zero that userspace should handle the granted
flag and yield.
>
>> 3) Slice requests are also a good fit for rseq critical sections.
>> Of course someone could explicitly increment/decrement the
>> slice request counter before/after the rseq critical sections, but
>> I think we could do better there and integrate this directly within
>> the struct rseq_cs as a new critical section flag. Basically, a
>> critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>> better name) set within its descriptor flags would behave as if
>> the slice request counter is non-zero when preempted without
>> requiring any extra instruction on the fast path. The only
>> added overhead would be a check of the rseq->slice_grant flag
>> when exiting the critical section to conditionally issue
>> rseq_slice_yield().
>
> Plus checking first whether rseq->slice.request is actually zero,
> i.e. whether the rseq critical section was the outermost one. If not,
> you cannot invoke the yield even if granted is true, right?
Right.
>
> But mixing state spaces is not really a good idea at all. Let's not go
> there.
I agree, let's keep this (3) for later if there is a strong use-case
justifying the complexity.
What is important for right now though is to figure out the behavior
with respect to an ongoing rseq critical section when a time slice
extension is granted: is the rseq critical section aborted or does
it keep going on return to userspace ?
>
> Also you'd make checking of rseq_cs unconditional, which means extra
> work in the grant decision function as it would then have to do:
>
> if (!usr->slice.ctrl.request) {
> if (!usr->rseq_cs)
> return;
> if (!valid_ptr(usr->rseq_cs))
> goto die;
> if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
> return;
> if (!(usr->rseq_cs.flags & REQUEST))
> return;
> }
>
> IOW, we'd copy half of the rseq cs handling into that code.
>
> Can we please keep it independent and simple?
Of course.
So in summary, here is my current understanding:
- It would be good to support nested slice-extension requests,
- It would be preferable to allow arbitrary system calls to
cancel an ongoing slice-extension grant rather than kill the
process if we want the slice-extension feature to be useful
outside of niche use-cases.
Thoughts ?
Thanks,
Mathieu
>
> Thanks,
>
> tglx
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-12 19:26 ` Mathieu Desnoyers
@ 2025-09-13 13:02 ` Thomas Gleixner
2025-09-19 17:30 ` Prakash Sangappa
0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-13 13:02 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Florian Weimer, carlos@redhat.com, libc-coord
On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>> 2) Slice requests are a good fit for locking. Locking typically
>>> has nesting ability.
>>>
>>> We should consider making the slice request ABI a 8-bit
>>> or 16-bit nesting counter to allow nesting of its users.
>>
>> Making request a counter requires to keep request set when the
>> extension is granted. So the states would be:
>>
>>    request   granted
>>       0         0      Neutral
>>      >0         0      Requested
>>     >=0         1      Granted
>
Second thoughts on this.
Such a scheme means that slice_ctrl.request must be read-only for the
kernel because otherwise the user space decrement would need to be an
atomic dec_if_not_zero(). We just argued the one atomic operation away. :)
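For illustration, this is roughly what the userspace exit path would have to
do if the kernel were allowed to modify the request half of the word as well;
this is a sketch of the operation being argued against, with the request
counter assumed to live in the low 16 bits:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static bool slice_request_dec_if_not_zero(_Atomic uint32_t *ctrl)
    {
            uint32_t old = atomic_load_explicit(ctrl, memory_order_relaxed);

            do {
                    if (!(old & 0xffff))    /* request counter already zero */
                            return false;
            } while (!atomic_compare_exchange_weak_explicit(ctrl, &old, old - 1,
                                                            memory_order_relaxed,
                                                            memory_order_relaxed));
            return true;
    }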
That means the kernel can only set and clear Granted. That in turn
loses the information about whether a slice extension was denied or revoked,
which was something the Oracle people wanted to have. I'm not sure
whether that was a functional requirement or more of an instrumentation feature.
But what's worse: this is a recipe for disaster as it creates obviously
subtle and hard-to-debug ways to leak an increment, which means the
request would stay active forever, defeating the whole purpose.
And no, the kernel cannot keep track of the counter and observe whether
it became zero at some point or not. You surely could come up with a
convoluted scheme to work around that in the form of sequence counters or
whatever, but that just creates extra complexity for a very dubious
value.
The point is that the time slice extension is just providing an
opportunistic priority ceiling mechanism with low overhead and without
guarantees.
Once a request is not granted or revoked, the performance of that
particular operation goes south no matter what. Nesting does not help
there at all, which is a strong argument for using KISS as the primary
engineering principle here.
The boolean request/granted pair is simple and very well
defined. It does not suffer from any of those problems.
If user space wants nesting, then it can do so on its own without
creating an ill defined and fragile kernel/user ABI. We created enough
of them in the past and all of them resulted in long term headaches.
> Handling syscall within granted extension by killing the process
I'm absolutely not opposed to lifting the syscall restriction to make
things easier, but this is the wrong argument for it:
> will likely reserve this feature to the niche use-cases.
Having this used only by people who actually know what they are doing is
in fact the preferred outcome.
We've seen it over and over that supposedly "easy" features result in
mindless overutilization because everyone and his dog thinks they need
them just because, and for the very wrong reasons. The unconditional
usage of the most power-hungry floating point extensions just because
they are available is only one example of many.
Thanks,
tglx
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-13 13:02 ` Thomas Gleixner
@ 2025-09-19 17:30 ` Prakash Sangappa
2025-09-22 14:09 ` Mathieu Desnoyers
0 siblings, 1 reply; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-19 17:30 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, Peter Zilstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Florian Weimer,
carlos@redhat.com, libc-coord@lists.openwall.com
> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>> has nesting ability.
>>>>
>>>> We should consider making the slice request ABI a 8-bit
>>>> or 16-bit nesting counter to allow nesting of its users.
>>>
>>> Making request a counter requires to keep request set when the
>>> extension is granted. So the states would be:
>>>
>>>    request   granted
>>>       0         0      Neutral
>>>      >0         0      Requested
>>>     >=0         1      Granted
>>
>
> Second thoughts on this.
>
> Such a scheme means that slice_ctrl.request must be read only for the
> kernel because otherwise the user space decrement would need to be an
> atomic dec_if_not_zero(). We just argued the one atomic operation away. :)
>
> That means, the kernel can only set and clear Granted. That in turn
> loses the information whether a slice extension was denied or revoked,
> which was something the Oracle people wanted to have. I'm not sure
> whether that was a functional or more a instrumentation feature.
The denied indication was mainly instrumentation for observability to see
if a user application would attempt to set ‘REQUEST’ again without yielding.
>
> But what's worse: this is a receipe for disaster as it creates obviously
> subtle and hard to debug ways to leak an increment, which means the
> request would stay active forever defeating the whole purpose.
>
> And no, the kernel cannot keep track of the counter and observe whether
> it became zero at some point or not. You surely could come up with a
> convoluted scheme to work around that in form of sequence counters or
> whatever, but that just creates extra complexity for a very dubious
> value.
>
> The point is that the time slice extension is just providing an
> opportunistic priority ceiling mechanism with low overhead and without
> guarantees.
>
> Once a request is not granted or revoked, the performance of that
> particular operation goes south no matter what. Nesting does not help
> there at all, which is a strong argument for using KISS as the primary
> engineering principle here.
>
> The simple boolean request/granted pair is simple and very well
> defined. It does not suffer from any of those problems.
Agree, I think keeping the API simple will be preferable. The request/granted
sequence makes sense.
>
> If user space wants nesting, then it can do so on its own without
> creating an ill defined and fragile kernel/user ABI. We created enough
> of them in the past and all of them resulted in long term headaches.
I guess user space should be able to handle nesting, possibly without the need for a counter?
AFAICS, can’t the nested request to extend the slice be handled by checking
whether both the ‘REQUEST’ and ‘GRANTED’ bits are zero? If so, attempt to request a
slice extension. Otherwise, if either the REQUEST or GRANTED bit is set, then a slice
extension has already been requested or granted.
>
>> Handling syscall within granted extension by killing the process
>
> I'm absolutely not opposed to lift the syscall restriction to make
> things easier, but this is the wrong argument for it:
Killing the process seems drastic, and could deter use of this feature.
Can the consequence of making a system call be handled by calling schedule()
in the syscall entry path if an extension was granted, as you were implying?
Thanks
-Prakash
>
>> will likely reserve this feature to the niche use-cases.
>
> Having this used only by people who actually know what they are doing is
> actually the preferred outcome.
>
> We've seen it over and over that supposedly "easy" features result in
> mindless overutilization because everyone and his dog thinks they need
> them just because and for the very wrong reasons. The unconditional
> usage of the most power hungry floating point extensions just because
> they are available, is only one example of many.
>
> Thanks,
>
> tglx
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-09-09 0:04 ` Randy Dunlap
2025-09-11 15:41 ` Mathieu Desnoyers
@ 2025-09-22 5:28 ` Prakash Sangappa
2025-09-22 5:57 ` K Prateek Nayak
2025-09-22 13:55 ` Mathieu Desnoyers
2 siblings, 2 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-22 5:28 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
..
> +enum rseq_slice_masks {
> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
> };
>
> /*
> @@ -142,6 +164,12 @@ struct rseq {
> __u32 mm_cid;
>
> /*
> + * Time slice extension control word. CPU local atomic updates from
> + * kernel and user space.
> + */
> + __u32 slice_ctrl;
We intend to backport the slice extension feature to older kernel versions.
With the use of a new structure member for slice control, could there be a discrepancy
with the rseq structure size (older version) registered by libc? In that case the application
may not be able to use the slice extension feature unless libc’s use of rseq is disabled.
The application would have to verify the structure size, so should that be mentioned in the
documentation? Also, perhaps make the prctl() enable call return an error if the structure size
does not match?
With regard to the application determining the address and size of the rseq structure
registered by libc, what are your thoughts on getting that through the rseq(2)
system call or a prctl() call, instead of dealing with the __weak symbols, as was discussed here:
https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
Thanks,
-Prakash
> +
> + /*
> * Flexible array member at end of structure, after last feature field.
> */
> char end[];
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-22 5:28 ` Prakash Sangappa
@ 2025-09-22 5:57 ` K Prateek Nayak
2025-09-22 13:57 ` Mathieu Desnoyers
2025-09-22 13:55 ` Mathieu Desnoyers
1 sibling, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-22 5:57 UTC (permalink / raw)
To: Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
Hello Prakash,
On 9/22/2025 10:58 AM, Prakash Sangappa wrote:
> With use of a new structure member for slice control, could there be discrepancy
> with rseq structure size(older version) registered by libc? In that case the application
> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
In this case, wouldn't GLIBC's rseq registration fail if the presumed
__rseq_size is smaller than the "struct rseq" size?
And if it has allocated a large enough area, then the prctl() should
help to query the slice extension feature's availability.
--
Thanks and Regards,
Prateek
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-22 5:28 ` Prakash Sangappa
2025-09-22 5:57 ` K Prateek Nayak
@ 2025-09-22 13:55 ` Mathieu Desnoyers
2025-09-23 0:57 ` Prakash Sangappa
1 sibling, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 13:55 UTC (permalink / raw)
To: Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org, Michael Jeanson
On 2025-09-22 01:28, Prakash Sangappa wrote:
>
>
>> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
> ..
>> +enum rseq_slice_masks {
>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>> };
>>
>> /*
>> @@ -142,6 +164,12 @@ struct rseq {
>> __u32 mm_cid;
>>
>> /*
>> + * Time slice extension control word. CPU local atomic updates from
>> + * kernel and user space.
>> + */
>> + __u32 slice_ctrl;
>
> We intend to backport the slice extension feature to older kernel versions.
>
> With use of a new structure member for slice control, could there be discrepancy
> with rseq structure size(older version) registered by libc? In that case the application
> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
The rseq extension scheme allows this to work seamlessly.
You will need a glibc 2.41+, which uses the getauxval(3)
AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN to query the feature size
supported by the Linux kernel. It allocates a per-thread memory
area which is large enough to support that feature set, and
registers it to the kernel through rseq(2) on thread creation.
Note that before we had the extensible rseq scheme, glibc registered
a 32-byte structure (including padding at the end), which is considered
as the rseq "original" registration size.
The "mm_cid" field ends at 28 bytes, which leaves 4 bytes of padding at
the end of the original rseq structure. Considering that the time slice
extension fields will likely fit within those 4 bytes, I expect that
applications linked against glibc [2.35, 2.40] will also be able to use
those fields. Those applications should use getauxval(3)
AT_RSEQ_FEATURE_SIZE to validate whether the kernel populates this field
or if it's just padding.
Note that this all works even if you backport the feature to an older kernel:
the rseq extension scheme does not depend on querying the kernel version at
all. You will however be required to backport the support for additional
rseq fields that come before the time slice, such as node_id and mm_cid,
if they are not implemented in your older kernel.
>
> Application would have to verify structure size, so should it be mentioned in the
> documentation.
Yes, applications should check that glibc's __rseq_size is large enough to fit
the new slice field(s), *and* for the original rseq size special case
(32 bytes including padding), those would need to query getauxval(3)
AT_RSEQ_FEATURE_SIZE to make sure the field is indeed supported.
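A sketch of that check from the application side. It assumes glibc 2.35+ (for
__rseq_size from <sys/rseq.h>) and that the slice control word ends at offset
32 within struct rseq; both are assumptions of this example, not settled ABI:

    #include <sys/auxv.h>
    #include <sys/rseq.h>           /* glibc 2.35+: __rseq_offset, __rseq_size */

    #ifndef AT_RSEQ_FEATURE_SIZE
    #define AT_RSEQ_FEATURE_SIZE    27      /* <linux/auxvec.h>, Linux >= 6.3 */
    #endif

    #define SLICE_CTRL_END          32      /* assumed end offset of the new field */

    static int rseq_slice_field_usable(void)
    {
            /* glibc must have registered an area covering the field ... */
            if (__rseq_size < SLICE_CTRL_END)
                    return 0;
            /* ... and the kernel must actually maintain it; for the original
             * 32-byte registration this tells the field apart from padding. */
            return getauxval(AT_RSEQ_FEATURE_SIZE) >= SLICE_CTRL_END;
    }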
> Also, perhaps make the prctl() enable call return error, if structure size
> does not match?
That's not how the extensible scheme works.
Either glibc registers a 32-byte area (in which the time slice feature would
fit), or it registers an area large enough to fit all kernel supported features,
or it fails registration. And prctl() is per-process, whereas the rseq registration
is per-thread, so it's kind of weird to make prctl() fail if the current
thread's rseq is not registered.
>
> With regards to application determining the address and size of rseq structure
> registered by libc, what are you thoughts on getting that thru the rseq(2)
>> system call or a prctl() call instead of dealing with the __weak symbols as was discussed here.
>
> https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
I think that the other leg of that email thread got to a resolution of both static and
dynamic use-cases through use of an extern __weak symbol, no [1] ? Not that I am against
adding a rseq(2) query for rseq address, size, and signature, but I just want to double
check that it would be there for convenience and is not actually needed in the typical
use-cases.
Thanks,
Mathieu
[1] https://lore.kernel.org/all/aKPFIQwg5zxSS5oS@google.com/
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-22 5:57 ` K Prateek Nayak
@ 2025-09-22 13:57 ` Mathieu Desnoyers
0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 13:57 UTC (permalink / raw)
To: K Prateek Nayak, Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On 2025-09-22 01:57, K Prateek Nayak wrote:
> Hello Prakash,
>
> On 9/22/2025 10:58 AM, Prakash Sangappa wrote:
>> With use of a new structure member for slice control, could there be discrepancy
>> with rseq structure size(older version) registered by libc? In that case the application
>> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
>
> In this case, wouldn't GLIBC's rseq registration fail if presumed
> __rseq_size is smaller than the "struct rseq" size?
The registered rseq size cannot be smaller than 32 bytes, else
registration is refused by the system call (-EINVAL).
The new slice extension fields would fit within those 32 bytes,
so it should always work.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-19 17:30 ` Prakash Sangappa
@ 2025-09-22 14:09 ` Mathieu Desnoyers
2025-09-23 1:01 ` Prakash Sangappa
0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 14:09 UTC (permalink / raw)
To: Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zilstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org, Florian Weimer, carlos@redhat.com,
libc-coord@lists.openwall.com
On 2025-09-19 13:30, Prakash Sangappa wrote:
>
>
>> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>> has nesting ability.
>>>>>
>>>>> We should consider making the slice request ABI a 8-bit
>>>>> or 16-bit nesting counter to allow nesting of its users.
>>>>
>>>> Making request a counter requires to keep request set when the
>>>> extension is granted. So the states would be:
>>>>
>>>>    request   granted
>>>>       0         0      Neutral
>>>>      >0         0      Requested
>>>>     >=0         1      Granted
>>>
>>
>> Second thoughts on this.
>>
[...]
>
>>
>> If user space wants nesting, then it can do so on its own without
>> creating an ill defined and fragile kernel/user ABI. We created enough
>> of them in the past and all of them resulted in long term headaches.
>
> Guess user space should be able to handle nesting, possibly without the need of a counter?
>
> AFAICS can’t the nested request, to extend the slice, be handled by checking
> if both ‘REQUEST’ & ‘GRANTED’ bits are zero? If so, attempt to request
> slice extension. Otherwise If either REQUEST or GRANTED bit Is set, then a slice
> extension has been already requested or granted.
I think you are onto something here. If we want independent pieces of
software (e.g. libc and application) to allow nesting of time slice
extension requests, without having to deal with a counter and the
inevitable unbalance bugs (leak and underflow), we could require
userspace to check the value of the request and granted flags. If both
are zero, then it can set the request.
Then when userspace exits its critical section, it needs to remember
whether it has set a request or not, so it does not clear a request
too early if the request was set by an outer context. This requires
handing over additional state (one bit) from "lock" to "unlock" though.
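A sketch of that hand-over; the accessors for the two ABI bits are
hypothetical helpers, not part of the proposed kernel interface:

    #include <stdbool.h>

    /* Hypothetical accessors for the REQUEST/GRANTED bits in the slice control word */
    extern bool slice_request_or_granted(void);    /* either bit set? */
    extern void slice_set_request(void);
    extern void slice_clear_request_and_maybe_yield(void);

    /* Returns the one bit of state the caller has to hand over to the unlock path */
    static inline bool slice_nest_enter(void)
    {
            if (slice_request_or_granted())
                    return false;   /* an outer context already owns the request */
            slice_set_request();
            return true;            /* this context set it and must clear it */
    }

    static inline void slice_nest_exit(bool owner)
    {
            if (owner)
                    slice_clear_request_and_maybe_yield();
    }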
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
2025-09-22 13:55 ` Mathieu Desnoyers
@ 2025-09-23 0:57 ` Prakash Sangappa
0 siblings, 0 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-23 0:57 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Michael Jeanson
> On Sep 22, 2025, at 6:55 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-09-22 01:28, Prakash Sangappa wrote:
>>> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>
>> ..
>>> +enum rseq_slice_masks {
>>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>>> };
>>>
>>> /*
>>> @@ -142,6 +164,12 @@ struct rseq {
>>> __u32 mm_cid;
>>>
>>> /*
>>> + * Time slice extension control word. CPU local atomic updates from
>>> + * kernel and user space.
>>> + */
>>> + __u32 slice_ctrl;
>> We intend to backport the slice extension feature to older kernel versions.
>> With use of a new structure member for slice control, could there be discrepancy
>> with rseq structure size(older version) registered by libc? In that case the application
>> may not be able to use slice extension feature unless Libc’s use of rseq is disabled.
>
> The rseq extension scheme allows this to seamlessly work.
>
> You will need a glibc 2.41+, which uses the getauxval(3)
> AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN to query the feature size
> supported by the Linux kernel. It allocates a per-thread memory
> area which is large enough to support that feature set, and
> registers it to the kernel through rseq(2) on thread creation.
Ok,
>
> Note that before we had the extensible rseq scheme, glibc registered
> a 32-byte structure (including padding at the end), which is considered
> as the rseq "original" registration size.
>
> The "mm_cid" field ends at 28 bytes, which leaves 4 bytes of padding at
> the end of the original rseq structure. Considering that the time slice
> extension fields will likely fit within those 4 bytes, I expect that
> applications linked against glibc [2.35, 2.40] will also be able to use
> those fields. Those applications should use getauxval(3)
> AT_RSEQ_FEATURE_SIZE to validate whether the kernel populates this field
> or if it's just padding.
The question was about the size of the rseq structure registered by glibc. If it is using
AT_RSEQ_FEATURE_SIZE to allocate the per-thread area for rseq, I suppose that
should be fine. However, the application would have to verify that __rseq_size is large
enough.
As for the kernel supporting slice extension, I expect the prctl(.., PR_RSEQ_SLICE_EXT_ENABLE)
call would return an error if it is not supported; won’t that be sufficient, or should it check
AT_RSEQ_FEATURE_SIZE?
>
> Note that this all works even if you backport the feature to an older kernel:
> the rseq extension scheme does not depend on querying the kernel version at
> all. You will however be required to backport the support for additional
> rseq fields that come before the time slice, such as node_id and mm_cid,
> if they are not implemented in your older kernel.
Yes, we need to look at the changes that need to be backported. Also, the dependent
'rseq: Optimize exit to user space' changes from the other patch series.
>
>> Application would have to verify structure size, so should it be mentioned in the
>> documentation.
>
> Yes, applications should check that the glibc's __rseq_size is large enough to fit
> the new slice field(s), *and* for the original rseq size special case
> (32 bytes including padding), those would need to query getauxval(3)
> AT_RSEQ_FEATURE_SIZE to make sure the field is indeed supported.
>
>> Also, perhaps make the prctl() enable call return error, if structure size
>> does not match?
>
> That's not how the extensible scheme works.
>
> Either glibc registers a 32-byte area (in which the time slice feature would
> fit), or it registers an area large enough to fit all kernel supported features,
> or it fails registration. And prctl() is per-process, whereas the rseq registration
> is per-thread, so it's kind of weird to make prctl() fail if the current
> thread's rseq is not registered.
I meant the prctl(.., PR_RSEQ_SLICE_EXT_ENABLE) call is per thread and
sets the enabled bit in the per-thread rseq. This could fail if the rseq struct size is not large enough?
>
>> With regards to application determining the address and size of rseq structure
>> registered by libc, what are you thoughts on getting that thru the rseq(2)
>> system call or a prctl() call instead of dealing with the __weak symbols as was discussed here.
>> https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
>
> I think that the other leg of that email thread got to a resolution of both static and
> dynamic use-cases through use of an extern __weak symbol, no [1] ? Not that I am against
> adding a rseq(2) query for rseq address, size, and signature, but I just want to double
> check that it would be there for convenience and is not actually needed in the typical
> use-cases.
Yes, mainly for convenience.
Thanks,
-Prakash
>
> Thanks,
>
> Mathieu
>
> [1] https://lore.kernel.org/all/aKPFIQwg5zxSS5oS@google.com/
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
* Re: [patch 00/12] rseq: Implement time slice extension mechanism
2025-09-22 14:09 ` Mathieu Desnoyers
@ 2025-09-23 1:01 ` Prakash Sangappa
0 siblings, 0 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-23 1:01 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, LKML, Peter Zilstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Florian Weimer,
carlos@redhat.com, libc-coord@lists.openwall.com
> On Sep 22, 2025, at 7:09 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-09-19 13:30, Prakash Sangappa wrote:
>>> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>
>>> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>>>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>>> has nesting ability.
>>>>>>
>>>>>> We should consider making the slice request ABI a 8-bit
>>>>>> or 16-bit nesting counter to allow nesting of its users.
>>>>>
>>>>> Making request a counter requires to keep request set when the
>>>>> extension is granted. So the states would be:
>>>>>
>>>>>    request   granted
>>>>>       0         0      Neutral
>>>>>      >0         0      Requested
>>>>>     >=0         1      Granted
>>>>
>>>
>>> Second thoughts on this.
>>>
> [...]
>>>
>>> If user space wants nesting, then it can do so on its own without
>>> creating an ill defined and fragile kernel/user ABI. We created enough
>>> of them in the past and all of them resulted in long term headaches.
>> Guess user space should be able to handle nesting, possibly without the need of a counter?
>> AFAICS can’t the nested request, to extend the slice, be handled by checking
>> if both ‘REQUEST’ & ‘GRANTED’ bits are zero? If so, attempt to request
>> slice extension. Otherwise If either REQUEST or GRANTED bit Is set, then a slice
>> extension has been already requested or granted.
>
> I think you are onto something here. If we want independent pieces of
> software (e.g. libc and application) to allow nesting of time slice
> extension requests, without having to deal with a counter and the
> inevitable unbalance bugs (leak and underflow), we could require
> userspace to check the value of the request and granted flags. If both
> are zero, then it can set the request.
>
> Then when userspace exits its critical section, it needs to remember
> whether it has set a request or not, so it does not clear a request
> too early if the request was set by an outer context. This requires
> handing over additional state (one bit) from "lock" to "unlock" though.
Yes, that is correct. Additional state will be required to track whether a slice extension
was requested in that context.
-Prakash
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
end of thread (newest: 2025-09-23 1:02 UTC)
Thread overview: 54+ messages
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-09-09 0:04 ` Randy Dunlap
2025-09-11 15:41 ` Mathieu Desnoyers
2025-09-11 15:49 ` Mathieu Desnoyers
2025-09-22 5:28 ` Prakash Sangappa
2025-09-22 5:57 ` K Prateek Nayak
2025-09-22 13:57 ` Mathieu Desnoyers
2025-09-22 13:55 ` Mathieu Desnoyers
2025-09-23 0:57 ` Prakash Sangappa
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-09-09 3:10 ` K Prateek Nayak
2025-09-09 4:11 ` Randy Dunlap
2025-09-09 12:12 ` Thomas Gleixner
2025-09-09 16:01 ` Randy Dunlap
2025-09-11 15:42 ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
2025-09-11 15:43 ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
2025-09-11 15:50 ` Mathieu Desnoyers
2025-09-11 16:52 ` K Prateek Nayak
2025-09-11 17:18 ` Mathieu Desnoyers
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-09-09 9:52 ` K Prateek Nayak
2025-09-09 12:23 ` Thomas Gleixner
2025-09-10 11:15 ` K Prateek Nayak
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-09-10 5:22 ` K Prateek Nayak
2025-09-10 7:49 ` Thomas Gleixner
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-09-10 11:20 ` K Prateek Nayak
2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-09-09 8:14 ` K Prateek Nayak
2025-09-09 12:16 ` Thomas Gleixner
2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-09-10 11:23 ` K Prateek Nayak
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-10 4:42 ` K Prateek Nayak
2025-09-10 11:28 ` K Prateek Nayak
2025-09-10 14:50 ` Thomas Gleixner
2025-09-11 3:03 ` K Prateek Nayak
2025-09-11 7:36 ` Prakash Sangappa
2025-09-11 15:27 ` Mathieu Desnoyers
2025-09-11 20:18 ` Thomas Gleixner
2025-09-12 12:33 ` Mathieu Desnoyers
2025-09-12 16:31 ` Thomas Gleixner
2025-09-12 19:26 ` Mathieu Desnoyers
2025-09-13 13:02 ` Thomas Gleixner
2025-09-19 17:30 ` Prakash Sangappa
2025-09-22 14:09 ` Mathieu Desnoyers
2025-09-23 1:01 ` Prakash Sangappa