Linux-arch Archive mirror
* [patch 00/12] rseq: Implement time slice extension mechanism
@ 2025-09-08 22:59 Thomas Gleixner
  2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
                   ` (14 more replies)
  0 siblings, 15 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

This is the proper implementation of the PoC code, which I posted in reply
to the latest iteration of Prakash's time slice extension patches:

     https://lore.kernel.org/all/87o6smb3a0.ffs@tglx

Time slice extensions are an attempt to provide opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.

The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. In the
worst case that prevents those threads from making progress for at least a
full time slice. This is especially painful with user space spinlocks,
which are a patently bad idea to begin with, but it also affects other
mechanisms.

Attempts to solve this go back at least a decade, but so far none of them
went anywhere. The recent attempts, which started to integrate with the
already existing RSEQ mechanism, have at least been going in the right
direction. The full history is partially in the above mentioned mail thread
and its ancestors, but also in various threads in the LKML archives, which
require archaeological efforts to retrieve.

When trying to morph the PoC into actual mergeable code, I stumbled over
various shortcomings in the RSEQ code, which have been addressed in a
separate effort. The latest iteration can be found here:

     https://lore.kernel.org/all/20250908212737.353775467@linutronix.de

That is a prerequisite for this series as it allows a tight integration
into the RSEQ code without inflicting a lot of extra overhead on the hot
paths.

The main change vs. the PoC and the previous attempts is that it utilizes a
new field in the user space ABI rseq struct, which allows reducing the
atomic operations in user space to a bare minimum. If the architecture
supports CPU local atomics, which protect against the obvious RMW race
vs. an interrupt, then no actual overhead, e.g. a LOCK prefix on x86, is
required.

The kernel user space ABI consists only of two bits in this new field:

	REQUEST and GRANTED

User space sets REQUEST at the beginning of the critical section. If it
finishes the critical section without interruption then it can clear the
bit and move on.

If it is interrupted and the interrupt return path in the kernel observes a
rescheduling request, then the kernel can grant a time slice extension. The
kernel clears the REQUEST bit and sets the GRANTED bit with a simple
non-atomic store operation. If it does not grant the extension, only the
REQUEST bit is cleared.

If user space observes the REQUEST bit cleared when it finishes the
critical section, then it has to check the GRANTED bit. If that is set,
it has to invoke the rseq_slice_yield() syscall to terminate the
extension and yield the CPU.

The code flow in user space is:

    // Simple store as there is no concurrency vs. the GRANTED bit
    rseq->slice_ctrl = REQUEST;

    critical_section();

    // CPU local atomic required here:
    if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
            // Non-atomic check is sufficient as this can race
            // against an interrupt, which revokes the grant.
            //
            // If not set, then the request was either cleared by the
            // kernel without a grant or the grant was revoked.
            //
            // If set, tell the kernel that the critical section is done
            // so it can reschedule.
            if (rseq->slice_ctrl & GRANTED)
                    rseq_slice_yield();
    }
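
For illustration only (not part of the patches), the above maps to user
space C code roughly like the sketch below. The REQUEST/GRANTED constants
and the slice_ctrl field come from the patched uapi header (patch 2), the
syscall number from the tables added in patch 6, and the pointer to
slice_ctrl has to be derived from the thread's registered struct rseq
(e.g. via glibc's __rseq_offset). The portable __atomic_fetch_and() stands
in for the CPU local test_and_clear_bit(); it uses a locked RMW, which an
architecture specific local atomic (BTRL without LOCK on x86) would avoid:

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Bits as defined by the patched <linux/rseq.h> */
    #define RSEQ_SLICE_EXT_REQUEST	(1U << 0)
    #define RSEQ_SLICE_EXT_GRANTED	(1U << 1)

    #ifndef __NR_rseq_slice_yield
    # define __NR_rseq_slice_yield	470	/* from the tables in patch 6 */
    #endif

    /*
     * Portable stand-in for the CPU local test_and_clear_bit(). Uses a
     * locked RMW; a plain BTRL is sufficient on x86 as only the local
     * CPU modifies the word.
     */
    static inline int slice_test_and_clear(volatile uint32_t *ctrl, uint32_t mask)
    {
            return __atomic_fetch_and(ctrl, ~mask, __ATOMIC_RELAXED) & mask;
    }

    /* @ctrl points to the slice_ctrl field of the registered struct rseq */
    static void run_critical_section(volatile uint32_t *ctrl, void (*cs)(void))
    {
            /* Simple store, no concurrency vs. GRANTED at this point */
            *ctrl = RSEQ_SLICE_EXT_REQUEST;

            cs();

            if (!slice_test_and_clear(ctrl, RSEQ_SLICE_EXT_REQUEST)) {
                    /* Kernel cleared REQUEST: plain clear or grant */
                    if (*ctrl & RSEQ_SLICE_EXT_GRANTED)
                            syscall(__NR_rseq_slice_yield);
            }
    }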

The other details, which differ from earlier attempts and the PoC, are:

    - A separate syscall for terminating the extension to avoid side
      effects and overloading of the already ill-defined sched_yield(2)

    - A separate per CPU timer, which again does not inflict side effects
      on the scheduler internal hrtick timer. The hrtick timer can be
      disabled at run-time and an expiry can cause interesting problems in
      the scheduler code when it is unexpectedly invoked.

    - Tight integration into the rseq exit to user mode code. It utilizes
      the path when TIF_RSEQ is not set at the end of exit_to_user_mode()
      to arm the timer if an extension was granted. TIF_RSEQ indicates that
      the task was scheduled and therefore would revoke the grant anyway.

    - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
      model which is utilized by PREEMPT_RT.

      It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
      but not TIF_PREEMPT.

      Pretending that this can be made to work for TIF_PREEMPT on a fully
      preemptible kernel is just wishful thinking as the chance that
      TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
      obvious reasons.

      This only "works" by some definition of works, i.e. on a best effort
      basis, for the PREEMPT_NONE model and nothing else. Though given the
      problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
      running code sections, the days of these models should be hopefully
      numbered and everything consolidated on the LAZY model.

      That makes this distinction moot and everything restricted to
      TIF_PREEMPT_LAZY unless someone is crazy enough to inflict the slice
      extension mechanism into the scheduler hotpath. I'm sure there will
      be attempts to do that as there is no lack of crazy folks out
      there...

    - Actual documentation of the user space ABI and an initial selftest.

The RSEQ modifications on which this series is based can be found here:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Thanks,

	tglx
---
 Documentation/userspace-api/index.rst       |    1 
 Documentation/userspace-api/rseq.rst        |  129 ++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/tools/syscall_32.tbl             |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/s390/mm/pfault.c                       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 include/linux/entry-common.h                |    2 
 include/linux/rseq.h                        |   11 +
 include/linux/rseq_entry.h                  |  176 ++++++++++++++++
 include/linux/rseq_types.h                  |   28 ++
 include/linux/sched.h                       |    7 
 include/linux/syscalls.h                    |    1 
 include/linux/thread_info.h                 |   16 -
 include/uapi/asm-generic/unistd.h           |    5 
 include/uapi/linux/prctl.h                  |   10 
 include/uapi/linux/rseq.h                   |   28 ++
 init/Kconfig                                |   12 +
 kernel/entry/common.c                       |   14 +
 kernel/entry/syscall-common.c               |   11 -
 kernel/rcu/tiny.c                           |    8 
 kernel/rcu/tree.c                           |   14 -
 kernel/rcu/tree_exp.h                       |    3 
 kernel/rcu/tree_plugin.h                    |    9 
 kernel/rcu/tree_stall.h                     |    3 
 kernel/rseq.c                               |  293 ++++++++++++++++++++++++++++
 kernel/sys.c                                |    6 
 kernel/sys_ni.c                             |    1 
 scripts/syscall.tbl                         |    1 
 tools/testing/selftests/rseq/.gitignore     |    1 
 tools/testing/selftests/rseq/Makefile       |    5 
 tools/testing/selftests/rseq/rseq-abi.h     |    2 
 tools/testing/selftests/rseq/slice_test.c   |  217 ++++++++++++++++++++
 45 files changed, 991 insertions(+), 42 deletions(-)




* [patch 01/12] sched: Provide and use set_need_resched_current()
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
  2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

set_tsk_need_resched(current) requires set_preempt_need_resched() to work
correctly outside of the scheduler.

Provide set_need_resched_current() which wraps this correctly and replace
all the open coded instances.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/s390/mm/pfault.c    |    3 +--
 include/linux/sched.h    |    7 +++++++
 kernel/rcu/tiny.c        |    8 +++-----
 kernel/rcu/tree.c        |   14 +++++---------
 kernel/rcu/tree_exp.h    |    3 +--
 kernel/rcu/tree_plugin.h |    9 +++------
 kernel/rcu/tree_stall.h  |    3 +--
 7 files changed, 21 insertions(+), 26 deletions(-)

--- a/arch/s390/mm/pfault.c
+++ b/arch/s390/mm/pfault.c
@@ -199,8 +199,7 @@ static void pfault_interrupt(struct ext_
 			 * return to userspace schedule() to block.
 			 */
 			__set_current_state(TASK_UNINTERRUPTIBLE);
-			set_tsk_need_resched(tsk);
-			set_preempt_need_resched();
+			set_need_resched_current();
 		}
 	}
 out:
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2034,6 +2034,13 @@ static inline int test_tsk_need_resched(
 	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
 }
 
+static inline void set_need_resched_current(void)
+{
+	lockdep_assert_irqs_disabled();
+	set_tsk_need_resched(current);
+	set_preempt_need_resched();
+}
+
 /*
  * cond_resched() and cond_resched_lock(): latency reduction via
  * explicit rescheduling in places that are safe. The return
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -70,12 +70,10 @@ void rcu_qs(void)
  */
 void rcu_sched_clock_irq(int user)
 {
-	if (user) {
+	if (user)
 		rcu_qs();
-	} else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail) {
-		set_tsk_need_resched(current);
-		set_preempt_need_resched();
-	}
+	else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail)
+		set_need_resched_current();
 }
 
 /*
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2696,10 +2696,8 @@ void rcu_sched_clock_irq(int user)
 	/* The load-acquire pairs with the store-release setting to true. */
 	if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
 		/* Idle and userspace execution already are quiescent states. */
-		if (!rcu_is_cpu_rrupt_from_idle() && !user) {
-			set_tsk_need_resched(current);
-			set_preempt_need_resched();
-		}
+		if (!rcu_is_cpu_rrupt_from_idle() && !user)
+			set_need_resched_current();
 		__this_cpu_write(rcu_data.rcu_urgent_qs, false);
 	}
 	rcu_flavor_sched_clock_irq(user);
@@ -2824,7 +2822,6 @@ static void strict_work_handler(struct w
 /* Perform RCU core processing work for the current CPU.  */
 static __latent_entropy void rcu_core(void)
 {
-	unsigned long flags;
 	struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp = rdp->mynode;
 
@@ -2837,8 +2834,8 @@ static __latent_entropy void rcu_core(vo
 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK))) {
 		rcu_preempt_deferred_qs(current);
 	} else if (rcu_preempt_need_deferred_qs(current)) {
-		set_tsk_need_resched(current);
-		set_preempt_need_resched();
+		guard(irqsave)();
+		set_need_resched_current();
 	}
 
 	/* Update RCU state based on any recent quiescent states. */
@@ -2847,10 +2844,9 @@ static __latent_entropy void rcu_core(vo
 	/* No grace period and unregistered callbacks? */
 	if (!rcu_gp_in_progress() &&
 	    rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp)) {
-		local_irq_save(flags);
+		guard(irqsave)();
 		if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
 			rcu_accelerate_cbs_unlocked(rnp, rdp);
-		local_irq_restore(flags);
 	}
 
 	rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -729,8 +729,7 @@ static void rcu_exp_need_qs(void)
 	__this_cpu_write(rcu_data.cpu_no_qs.b.exp, true);
 	/* Store .exp before .rcu_urgent_qs. */
 	smp_store_release(this_cpu_ptr(&rcu_data.rcu_urgent_qs), true);
-	set_tsk_need_resched(current);
-	set_preempt_need_resched();
+	set_need_resched_current();
 }
 
 #ifdef CONFIG_PREEMPT_RCU
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -756,8 +756,7 @@ static void rcu_read_unlock_special(stru
 			// Also if no expediting and no possible deboosting,
 			// slow is OK.  Plus nohz_full CPUs eventually get
 			// tick enabled.
-			set_tsk_need_resched(current);
-			set_preempt_need_resched();
+			set_need_resched_current();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
 			    needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
 			    cpu_online(rdp->cpu)) {
@@ -818,10 +817,8 @@ static void rcu_flavor_sched_clock_irq(i
 	if (rcu_preempt_depth() > 0 ||
 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 		/* No QS, force context switch if deferred. */
-		if (rcu_preempt_need_deferred_qs(t)) {
-			set_tsk_need_resched(t);
-			set_preempt_need_resched();
-		}
+		if (rcu_preempt_need_deferred_qs(t))
+			set_need_resched_current();
 	} else if (rcu_preempt_need_deferred_qs(t)) {
 		rcu_preempt_deferred_qs(t); /* Report deferred QS. */
 		return;
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -763,8 +763,7 @@ static void print_cpu_stall(unsigned lon
 	 * progress and it could be we're stuck in kernel space without context
 	 * switches for an entirely unreasonable amount of time.
 	 */
-	set_tsk_need_resched(current);
-	set_preempt_need_resched();
+	set_need_resched_current();
 }
 
 static bool csd_lock_suppress_rcu_stall;



* [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
  2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
  2025-09-09  0:04   ` Randy Dunlap
                     ` (2 more replies)
  2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
                   ` (12 subsequent siblings)
  14 siblings, 3 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Peter Zilstra, Arnd Bergmann, linux-arch

Aside from a Kconfig knob, add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - An rseq state struct to hold the kernel state of this mechanism

   - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 Documentation/userspace-api/index.rst |    1 
 Documentation/userspace-api/rseq.rst  |  129 ++++++++++++++++++++++++++++++++++
 include/linux/rseq_types.h            |   26 ++++++
 include/uapi/linux/rseq.h             |   28 +++++++
 init/Kconfig                          |   12 +++
 kernel/rseq.c                         |    8 ++
 6 files changed, 204 insertions(+)

--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,129 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and user-space for three purposes:
+
+ * user-space restartable sequences
+
+ * quick access to read the current CPU number, node ID from user-space
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow user-space to perform update operations on
+per-cpu data without requiring heavy-weight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * An rseq user space pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success and otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL	  Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO	  Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL	  Functionality not available or invalid function arguments.
+          Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
+rseq slice_ctrl field. If the thread is interrupted and the interrupt
+results in a reschedule request in the kernel, then the kernel can grant a
+time slice extension and return to user space instead of scheduling
+out.
+
+The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
+and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
+field. If there is a reschedule of the thread after granting the extension,
+the kernel clears the granted bit to indicate that to user space.
+
+If the request bit is still set when leaving the critical section, user
+space can clear it and continue.
+
+If the granted bit is set, then user space has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving user space from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by user space.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl = REQUEST;
+    critical_section();
+    if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
+        if (rseq->slice_ctrl & GRANTED)
+                rseq_slice_yield();
+    }
+
+local_test_and_clear_bit() has to be a CPU local atomic to prevent the
+obvious RMW race versus an interrupt. On x86 this can be achieved with BTRL
+without a LOCK prefix. On architectures which do not provide lightweight
+CPU local atomics, this needs to be implemented with regular atomic operations.
+
+Setting REQUEST has no atomicity requirements as there is no concurrency
+vs. the GRANTED bit.
+
+Checking the GRANTED bit has no atomicity requirements as there is obviously a
+race which cannot be avoided at all::
+
+    if (rseq->slice_ctrl & GRANTED)
+      -> Interrupt results in schedule and grant revocation
+        rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -71,12 +71,35 @@ struct rseq_ids {
 };
 
 /**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state:	Compound to access the overall state
+ * @enabled:	Time slice extension is enabled for the task
+ * @granted:	Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+	u16			state;
+	struct {
+		u8		enabled;
+		u8		granted;
+	};
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state:	Time slice extension state
+ */
+struct rseq_slice {
+	union rseq_slice_state	state;
+};
+
+/**
  * struct rseq_data - Storage for all rseq related data
  * @usrptr:	Pointer to the registered user space RSEQ memory
  * @len:	Length of the RSEQ region
  * @sig:	Signature of critial section abort IPs
  * @event:	Storage for event management
  * @ids:	Storage for cached CPU ID and MM CID
+ * @slice:	Storage for time slice extension data
  */
 struct rseq_data {
 	struct rseq __user		*usrptr;
@@ -84,6 +107,9 @@ struct rseq_data {
 	u32				sig;
 	struct rseq_event		event;
 	struct rseq_ids			ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+	struct rseq_slice		slice;
+#endif
 };
 
 #else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
 };
 
 enum rseq_cs_flags_bit {
+	/* Historical and unsupported bits */
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	/* (3) Intentional gap to put new bits into a separate byte */
+
+	/* User read only feature flags */
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +41,22 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED		=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
+};
+
+enum rseq_slice_bits {
+	/* Time slice extension ABI bits */
+	RSEQ_SLICE_EXT_REQUEST_BIT		= 0,
+	RSEQ_SLICE_EXT_GRANTED_BIT		= 1,
+};
+
+enum rseq_slice_masks {
+	RSEQ_SLICE_EXT_REQUEST	= (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
+	RSEQ_SLICE_EXT_GRANTED	= (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
 };
 
 /*
@@ -142,6 +164,12 @@ struct rseq {
 	__u32 mm_cid;
 
 	/*
+	 * Time slice extension control word. CPU local atomic updates from
+	 * kernel and user space.
+	 */
+	__u32 slice_ctrl;
+
+	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
 	char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
 
 	  If unsure, say N.
 
+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, that allows completing a critical section,
+	  so that other threads are not stuck on a contended resource
+	  while the task is scheduled out.
+
+	  If unsure, say N.
+
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
  */
 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
+	u32 rseqfl = 0;
+
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
@@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
 		return -EFAULT;
 
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
+	if (put_user_masked_u32(rseqfl, &rseq->flags))
+		return -EFAULT;
+
 	/*
 	 * Activate the registration by setting the rseq area address, length
 	 * and signature in the task struct.
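
As a usage illustration (not part of the patch): with the read-only feature
flags set at registration time above, user space can detect availability
and enablement without a syscall by inspecting the flags field of its
registered struct rseq. A minimal sketch, assuming the patched
<linux/rseq.h> provides the new flag constants, glibc >= 2.35 exports
__rseq_offset/__rseq_size for the registered area, and the compiler
provides __builtin_thread_pointer():

    #include <stdbool.h>
    #include <stddef.h>
    #include <linux/rseq.h>

    extern ptrdiff_t __rseq_offset;	/* offset from the thread pointer */
    extern unsigned int __rseq_size;	/* 0 if no rseq area is registered */

    static inline struct rseq *rseq_area(void)
    {
            return (struct rseq *)((char *)__builtin_thread_pointer() +
                                   __rseq_offset);
    }

    static bool slice_ext_available(void)
    {
            return __rseq_size &&
                   (rseq_area()->flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE);
    }

    static bool slice_ext_enabled(void)
    {
            return __rseq_size &&
                   (rseq_area()->flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED);
    }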



* [patch 03/12] rseq: Provide static branch for time slice extensions
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
  2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
  2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
  2025-09-09  3:10   ` K Prateek Nayak
  2025-09-11 15:42   ` Mathieu Desnoyers
  2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq_entry.h |   11 +++++++++++
 kernel/rseq.c              |   17 +++++++++++++++++
 2 files changed, 28 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -77,6 +77,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
 #define rseq_inline __always_inline
 #endif
 
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+	return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
 bool rseq_debug_validate_ids(struct task_struct *t);
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -474,3 +474,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 
 	return 0;
 }
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+	bool on;
+
+	if (kstrtobool(str, &on))
+		return -EINVAL;
+
+	if (!on)
+		static_branch_disable(&rseq_slice_extension_key);
+	return 0;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */



* [patch 04/12] rseq: Add statistics for time slice extensions
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (2 preceding siblings ...)
  2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
  2025-09-11 15:43   ` Mathieu Desnoyers
  2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Extend the quick statistics with time slice specific fields.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |    4 ++++
 kernel/rseq.c              |   12 ++++++++++++
 2 files changed, 16 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,10 @@ struct rseq_stats {
 	unsigned long	cs;
 	unsigned long	clear;
 	unsigned long	fixup;
+	unsigned long	s_granted;
+	unsigned long	s_expired;
+	unsigned long	s_revoked;
+	unsigned long	s_yielded;
 };
 
 DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi
 		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
 		stats.fixup	+= data_race(per_cpu(rseq_stats.fixup, cpu));
+		if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+			stats.s_granted	+= data_race(per_cpu(rseq_stats.s_granted, cpu));
+			stats.s_expired	+= data_race(per_cpu(rseq_stats.s_expired, cpu));
+			stats.s_revoked	+= data_race(per_cpu(rseq_stats.s_revoked, cpu));
+			stats.s_yielded	+= data_race(per_cpu(rseq_stats.s_yielded, cpu));
+		}
 	}
 
 	seq_printf(m, "exit:   %16lu\n", stats.exit);
@@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);
 	seq_printf(m, "fixup:  %16lu\n", stats.fixup);
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+		seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+		seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+		seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+		seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+	}
 	return 0;
 }
 



* [patch 05/12] rseq: Add prctl() to enable time slice extensions
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (3 preceding siblings ...)
  2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-09-08 22:59 ` Thomas Gleixner
  2025-09-11 15:50   ` Mathieu Desnoyers
  2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 22:59 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails when time slice extensions are disabled at compile
time or on the kernel command line, or when no rseq pointer is registered
in the kernel.

That allows implementing a single trivial check in the exit to user mode
hotpath to decide whether the whole mechanism needs to be invoked.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq.h       |    9 +++++++
 include/uapi/linux/prctl.h |   10 ++++++++
 kernel/rseq.c              |   52 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |    6 +++++
 4 files changed, 77 insertions(+)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -190,4 +190,13 @@ void rseq_syscall(struct pt_regs *regs);
 static inline void rseq_syscall(struct pt_regs *regs) { }
 #endif /* !CONFIG_DEBUG_RSEQ */
 
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	return -EINVAL;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
 #endif /* _LINUX_RSEQ_H */
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -376,4 +376,14 @@ struct prctl_mm_map {
 # define PR_FUTEX_HASH_SET_SLOTS	1
 # define PR_FUTEX_HASH_GET_SLOTS	2
 
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION			79
+# define PR_RSEQ_SLICE_EXTENSION_GET		1
+# define PR_RSEQ_SLICE_EXTENSION_SET		2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE:	Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE		0x01
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
 #define RSEQ_BUILD_SLOW_PATH
 
 #include <linux/debugfs.h>
+#include <linux/prctl.h>
 #include <linux/ratelimit.h>
 #include <linux/rseq_entry.h>
 #include <linux/sched.h>
@@ -490,6 +491,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	switch (arg2) {
+	case PR_RSEQ_SLICE_EXTENSION_GET:
+		if (arg3)
+			return -EINVAL;
+		return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+	case PR_RSEQ_SLICE_EXTENSION_SET: {
+		u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+		bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+		if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+			return -EINVAL;
+		if (!rseq_slice_extension_enabled())
+			return -ENOTSUPP;
+		if (!current->rseq.usrptr)
+			return -ENXIO;
+
+		/* No change? */
+		if (enable == !!current->rseq.slice.state.enabled)
+			return 0;
+
+		if (get_user(rflags, &current->rseq.usrptr->flags))
+			goto die;
+
+		if (current->rseq.slice.state.enabled)
+			valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+		if ((rflags & valid) != valid)
+			goto die;
+
+		rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+		rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+		if (enable)
+			rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+		if (put_user(rflags, &current->rseq.usrptr->flags))
+			goto die;
+
+		current->rseq.slice.state.enabled = enable;
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
+die:
+	force_sig(SIGSEGV);
+	return -EFAULT;
+}
+
 static int __init rseq_slice_cmdline(char *str)
 {
 	bool on;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
 #include <linux/time_namespace.h>
 #include <linux/binfmts.h>
 #include <linux/futex.h>
+#include <linux/rseq.h>
 
 #include <linux/sched.h>
 #include <linux/sched/autogroup.h>
@@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 	case PR_FUTEX_HASH:
 		error = futex_hash_prctl(arg2, arg3, arg4);
 		break;
+	case PR_RSEQ_SLICE_EXTENSION:
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = rseq_slice_extension_prctl(arg2, arg3);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
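
As a usage illustration (not part of the patch), a thread which already has
an rseq area registered can enable and query the mechanism with the
constants added to <linux/prctl.h> above. A minimal sketch:

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>	/* patched header with PR_RSEQ_SLICE_EXTENSION */

    static int enable_slice_extension(void)
    {
            int ret = prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                            PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

            if (ret) {
                    /*
                     * EINVAL:   not compiled in or bad arguments
                     * ENOTSUPP: disabled on the kernel command line
                     * ENXIO:    no rseq area registered for this thread
                     */
                    perror("PR_RSEQ_SLICE_EXTENSION_SET");
                    return ret;
            }

            /* Query it back: returns PR_RSEQ_SLICE_EXT_ENABLE when enabled */
            return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
                         0, 0, 0) == PR_RSEQ_SLICE_EXT_ENABLE ? 0 : -1;
    }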



* [patch 06/12] rseq: Implement sys_rseq_slice_yield()
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (4 preceding siblings ...)
  2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-09  9:52   ` K Prateek Nayak
  2025-09-10 11:15   ` K Prateek Nayak
  2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Peter Zijlstra,
	Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior

Provide a new syscall whose only purpose is to yield the CPU after the
kernel granted a time slice extension.

sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension does not require
scheduling when the task was already preempted. This also allows a strict
check for termination to catch user space invoking random syscalls,
including sched_yield(), from a time slice extension region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
---
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/tools/syscall_32.tbl             |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/unistd.h           |    5 ++++-
 kernel/rseq.c                               |    9 +++++++++
 kernel/sys_ni.c                             |    1 +
 scripts/syscall.tbl                         |    1 +
 21 files changed, 32 insertions(+), 1 deletion(-)

--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
 577	common	open_tree_attr			sys_open_tree_attr
 578	common	file_getattr			sys_file_getattr
 579	common	file_setattr			sys_file_setattr
+580	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -484,3 +484,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -481,3 +481,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -408,3 +408,4 @@
 467	n32	open_tree_attr			sys_open_tree_attr
 468	n32	file_getattr			sys_file_getattr
 469	n32	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -384,3 +384,4 @@
 467	n64	open_tree_attr			sys_open_tree_attr
 468	n64	file_getattr			sys_file_getattr
 469	n64	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -457,3 +457,4 @@
 467	o32	open_tree_attr			sys_open_tree_attr
 468	o32	file_getattr			sys_file_getattr
 469	o32	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -468,3 +468,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -560,3 +560,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	nospu	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -472,3 +472,4 @@
 467  common	open_tree_attr		sys_open_tree_attr		sys_open_tree_attr
 468  common	file_getattr		sys_file_getattr		sys_file_getattr
 469  common	file_setattr		sys_file_setattr		sys_file_setattr
+470  common	rseq_slice_yield	sys_rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -475,3 +475,4 @@
 467	i386	open_tree_attr		sys_open_tree_attr
 468	i386	file_getattr		sys_file_getattr
 469	i386	file_setattr		sys_file_setattr
+470	i386	rseq_slice_yield	sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -393,6 +393,7 @@
 467	common	open_tree_attr		sys_open_tree_attr
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
+470	common	rseq_slice_yield	sys_rseq_slice_yield
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -957,6 +957,7 @@ asmlinkage long sys_statx(int dfd, const
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
 				   unsigned flags,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -858,8 +858,11 @@
 #define __NR_file_setattr 469
 __SYSCALL(__NR_file_setattr, sys_file_setattr)
 
+#define __NR_rseq_slice_yield 470
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
 #undef __NR_syscalls
-#define __NR_syscalls 470
+#define __NR_syscalls 471
 
 /*
  * 32 bit systems traditionally used different
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned
 	return -EFAULT;
 }
 
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+	if (need_resched()) {
+		schedule();
+		return 1;
+	}
+	return 0;
+}
+
 static int __init rseq_slice_cmdline(char *str)
 {
 	bool on;
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,5 +390,6 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
 
 COND_SYSCALL(uretprobe);
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -410,3 +410,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
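
Since libc does not provide a wrapper for a new syscall, user space invokes
it via syscall(2). A minimal sketch, assuming headers generated from the
tables above (470 is the number assigned there):

    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_rseq_slice_yield
    # define __NR_rseq_slice_yield	470
    #endif

    /* Returns 1 if the kernel rescheduled, 0 if no reschedule was pending */
    static inline long rseq_slice_yield(void)
    {
            return syscall(__NR_rseq_slice_yield);
    }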



* [patch 07/12] rseq: Implement syscall entry work for time slice extensions
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (5 preceding siblings ...)
  2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-10  5:22   ` K Prateek Nayak
  2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

The kernel sets SYSCALL_WORK_SYSCALL_RSEQ_SLICE when it grants a time slice
extension. This allows handling the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.

In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.

Doing it in syscall entry work allows catching misbehaving user space
which issues a syscall from the critical section. A wrong syscall or
inconsistent user space state results in a SIGSEGV.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/entry-common.h  |    2 -
 include/linux/rseq.h          |    2 +
 include/linux/thread_info.h   |   16 ++++----
 kernel/entry/syscall-common.c |   11 ++++-
 kernel/rseq.c                 |   80 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 10 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_RSEQ_SLICE |	\
 				 ARCH_SYSCALL_WORK_ENTER)
-
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -191,8 +191,10 @@ static inline void rseq_syscall(struct p
 #endif /* !CONFIG_DEBUG_RSEQ */
 
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
 int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
 static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
 {
 	return -EINVAL;
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
 };
 
-#define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU	BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP			BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU		BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT		BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH	BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP		BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE		BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
 #endif
 
 #include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
 	}
 }
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-				unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
 {
 	long ret = 0;
 
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
 			return -1L;
 	}
 
+	/*
+	 * User space got a time slice extension granted and relinquishes
+	 * the CPU. The work stops the slice timer to avoid an extra round
+	 * through hrtimer_interrupt().
+	 */
+	if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+		rseq_syscall_enter_work(syscall);
+
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = ptrace_report_syscall_entry(regs);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -491,6 +491,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+	/*
+	 * The interrupt guard is required to prevent inconsistent state in
+	 * this case:
+	 *
+	 * set_tsk_need_resched()
+	 * --> Interrupt
+	 *       wakeup()
+	 *        set_tsk_need_resched()
+	 *	  set_preempt_need_resched()
+	 *     schedule_on_return()
+	 *        clear_tsk_need_resched()
+	 *	  clear_preempt_need_resched()
+	 * set_preempt_need_resched()		<- Inconsistent state
+	 *
+	 * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+	 * only sets the already set bit and does not create inconsistent
+	 * state.
+	 */
+	scoped_guard(irq)
+		set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+	u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl;
+	u32 uval;
+
+	if (get_user_masked_u32(&uval, sctrl) || uval != expected)
+		force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+	struct task_struct *curr = current;
+	bool granted = curr->rseq.slice.state.granted;
+
+	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+	if (static_branch_unlikely(&rseq_debug_enabled))
+		rseq_slice_validate_ctrl(granted ? RSEQ_SLICE_EXT_GRANTED : 0);
+
+	/*
+	 * The kernel might have raced, revoked the grant and updated
+	 * userspace, but kept the SLICE work set.
+	 */
+	if (!granted)
+		return;
+
+	rseq_stat_inc(rseq_stats.s_yielded);
+
+	/*
+	 * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+	 * kernels.
+	 */
+	scoped_guard(preempt) {
+		/*
+		 * Now that preemption is disabled, quickly check whether
+		 * the task was already rescheduled before arriving here.
+		 */
+		if (!curr->rseq.event.sched_switch)
+			rseq_slice_set_need_resched(curr);
+	}
+
+	curr->rseq.slice.state.granted = false;
+	/*
+	 * Clear the grant in user space and check whether this was the
+	 * correct syscall to yield. If the user access fails or the task
+	 * used an arbitrary syscall, terminate it.
+	 */
+	if (put_user_masked_u32(0U, &curr->rseq.usrptr->slice_ctrl) ||
+	    syscall != __NR_rseq_slice_yield)
+		force_sig(SIGSEGV);
+}
+
 int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
 {
 	switch (arg2) {



* [patch 08/12] rseq: Implement time slice extension enforcement timer
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (6 preceding siblings ...)
  2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-10 11:20   ` K Prateek Nayak
  2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.

It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:

   1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
      independently of CONFIG_HIGHRES_TIMERS

   2) HRTICK usage in the scheduler can be runtime disabled or is only used
      for certain aspects of scheduling.

   3) The function is calling into the scheduler code and that might have
      unexpected consequences when this is invoked due to a time slice
      enforcement expiry. Especially when the task managed to clear the
      grant via sched_yield().

It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.

Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.

The timer is armed when an extenstion was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().

It is disarmed when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
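
For illustration, the intended pairing is roughly the following (a
simplified sketch using the helpers added below; the patch itself is
authoritative):

    exit to user mode, grant pending:
        rseq_arm_slice_extension_timer()
            hrtimer_start(&slice_timer.timer, slice.expires,
                          HRTIMER_MODE_ABS_PINNED_HARD)
            set_task_syscall_work(SYSCALL_RSEQ_SLICE)

    task relinquishes the CPU via rseq_slice_yield():
        rseq_syscall_enter_work()
            rseq_cancel_slice_extension_timer()
                hrtimer_try_to_cancel(&slice_timer.timer)

    grant overrun, timer fires:
        rseq_slice_expired()
            set_need_resched_current()

The grant length is taken from rseq_slice_ext_nsecs, which defaults to
30us and is clamped to the 10-50us range by the sysctl added below.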

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq_entry.h |   22 +++++++-
 include/linux/rseq_types.h |    2 
 kernel/rseq.c              |  119 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 140 insertions(+), 3 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -88,8 +88,24 @@ static __always_inline bool rseq_slice_e
 {
 	return static_branch_likely(&rseq_slice_extension_key);
 }
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+	if (!rseq_slice_extension_enabled())
+		return false;
+
+	if (likely(!current->rseq.slice.state.granted))
+		return false;
+
+	return __rseq_arm_slice_extension_timer();
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -560,8 +576,12 @@ static __always_inline void clear_tif_rs
 static __always_inline bool
 rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
 {
+	/*
+	 * Arm the slice extension timer if nothing to do anymore and the
+	 * task really goes out to user space.
+	 */
 	if (likely(!test_tif_rseq(ti_work)))
-		return false;
+		return rseq_arm_slice_extension_timer();
 
 	if (unlikely(__rseq_exit_to_user_mode_restart(regs)))
 		return true;
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -87,9 +87,11 @@ union rseq_slice_state {
 /**
  * struct rseq_slice - Status information for rseq time slice extension
  * @state:	Time slice extension state
+ * @expires:	The time when a grant expires
  */
 struct rseq_slice {
 	union rseq_slice_state	state;
+	u64			expires;
 };
 
 /**
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
 #define RSEQ_BUILD_SLOW_PATH
 
 #include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
 #include <linux/prctl.h>
 #include <linux/ratelimit.h>
 #include <linux/rseq_entry.h>
@@ -489,8 +491,82 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 }
 
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+	struct hrtimer	timer;
+	void		*cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+	if (st->cookie == current && current->rseq.slice.state.granted) {
+		rseq_stat_inc(rseq_stats.s_expired);
+		set_need_resched_current();
+	}
+	return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+	struct slice_timer *st = this_cpu_ptr(&slice_timer);
+	struct task_struct *curr = current;
+
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * This check prevents a granted time slice extension from exceeding
+	 * the maximum scheduling latency when the grant expired before
+	 * going out to user space. Don't bother to clear the grant here,
+	 * it will be cleaned up automatically before going out to user
+	 * space.
+	 */
+	if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
+		set_need_resched_current();
+		return true;
+	}
+
+	/*
+	 * Store the task pointer as a cookie for comparison in the timer
+	 * function. This is safe as the timer is CPU local and cannot be
+	 * in the expiry function at this point.
+	 */
+	st->cookie = curr;
+	hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+	/* Arm the syscall entry work */
+	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+	return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+	struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+	/*
+	 * st->cookie can be safely read as preemption is disabled and the
+	 * timer is CPU local. The active check can obviously race with the
+	 * hrtimer interrupt, but that's better than disabling interrupts
+	 * unconditionaly right away.
+	 *
+	 * As this is most probably the first expiring timer, the cancel is
+	 * expensive as it has to reprogram the hardware, but that's less
+	 * expensive than going through a full hrtimer_interrupt() cycle
+	 * for nothing.
+	 *
+	 * hrtimer_try_to_cancel() is sufficient here as with interrupts
+	 * disabled the timer callback cannot be running and the timer base
+	 * is well determined as the timer is pinned on the local CPU.
+	 */
+	if (st->cookie == current && hrtimer_active(&st->timer)) {
+		scoped_guard(irq)
+			hrtimer_try_to_cancel(&st->timer);
+	}
+}
+
 static inline void rseq_slice_set_need_resched(struct task_struct *curr)
 {
 	/*
@@ -548,10 +624,11 @@ void rseq_syscall_enter_work(long syscal
 	rseq_stat_inc(rseq_stats.s_yielded);
 
 	/*
-	 * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
-	 * kernels.
+	 * Required to stabilize the per CPU timer pointer and to make
+	 * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
 	 */
 	scoped_guard(preempt) {
+		rseq_cancel_slice_extension_timer();
 		/*
 		 * Now that preemption is disabled, quickly check whether
 		 * the task was already rescheduled before arriving here.
@@ -631,6 +708,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
 	return 0;
 }
 
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+	{
+		.procname	= "rseq_slice_extension_nsec",
+		.data		= &rseq_slice_ext_nsecs,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= (unsigned int *)&rseq_slice_ext_nsecs_min,
+		.extra2		= (unsigned int *)&rseq_slice_ext_nsecs_max,
+	},
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+	if (rseq_slice_extension_enabled())
+		register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif  /* !CONFIG_SYSCTL */
+
 static int __init rseq_slice_cmdline(char *str)
 {
 	bool on;
@@ -643,4 +745,17 @@ static int __init rseq_slice_cmdline(cha
 	return 0;
 }
 __setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+			      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+	}
+	rseq_slice_sysctl_init();
+	return 0;
+}
+device_initcall(rseq_slice_init);
 #endif /* CONFIG_RSEQ_SLICE_EXTENSION */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 09/12] rseq: Reset slice extension when scheduled
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (7 preceding siblings ...)
  2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it is scheduled back in and need_resched() is
not set, the stale grant would be preserved, which is just wrong.

RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.

Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just one
more unconditional store in that path.
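
For illustration, the scenario being addressed (a simplified sketch using
the function names of this series):

    exit_to_user_mode_loop()
        rseq_grant_slice_extension()     -> GRANTED set in rseq->slice_ctrl
        other pending work
            schedule()                   -> task scheduled out anyway
    ...
    task scheduled back in, TIF_RSEQ set
        rseq IDs/critical section update -> rseq->slice_ctrl cleared and the
                                            kernel side grant state reset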

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq_entry.h |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -103,9 +103,17 @@ static __always_inline bool rseq_arm_sli
 	return __rseq_arm_slice_extension_timer();
 }
 
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+	if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+		rseq_stat_inc(rseq_stats.s_revoked);
+	t->rseq.slice.state.granted = false;
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
 static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -404,6 +412,13 @@ bool rseq_set_ids_get_csaddr(struct task
 	unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
 	if (csaddr)
 		unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+	/* Open coded, so it's in the same user access region */
+	if (rseq_slice_extension_enabled()) {
+		/* Unconditionally clear it, no point in conditionals */
+		unsafe_put_user(0U, &rseq->slice_ctrl, efault);
+		rseq_slice_clear_grant(t);
+	}
 	user_access_end();
 
 	/* Cache the new values */
@@ -518,10 +533,19 @@ static __always_inline bool __rseq_exit_
 		 * If IDs have not changed rseq_event::user_irq must be true
 		 * See rseq_sched_switch_event().
 		 */
+		struct rseq __user *rseq = t->rseq.usrptr;
 		u64 csaddr;
 
-		if (unlikely(get_user_masked_u64(&csaddr, &t->rseq.usrptr->rseq_cs)))
+		if (!user_rw_masked_begin(rseq))
 			goto fail;
+		unsafe_get_user(csaddr, &rseq->rseq_cs, fault);
+		/* Open coded, so it's in the same user access region */
+		if (rseq_slice_extension_enabled()) {
+			/* Unconditionally clear it, no point in conditionals */
+			unsafe_put_user(0U, &rseq->slice_ctrl, fault);
+			rseq_slice_clear_grant(t);
+		}
+		user_access_end();
 
 		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
 			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -545,6 +569,8 @@ static __always_inline bool __rseq_exit_
 	t->rseq.event.events = 0;
 	return false;
 
+fault:
+	user_access_end();
 fail:
 	pagefault_enable();
 	/* Force it into the slow path. Don't clear the state! */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 10/12] rseq: Implement rseq_grant_slice_extension()
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (8 preceding siblings ...)
  2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-09  8:14   ` K Prateek Nayak
  2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.

The decision is made in two stages. First, an inline quick check avoids
going into the actual decision function. It checks whether:

 #1 the functionality is enabled

 #2 the exit is a return from interrupt to user mode

 #3 any TIF bit which causes extra work is set. That includes TIF_RSEQ,
    which means the task was already scheduled out.
 
The slow path, which implements the actual user space ABI, is invoked
when:

  A) #1 is true, #2 is true and #3 is false

     It checks whether user space requested a slice extension by setting
     the request bit in the rseq slice_ctrl field. If so, it grants the
     extension and stores the slice expiry time, so that the actual exit
     code can double check whether the slice is already exhausted before
     going back.

  B) #1 - #3 are true _and_ a slice extension was granted in a previous
     loop iteration

     In this case the grant is revoked.

If the user space access faults or an invalid state is detected, the
task is terminated with SIGSEGV.
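
For illustration, the resulting decision flow (a simplified sketch; the
authoritative version is rseq_grant_slice_extension() in the patch below):

    if (!extension enabled || !return from interrupt to user mode)
        return false;                     /* schedule() as usual */

    if (other work pending || extension already granted) {
        rseq->slice_ctrl = 0;             /* deny respectively revoke */
        return false;
    }

    if (rseq->slice_ctrl & RSEQ_SLICE_EXT_REQUEST) {
        rseq->slice_ctrl = RSEQ_SLICE_EXT_GRANTED;
        slice.expires = now + rseq_slice_ext_nsecs;
        clear NEED_RESCHED;               /* with interrupts disabled */
        return true;                      /* skip schedule() this time */
    }
    return false;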

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq_entry.h |  111 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -41,6 +41,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #ifdef CONFIG_RSEQ
 #include <linux/jump_label.h>
 #include <linux/rseq.h>
+#include <linux/sched/signal.h>
 #include <linux/uaccess.h>
 
 #include <uapi/linux/rseq.h>
@@ -110,10 +111,120 @@ static __always_inline void rseq_slice_c
 	t->rseq.slice.state.granted = false;
 }
 
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+	struct task_struct *curr = current;
+	union rseq_slice_state state;
+	struct rseq __user *rseq;
+	u32 usr_ctrl;
+
+	if (!rseq_slice_extension_enabled())
+		return false;
+
+	/* If not enabled or not a return from interrupt, nothing to do. */
+	state = curr->rseq.slice.state;
+	state.enabled &= curr->rseq.event.user_irq;
+	if (likely(!state.state))
+		return false;
+
+	rseq = curr->rseq.usrptr;
+	if (!user_rw_masked_begin(rseq))
+		goto die;
+
+	/*
+	 * Quick check conditions where a grant is not possible or
+	 * needs to be revoked.
+	 *
+	 *  1) Any TIF bit which needs to do extra work aside of
+	 *     rescheduling prevents a grant.
+	 *
+	 *  2) A previous rescheduling request resulted in a slice
+	 *     extension grant.
+	 */
+	if (unlikely(work_pending || state.granted)) {
+		/* Clear user control unconditionally. No point in checking */
+		unsafe_put_user(0U, &rseq->slice_ctrl, fail);
+		user_access_end();
+		rseq_slice_clear_grant(curr);
+		return false;
+	}
+
+	unsafe_get_user(usr_ctrl, &rseq->slice_ctrl, fail);
+	if (likely(!(usr_ctrl & RSEQ_SLICE_EXT_REQUEST))) {
+		user_access_end();
+		return false;
+	}
+
+	/* Grant the slice extension */
+	unsafe_put_user(RSEQ_SLICE_EXT_GRANTED, &rseq->slice_ctrl, fail);
+	user_access_end();
+
+	rseq_stat_inc(rseq_stats.s_granted);
+
+	curr->rseq.slice.state.granted = true;
+	/* Store expiry time for arming the timer on the way out */
+	curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+	/*
+	 * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+	 * several ways:
+	 *
+	 * 1)
+	 *	CPU0			CPU1
+	 *	clear_tsk()
+	 *				set_tsk()
+	 *	clear_preempt()
+	 *				Raise scheduler IPI on CPU0
+	 *	--> IPI
+	 *	    fold_need_resched() -> Folds correctly
+	 * 2)
+	 *	CPU0			CPU1
+	 *				set_tsk()
+	 *	clear_tsk()
+	 *	clear_preempt()
+	 *				Raise scheduler IPI on CPU0
+	 *	--> IPI
+	 *	    fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+	 *
+	 * #1 is not any different from a regular remote reschedule as it
+	 *    sets the previously not set bit and then raises the IPI which
+	 *    folds it into the preempt counter
+	 *
+	 * #2 is obviously incorrect from a scheduler POV, but it's not
+	 *    differently incorrect than the code below clearing the
+	 *    reschedule request with the safety net of the timer.
+	 *
+	 * The important part is that the clearing is protected against the
+	 * scheduler IPI and also against any other interrupt which might
+	 * end up waking up a task and setting the bits in the middle of
+	 * the operation:
+	 *
+	 *	clear_tsk()
+	 *	---> Interrupt
+	 *		wakeup_on_this_cpu()
+	 *		set_tsk()
+	 *		set_preempt()
+	 *	clear_preempt()
+	 *
+	 * which would be inconsistent state.
+	 */
+	scoped_guard(irq) {
+		clear_tsk_need_resched(curr);
+		clear_preempt_need_resched();
+	}
+	return true;
+
+fail:
+	user_access_end();
+die:
+	force_sig(SIGSEGV);
+	return false;
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
 static inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 11/12] entry: Hook up rseq time slice extension
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (9 preceding siblings ...)
  2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Wire the grant decision function up in exit_to_user_mode_loop()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/entry/common.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
 #define EXIT_TO_USER_MODE_WORK_LOOP	(EXIT_TO_USER_MODE_WORK)
 #endif
 
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY	(EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
 							      unsigned long ti_work)
 {
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 12/12] selftests/rseq: Implement time slice extension test
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (10 preceding siblings ...)
  2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
@ 2025-09-08 23:00 ` Thomas Gleixner
  2025-09-10 11:23   ` K Prateek Nayak
  2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-08 23:00 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and ariving in rseq_slice_yield().
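
The race in question, roughly:

    user space observes GRANTED in rseq->slice_ctrl
        -> interrupt, the kernel revokes the grant and reschedules
    user space invokes rseq_slice_yield()
        -> need_resched() is already false, the syscall returns 0
           (counted as "Raced" by the test below)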

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/rseq/.gitignore   |    1 
 tools/testing/selftests/rseq/Makefile     |    5 
 tools/testing/selftests/rseq/rseq-abi.h   |    2 
 tools/testing/selftests/rseq/slice_test.c |  217 ++++++++++++++++++++++++++++++
 4 files changed, 224 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
 param_test_mm_cid_benchmark
 param_test_mm_cid_compare_twice
 syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
 		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test
+		syscall_errors_test slice_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
@@ -59,3 +59,6 @@ include ../lib.mk
 $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
 					rseq.h rseq-*.h
 	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -164,6 +164,8 @@ struct rseq_abi {
 	 */
 	__u32 mm_cid;
 
+	__u32 slice_ctrl;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield	470
+#endif
+
+#define BITS_PER_INT	32
+#define BITS_PER_BYTE	8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION		79
+#  define PR_RSEQ_SLICE_EXTENSION_GET		1
+#  define PR_RSEQ_SLICE_EXTENSION_SET		2
+#  define PR_RSEQ_SLICE_EXT_ENABLE		0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT	0
+# define RSEQ_SLICE_EXT_GRANTED_BIT	1
+#endif
+
+#ifndef asm_inline
+# define asm_inline	asm __inline
+#endif
+
+#if defined(__x86_64__) || defined(__i386__)
+static __always_inline bool test_and_clear_request(unsigned int *addr)
+{
+	const unsigned int bit = RSEQ_SLICE_EXT_REQUEST_BIT;
+	bool res;
+
+	asm_inline volatile("btrl %[__bit], %[__addr]\n"
+			    : [__addr] "+m" (*addr), "=@cc" "c" (res)
+			    : [__bit] "Ir" (bit)
+			    : "memory");
+	return res;
+}
+#else
+static __always_inline bool test_and_clear_request(unsigned int *addr)
+{
+	const unsigned int mask = (1U << RSEQ_SLICE_EXT_REQUEST_BIT);
+
+	return __atomic_fetch_and(addr, ~mask, __ATOMIC_RELAXED) & mask;
+}
+#endif
+
+static __always_inline void set_request(unsigned int *addr)
+{
+	*addr = 1U << RSEQ_SLICE_EXT_REQUEST_BIT;
+}
+
+static __always_inline bool test_granted(unsigned int *addr)
+{
+	return !!(*addr & (1U << RSEQ_SLICE_EXT_GRANTED_BIT));
+}
+
+#define NSEC_PER_SEC	1000000000L
+#define NSEC_PER_USEC	      1000L
+
+struct noise_params {
+	int	noise_nsecs;
+	int	sleep_nsecs;
+	int	run;
+};
+
+FIXTURE(slice_ext)
+{
+	pthread_t		noise_thread;
+	struct noise_params	noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+	int64_t	total_nsecs;
+	int	slice_nsecs;
+	int	noise_nsecs;
+	int	sleep_nsecs;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+	.total_nsecs	=  5 * NSEC_PER_SEC,
+	.slice_nsecs	=  2 * NSEC_PER_USEC,
+	.noise_nsecs    =  2 * NSEC_PER_USEC,
+	.sleep_nsecs	= 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+	.total_nsecs	=  5 * NSEC_PER_SEC,
+	.slice_nsecs	= 50 * NSEC_PER_USEC,
+	.noise_nsecs    =  2 * NSEC_PER_USEC,
+	.sleep_nsecs	= 50 * NSEC_PER_USEC,
+};
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+			   int64_t span)
+{
+	int64_t delta = now->tv_sec - start->tv_sec;
+
+	delta *= NSEC_PER_SEC;
+	delta += now->tv_nsec - start->tv_nsec;
+	return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+	struct noise_params *p = arg;
+
+	while (RSEQ_READ_ONCE(p->run)) {
+		struct timespec ts_start, ts_now;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
+		do {
+			clock_gettime(CLOCK_MONOTONIC, &ts_now);
+		} while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+		ts_start.tv_sec = 0;
+		ts_start.tv_nsec = p->sleep_nsecs;
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+	}
+	return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+	cpu_set_t affinity;
+
+	ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+	/* Pin it on a single CPU. Avoid CPU 0 */
+	for (int i = 1; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+
+		CPU_ZERO(&affinity);
+		CPU_SET(i, &affinity);
+		ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+		break;
+	}
+
+	ASSERT_EQ(rseq_register_current_thread(), 0);
+
+	ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+			PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+	self->noise_params.noise_nsecs = variant->noise_nsecs;
+	self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+	self->noise_params.run = 1;
+
+	ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+	self->noise_params.run = 0;
+	pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+	unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+	struct rseq_abi *rs = rseq_get_abi();
+	struct timespec ts_start, ts_now;
+
+	ASSERT_NE(rs, NULL);
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+	do {
+		struct timespec ts_cs;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+		set_request(&rs->slice_ctrl);
+		do {
+			clock_gettime(CLOCK_MONOTONIC, &ts_now);
+		} while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+		if (!test_and_clear_request(&rs->slice_ctrl)) {
+			if (test_granted(&rs->slice_ctrl)) {
+				yielded++;
+				if (!syscall(__NR_rseq_slice_yield))
+					raced++;
+			} else {
+				scheduled++;
+			}
+		} else {
+			success++;
+		}
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_now);
+	} while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+	printf("# Success   %12ld\n", success);
+	printf("# Yielded   %12ld\n", yielded);
+	printf("# Scheduled %12ld\n", scheduled);
+	printf("# Raced     %12ld\n", raced);
+}
+
+TEST_HARNESS_MAIN


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-09-09  0:04   ` Randy Dunlap
  2025-09-11 15:41   ` Mathieu Desnoyers
  2025-09-22  5:28   ` Prakash Sangappa
  2 siblings, 0 replies; 54+ messages in thread
From: Randy Dunlap @ 2025-09-09  0:04 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch

Hi Thomas,

On 9/8/25 3:59 PM, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
> 

> ---
>  Documentation/userspace-api/index.rst |    1 
>  Documentation/userspace-api/rseq.rst  |  129 ++++++++++++++++++++++++++++++++++
>  include/linux/rseq_types.h            |   26 ++++++
>  include/uapi/linux/rseq.h             |   28 +++++++
>  init/Kconfig                          |   12 +++
>  kernel/rseq.c                         |    8 ++
>  6 files changed, 204 insertions(+)
> 

> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:

userspace or user-space or user space -- be consistent, please.
(above 2 times, and more below)

FWIW, "userspace" overwhelmingly wins in the kernel source tree.
On the $internet it looks like "user space" wins (quick look).


> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
just                              heavyweight

> +ABI is unfortunately only available in the code and selftests.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> +    * Enabled in Kconfig
> +
> +    * Enabled at boot time (default is enabled)
> +
> +    * A rseq user space pointer has been registered for the thread

                ^^^^^^^^^^

> +
> +The thread has to enable the functionality via prctl(2)::
> +
> +    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> +          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg4 and arg5 must be zero
> +ENOTSUPP  Functionality was disabled on the kernel command line
> +ENXIO	  Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> +  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user

                                                          read-only for

> +space and only for informational purposes.

   userspace ?

> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling

                                      ^^^^^^^^^^

> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.

                                                         ?

> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.

   ?

> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()

                                       ?

> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from

OK, I think that you like "user space".  :)

> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> +    rseq->slice_ctrl = REQUEST;
> +    critical_section();
> +    if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> +        if (rseq->slice_ctrl & GRANTED)
> +                rseq_slice_yield();
> +    }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU

                          no comma      ^

> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> +    if (rseq->slice_ctrl & GRANTED)
> +      -> Interrupt results in schedule and grant revocation
> +        rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.
> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.

> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
>  };
>  
>  enum rseq_cs_flags_bit {
> +	/* Historical and unsupported bits */
>  	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
>  	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
>  	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
> +	/* (3) Intentional gap to put new bits into a seperate byte */

	                                              separate
("There is a rat in separate." -- old clue)
           'arat'

> +
> +	/* User read only feature flags */
> +	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
> +	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
>  };
>  
>  enum rseq_cs_flags {

> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>  
>  	  If unsure, say N.
>  
> +config RSEQ_SLICE_EXTENSION
> +	bool "Enable rseq based time slice extension mechanism"

	             rseq-based

> +	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> +	help
> +          Allows userspace to request a limited time slice extension when

Use tab + 2 spaces above instead of N spaces.

> +	  returning from an interrupt to user space via the RSEQ shared
> +	  data ABI. If granted, that allows to complete a critical section,
> +	  so that other threads are not stuck on a conflicted resource,
> +	  while the task is scheduled out.
-- 
~Randy


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
  2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-09-09  3:10   ` K Prateek Nayak
  2025-09-09  4:11     ` Randy Dunlap
  2025-09-11 15:42   ` Mathieu Desnoyers
  1 sibling, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-09  3:10 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static int __init rseq_slice_cmdline(char *str)
> +{
> +	bool on;
> +
> +	if (kstrtobool(str, &on))
> +		return -EINVAL;
> +
> +	if (!on)
> +		static_branch_disable(&rseq_slice_extension_key);
> +	return 0;

I believe this should return "1" signalling that the cmdline was handled
correctly to avoid an "Unknown kernel command line parameters" message.

> +}
> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
> 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
  2025-09-09  3:10   ` K Prateek Nayak
@ 2025-09-09  4:11     ` Randy Dunlap
  2025-09-09 12:12       ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Randy Dunlap @ 2025-09-09  4:11 UTC (permalink / raw)
  To: K Prateek Nayak, Thomas Gleixner, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch



On 9/8/25 8:10 PM, K Prateek Nayak wrote:
> Hello Thomas,
> 
> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>> +
>> +static int __init rseq_slice_cmdline(char *str)
>> +{
>> +	bool on;
>> +
>> +	if (kstrtobool(str, &on))
>> +		return -EINVAL;
>> +
>> +	if (!on)
>> +		static_branch_disable(&rseq_slice_extension_key);
>> +	return 0;
> 
> I believe this should return "1" signalling that the cmdline was handled
> correctly to avoid an "Unknown kernel command line parameters" message.

Good catch. I agree.
Thanks.

>> +}
>> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
>> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
>>
> 

-- 
~Randy


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 10/12] rseq: Implement rseq_grant_slice_extension()
  2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-09-09  8:14   ` K Prateek Nayak
  2025-09-09 12:16     ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-09  8:14 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>  #else /* CONFIG_RSEQ_SLICE_EXTENSION */
>  static inline bool rseq_slice_extension_enabled(void) { return false; }
>  static inline bool rseq_arm_slice_extension_timer(void) { return false; }
>  static inline void rseq_slice_clear_grant(struct task_struct *t) { }
> +static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }

This is still under the CONFIG_RSEQ block and when building with
CONFIG_RSEQ disabled gives the following error with changes from
Patch 11:

    kernel/entry/common.c:40:30: error: implicit declaration of function ‘rseq_grant_slice_extension’ [-Werror=implicit-function-declaration]
       40 |                         if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))

Putting the rseq_grant_slice_extension() definition from above in
a separate "ifndef CONFIG_RSEQ_SLICE_EXTENSION" block at the end
keeps the build happy.

>  #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
>  
>  bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
> 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 06/12] rseq: Implement sys_rseq_slice_yield()
  2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-09-09  9:52   ` K Prateek Nayak
  2025-09-09 12:23     ` Thomas Gleixner
  2025-09-10 11:15   ` K Prateek Nayak
  1 sibling, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-09  9:52 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior

Hello Thomas,

On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned
>  	return -EFAULT;
>  }
>  
> +SYSCALL_DEFINE0(rseq_slice_yield)
> +{
> +	if (need_resched()) {
> +		schedule();
> +		return 1;
> +	}
> +	return 0;
> +}
> +
>  static int __init rseq_slice_cmdline(char *str)
>  {
>  	bool on;
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -390,5 +390,6 @@ COND_SYSCALL(setuid16);
>  
>  /* restartable sequence */
>  COND_SYSCALL(rseq);
> +COND_SYSCALL(rseq_sched_yield);

I'm not sure if it is my toolchain but when I try to build a version
with CONFIG_RSEQ_SLICE_EXTENSION disabled, I see:

    ld: vmlinux.o: in function `x64_sys_call':
    arch/x86/include/generated/asm/syscalls_64.h:471: undefined reference to `__x64_sys_rseq_slice_yield'
    ld: vmlinux.o: in function `ia32_sys_call':
    arch/x86/include/generated/asm/syscalls_32.h:471: undefined reference to `__ia32_sys_rseq_slice_yield'
    ld: vmlinux.o:(.rodata+0x12d0): undefined reference to `__x64_sys_rseq_slice_yield'

I would have assumed the COND_SYSCALL() above would have stubbed this
but that doesn't seem to be the case. Am I missing something?
P.S. I'm running with:

    gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04.2)
    GNU ld (GNU Binutils for Ubuntu) 2.38

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
  2025-09-09  4:11     ` Randy Dunlap
@ 2025-09-09 12:12       ` Thomas Gleixner
  2025-09-09 16:01         ` Randy Dunlap
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:12 UTC (permalink / raw)
  To: Randy Dunlap, K Prateek Nayak, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On Mon, Sep 08 2025 at 21:11, Randy Dunlap wrote:
> On 9/8/25 8:10 PM, K Prateek Nayak wrote:
>> Hello Thomas,
>> 
>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>>> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>>> +
>>> +static int __init rseq_slice_cmdline(char *str)
>>> +{
>>> +	bool on;
>>> +
>>> +	if (kstrtobool(str, &on))
>>> +		return -EINVAL;
>>> +
>>> +	if (!on)
>>> +		static_branch_disable(&rseq_slice_extension_key);
>>> +	return 0;
>> 
>> I believe this should return "1" signalling that the cmdline was handled
>> correctly to avoid an "Unknown kernel command line parameters" message.
>
> Good catch. I agree.
> Thanks.

It seems I can't get that right ever ....

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 10/12] rseq: Implement rseq_grant_slice_extension()
  2025-09-09  8:14   ` K Prateek Nayak
@ 2025-09-09 12:16     ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:16 UTC (permalink / raw)
  To: K Prateek Nayak, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On Tue, Sep 09 2025 at 13:44, K. Prateek Nayak wrote:

> Hello Thomas,
>
> On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>>  #else /* CONFIG_RSEQ_SLICE_EXTENSION */
>>  static inline bool rseq_slice_extension_enabled(void) { return false; }
>>  static inline bool rseq_arm_slice_extension_timer(void) { return false; }
>>  static inline void rseq_slice_clear_grant(struct task_struct *t) { }
>> +static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
>
> This is still under the CONFIG_RSEQ block and when building with
> CONFIG_RSEQ disabled gives the following error with changes from
> Patch 11:
>
>     kernel/entry/common.c:40:30: error: implicit declaration of function ‘rseq_grant_slice_extension’ [-Werror=implicit-function-declaration]
>        40 |                         if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
>
> Putting the rseq_grant_slice_extension() definition from above in
> a separate "ifndef CONFIG_RSEQ_SLICE_EXTENSION" block at the end
> keeps the build happy.

Duh, yes.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 06/12] rseq: Implement sys_rseq_slice_yield()
  2025-09-09  9:52   ` K Prateek Nayak
@ 2025-09-09 12:23     ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:23 UTC (permalink / raw)
  To: K Prateek Nayak, LKML
  Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior

On Tue, Sep 09 2025 at 15:22, K. Prateek Nayak wrote:
> On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>>  /* restartable sequence */
>>  COND_SYSCALL(rseq);
>> +COND_SYSCALL(rseq_sched_yield);
>
> I'm not sure if it is my toolchain but when I try to build a version
> with CONFIG_RSEQ_SLICE_EXTENSION disabled, I see:
>
>     ld: vmlinux.o: in function `x64_sys_call':
>     arch/x86/include/generated/asm/syscalls_64.h:471: undefined reference to `__x64_sys_rseq_slice_yield'
>     ld: vmlinux.o: in function `ia32_sys_call':
>     arch/x86/include/generated/asm/syscalls_32.h:471: undefined reference to `__ia32_sys_rseq_slice_yield'
>     ld: vmlinux.o:(.rodata+0x12d0): undefined reference to `__x64_sys_rseq_slice_yield'
>
> I would have assumed the COND_SYSCALL() above would have stubbed this
> but that doesn't seem to be the case. Am I missing something?

Yes.

>> +COND_SYSCALL(rseq_sched_yield);

does not create a stub for rseq_slice_yield() obviously :)

/me looks for a brown paperbag.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (11 preceding siblings ...)
  2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-09-09 12:37 ` Thomas Gleixner
  2025-09-10  4:42   ` K Prateek Nayak
  2025-09-10 11:28 ` K Prateek Nayak
  2025-09-11 15:27 ` Mathieu Desnoyers
  14 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-09 12:37 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
	Randy Dunlap

On Tue, Sep 09 2025 at 00:59, Thomas Gleixner wrote:
> For your convenience all of it is also available as a conglomerate from
> git:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Force pushed a new version into the branch, which addresses the initial
feedback and fallout.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
  2025-09-09 12:12       ` Thomas Gleixner
@ 2025-09-09 16:01         ` Randy Dunlap
  0 siblings, 0 replies; 54+ messages in thread
From: Randy Dunlap @ 2025-09-09 16:01 UTC (permalink / raw)
  To: Thomas Gleixner, K Prateek Nayak, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch



On 9/9/25 5:12 AM, Thomas Gleixner wrote:
> On Mon, Sep 08 2025 at 21:11, Randy Dunlap wrote:
>> On 9/8/25 8:10 PM, K Prateek Nayak wrote:
>>> Hello Thomas,
>>>
>>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>>>> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>>>> +
>>>> +static int __init rseq_slice_cmdline(char *str)
>>>> +{
>>>> +	bool on;
>>>> +
>>>> +	if (kstrtobool(str, &on))
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (!on)
>>>> +		static_branch_disable(&rseq_slice_extension_key);
>>>> +	return 0;
>>>
>>> I believe this should return "1" signalling that the cmdline was handled
>>> correctly to avoid an "Unknown kernel command line parameters" message.
>>
>> Good catch. I agree.
>> Thanks.
> 
> It seems I can't get that right ever ....

Yeah, it's bass-ackwards.

I guess that's partly why we have early_param() and friends.

-- 
~Randy


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-09-10  4:42   ` K Prateek Nayak
  0 siblings, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10  4:42 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch, Randy Dunlap

Hello Thomas,

On 9/9/2025 6:07 PM, Thomas Gleixner wrote:
> On Tue, Sep 09 2025 at 00:59, Thomas Gleixner wrote:
>> For your convenience all of it is also available as a conglomerate from
>> git:
>>
>>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
> 
> Force pushed a new version into the branch, which addresses the initial
> feedback and fallout.

Everything builds fine now and the rseq selftests are happy too. Feel
free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 07/12] rseq: Implement syscall entry work for time slice extensions
  2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-09-10  5:22   ` K Prateek Nayak
  2025-09-10  7:49     ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10  5:22 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> +static inline void rseq_slice_set_need_resched(struct task_struct *curr)
> +{
> +	/*
> +	 * The interrupt guard is required to prevent inconsistent state in
> +	 * this case:
> +	 *
> +	 * set_tsk_need_resched()
> +	 * --> Interrupt
> +	 *       wakeup()
> +	 *        set_tsk_need_resched()
> +	 *	  set_preempt_need_resched()
> +	 *     schedule_on_return()
> +	 *        clear_tsk_need_resched()
> +	 *	  clear_preempt_need_resched()
> +	 * set_preempt_need_resched()		<- Inconsistent state
> +	 *
> +	 * This is safe vs. a remote set of TIF_NEED_RESCHED because that
> +	 * only sets the already set bit and does not create inconsistent
> +	 * state.
> +	 */
> +	scoped_guard(irq)
> +		set_need_resched_current();

nit. any specific reason for using a scoped_guard() instead of just a
guard() here (and in rseq_cancel_slice_extension_timer()) other than to
prominently highlight what is being guarded?

> +}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 07/12] rseq: Implement syscall entry work for time slice extensions
  2025-09-10  5:22   ` K Prateek Nayak
@ 2025-09-10  7:49     ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-10  7:49 UTC (permalink / raw)
  To: K Prateek Nayak, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On Wed, Sep 10 2025 at 10:52, K. Prateek Nayak wrote:
> On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
>> +static inline void rseq_slice_set_need_resched(struct task_struct *curr)
>> +{
>> +	/*
>> +	 * The interrupt guard is required to prevent inconsistent state in
>> +	 * this case:
>> +	 *
>> +	 * set_tsk_need_resched()
>> +	 * --> Interrupt
>> +	 *       wakeup()
>> +	 *        set_tsk_need_resched()
>> +	 *	  set_preempt_need_resched()
>> +	 *     schedule_on_return()
>> +	 *        clear_tsk_need_resched()
>> +	 *	  clear_preempt_need_resched()
>> +	 * set_preempt_need_resched()		<- Inconsistent state
>> +	 *
>> +	 * This is safe vs. a remote set of TIF_NEED_RESCHED because that
>> +	 * only sets the already set bit and does not create inconsistent
>> +	 * state.
>> +	 */
>> +	scoped_guard(irq)
>> +		set_need_resched_current();
>
> nit. any specific reason for using a scoped_guard() instead of just a
> guard() here (and in rseq_cancel_slice_extension_timer()) other than to
> prominently highlight what is being guarded?

Yes, the intention was to highlight it and scoped_guard() really
does. From a code generation perspective it's the same outcome.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 06/12] rseq: Implement sys_rseq_slice_yield()
  2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
  2025-09-09  9:52   ` K Prateek Nayak
@ 2025-09-10 11:15   ` K Prateek Nayak
  1 sibling, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:15 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Arnd Bergmann, linux-arch, Peter Zilstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior

Hello Thomas,

On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned
>  	return -EFAULT;
>  }
>  

nit.

Perhaps a small note here to highlight how need_resched() is true
for tasks who had the slice extension granted. Something like:

/**
 * sys_rseq_slice_yield - yield the current processor if a task granted with
 * slice extension is done with the critical work before being forced out.
 *
 * This syscall entry work ensures NEED_RESCHED is set if the task was granted
 * a slice extension before arriving here.
 *
 * Return: 1 if the task successfully yielded the CPU within the granted slice.
 *	   0 if the slice extension was either never granted or was revoked by
 *	   going over the granted extension.
 */

> +SYSCALL_DEFINE0(rseq_slice_yield)
> +{
> +	if (need_resched()) {
> +		schedule();
> +		return 1;
> +	}
> +	return 0;
> +}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 08/12] rseq: Implement time slice extension enforcement timer
  2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-09-10 11:20   ` K Prateek Nayak
  0 siblings, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:20 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> The timer is armed when an extenstion was granted right before actually

nit. s/extenstion/extension/ 

> returning to user mode in rseq_exit_to_user_mode_restart().

[..snip..]

> +static void rseq_cancel_slice_extension_timer(void)
> +{
> +	struct slice_timer *st = this_cpu_ptr(&slice_timer);
> +
> +	/*
> +	 * st->cookie can be safely read as preemption is disabled and the
> +	 * timer is CPU local. The active check can obviously race with the
> +	 * hrtimer interrupt, but that's better than disabling interrupts
> +	 * unconditionaly right away.

nit. s/unconditionaly/unconditionally/

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 12/12] selftests/rseq: Implement time slice extension test
  2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-09-10 11:23   ` K Prateek Nayak
  0 siblings, 0 replies; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:23 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/9/2025 4:30 AM, Thomas Gleixner wrote:
> Provide an initial test case to evaluate the functionality. This needs to be
> extended to cover the ABI violations and expose the race condition between
> observing granted and ariving in rseq_slice_yield().

nit. s/ariving/arriving/

I finally managed to trigger that cheeky race condition too :)

# Starting 2 tests from 2 test cases.
#  RUN           slice_ext.n2_2_50.slice_test ...
# Success        2088616
# Yielded          45097
# Scheduled          174
# Raced                2
#            OK  slice_ext.n2_2_50.slice_test

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (12 preceding siblings ...)
  2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-09-10 11:28 ` K Prateek Nayak
  2025-09-10 14:50   ` Thomas Gleixner
  2025-09-11 15:27 ` Mathieu Desnoyers
  14 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-10 11:28 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
> For your convenience all of it is also available as a conglomerate from
> git:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Apart from a couple of nit picks, I couldn't spot anything out of place
and the overall approach looks solid. Please feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-10 11:28 ` K Prateek Nayak
@ 2025-09-10 14:50   ` Thomas Gleixner
  2025-09-11  3:03     ` K Prateek Nayak
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-10 14:50 UTC (permalink / raw)
  To: K Prateek Nayak, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>> For your convenience all of it is also available as a conglomerate from
>> git:
>> 
>>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Apart from a couple of nit picks, I couldn't spot anything out of place
> and the overall approach looks solid. Please feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks a lot for going through it and testing.

Do you have a real workload or a mockup at hand, which benefits
from that slice extension functionality?

It would be really nice to have more than a pretty lame selftest.

thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-10 14:50   ` Thomas Gleixner
@ 2025-09-11  3:03     ` K Prateek Nayak
  2025-09-11  7:36       ` Prakash Sangappa
  0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-11  3:03 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

Hello Thomas,

On 9/10/2025 8:20 PM, Thomas Gleixner wrote:
> On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>> For your convenience all of it is also available as a conglomerate from
>>> git:
>>>
>>>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>
>> Apart from a couple of nit picks, I couldn't spot anything out of place
>> and the overall approach looks solid. Please feel free to include:
>>
>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> Thanks a lot for going through it and testing.
> 
> Do you have a real workload or a mockup at hand, which benefits
> from that slice extension functionality?

Not at the moment, but we did have some interest in this feature
internally. Give me a week and I'll let you know if they have found a
use case / have a prototype to test this.

In the meantime, Prakash should have a test bench that he used to
test his early RFC
https://lore.kernel.org/lkml/20241113000126.967713-1-prakash.sangappa@oracle.com/

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-11  3:03     ` K Prateek Nayak
@ 2025-09-11  7:36       ` Prakash Sangappa
  0 siblings, 0 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-11  7:36 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Madadi Vineeth Reddy, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch@vger.kernel.org



> On Sep 11, 2025, at 5:03 AM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Thomas,
> 
> On 9/10/2025 8:20 PM, Thomas Gleixner wrote:
>> On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
>>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>>> For your convenience all of it is also available as a conglomerate from
>>>> git:
>>>> 
>>>>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>> 
>>> Apart from a couple of nit picks, I couldn't spot anything out of place
>>> and the overall approach looks solid. Please feel free to include:
>>> 
>>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> 
>> Thanks a lot for going through it and testing.
>> 
>> Do you have a real workload or a mockup at hand, which benefits
>> from that slice extension functionality?
> 
> Not at the moment but we did have some interest for this feature
> internally. Give me a week and I'll let you know if they had found a
> use-case / have a prototype to test this.
> 
> In the meantime, Prakash should have a test bench that he used to
> test his early RFC
> https://lore.kernel.org/lkml/20241113000126.967713-1-prakash.sangappa@oracle.com/
> 

(Have been AFK, and will be for a few more days)

The above was with a database workload. Will coordinate with our database team to get it tested 
with the updated API from this patch series.

Thanks,
-Prakash

> -- 
> Thanks and Regards,
> Prateek
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (13 preceding siblings ...)
  2025-09-10 11:28 ` K Prateek Nayak
@ 2025-09-11 15:27 ` Mathieu Desnoyers
  2025-09-11 20:18   ` Thomas Gleixner
  14 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:27 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On 2025-09-08 18:59, Thomas Gleixner wrote:
> This is the proper implementation of the PoC code, which I posted in reply
> to the latest iteration of Prakash's time slice extension patches:
> 
>       https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
> 
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
> 
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out, while holding a resource on which
> the preempting thread or other threads in the system might block on. That
> obviously prevents those threads from making progress in the worst case for
> at least a full time slice. Especially in the context of user space
> spinlocks, which are a patently bad idea to begin with, but that's also
> true for other mechanisms.
> 
> This has been attempted to solve at least for a decade, but so far this
> went nowhere.  The recent attempts, which started to integrate with the
> already existing RSEQ mechanism, have been at least going into the right
> direction. The full history is partially in the above mentioned mail thread
> and it's ancestors, but also in various threads in the LKML archives, which

it's -> its

> require archaeological efforts to retrieve.
> 
> When trying to morph the PoC into actual mergeable code, I stumbled over
> various shortcomings in the RSEQ code, which have been addressed in a
> separate effort. The latest iteration can be found here:
> 
>       https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
> 
> That is a prerequisite for this series as it allows a tight integration
> into the RSEQ code without inflicting a lot of extra overhead into the hot
> paths.
> 
> The main change vs. the PoC and the previous attempts is that it utilizes a
> new field in the user space ABI rseq struct, which allows to reduce the
> atomic operations in user space to a bare minimum. If the architecture
> supports CPU local atomics, which protect against the obvious RMW race
> vs. an interrupt, then there is no actual overhead, e.g. LOCK prefix on
> x86, required.

Good!

> 
> The kernel user space ABI consists only of two bits in this new field:
> 
> 	REQUEST and GRANTED
> 
> User space sets REQUEST at the begin of the critical section. If it

beginning

> finishes the critical section without interruption then it can clear the
> bit and move on.
> 
> If it is interrupted and the interrupt return path in the kernel observes a
> rescheduling request, then the kernel can grant a time slice extension. The
> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
> non-atomic store operation. If it does not grant the extension only the
> REQUEST bit is cleared.
> 
> If user space observes the REQUEST bit cleared, when it finished the
> critical section, then it has to check the GRANTED bit. If that is set,
> then it has to invoke the rseq_slice_yield() syscall to terminate the

Does it "have" to ? What is the consequence of misbehaving ?

> extension and yield the CPU.
> 
> The code flow in user space is:
> 
>     	  // Simple store as there is no concurrency vs. the GRANTED bit
>        	  rseq->slice_ctrl = REQUEST;
> 
> 	  critical_section();
> 
> 	  // CPU local atomic required here:
> 	  if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> 	     	// Non-atomic check is sufficient as this can race
> 		// against an interrupt, which revokes the grant
> 		//
> 		// If not set, then the request was either cleared by the kernel
> 		// without grant or the grant was revoked.
> 		//
> 		// If set, tell the kernel that the critical section is done
> 		// so it can reschedule
> 	  	if (rseq->slice_ctrl & GRANTED)
> 			rseq_slice_yield();

I wonder if we could achieve this without the cpu-local atomic, and
just rely on simple relaxed-atomic or volatile loads/stores and compiler
barriers in userspace. Let's say we have:

union {
	u16 slice_ctrl;
	struct {
		u8 rseq->slice_request;
		u8 rseq->slice_grant;
	};
};

With userspace doing:

rseq->slice_request = true;  /* WRITE_ONCE() */
barrier();
critical_section();
barrier();
rseq->slice_request = false; /* WRITE_ONCE() */
if (rseq->slice_grant)       /* READ_ONCE() */
   rseq_slice_yield();

In the kernel interrupt return path, if the kernel observes
"rseq->slice_request" set and "rseq->slice_grant" cleared,
it grants the extension and sets "rseq->slice_grant".

rseq_slice_yield() clears rseq->slice_grant.
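
A slightly more complete sketch of that user space side, for
illustration only (the helper, the field names and the thread local
rseq_slice pointer are assumptions, not an agreed ABI):

	#define barrier()	__asm__ __volatile__("" ::: "memory")

	struct rseq_slice_sketch {
		/* volatile as a stand-in for WRITE_ONCE()/READ_ONCE() */
		volatile unsigned char	slice_request;
		volatile unsigned char	slice_grant;
	};

	extern struct rseq_slice_sketch	*rseq_slice;	/* registered rseq area */
	extern long rseq_slice_yield(void);		/* syscall wrapper */

	static void run_protected(void (*critical_section)(void))
	{
		/* Plain store: only this CPU's interrupt path looks at it */
		rseq_slice->slice_request = 1;
		barrier();			/* keep the section inside */
		critical_section();
		barrier();
		rseq_slice->slice_request = 0;
		/* A grant can only come from this CPU's return-to-user path */
		if (rseq_slice->slice_grant)
			rseq_slice_yield();	/* relinquish the extension */
	}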


> 	  }
> 
> The other details, which differ from earlier attempts and the PoC, are:
> 
>      - A separate syscall for terminating the extension to avoid side
>        effects and overloading of the already ill defined sched_yield(2)
> 
>      - A separate per CPU timer, which again does not inflict side effects
>        on the scheduler internal hrtick timer. The hrtick timer can be
>        disabled at run-time and an expiry can cause interesting problems in
>        the scheduler code when it is unexpectedly invoked.
> 
>      - Tight integration into the rseq exit to user mode code. It utilizes
>        the path when TIF_RESQ is not set at the end of exit_to_user_mode()

TIF_RSEQ

>        to arm the timer if an extension was granted. TIF_RSEQ indicates that
>        the task was scheduled and therefore would revoke the grant anyway.
> 
>      - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
>        model which is utilized by PREEMPT_RT.

Can you clarify why this attempt is "futile" ?

Thanks,

Mathieu

> 
>        It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
>        but not TIF_PREEMPT.
> 
>        Pretending that this can be made work for TIF_PREEMPT on a fully
>        preemptible kernel is just wishful thinking as the chance that
>        TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
>        obvious reasons.
> 
>        This only "works" by some definition of works, i.e. on a best effort
>        basis, for the PREEMPT_NONE model and nothing else. Though given the
>        problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
>        running code sections, the days of these models should be hopefully
>        numbered and everything consolidated on the LAZY model.
> 
>        That makes this distinction moot and everything restricted to
>        TIF_PREEMPT_LAZY unless someone is crazy enough to inflict the slice
>        extension mechanism into the scheduler hotpath. I'm sure there will
>        be attempts to do that as there is no lack of crazy folks out
>        there...
> 
>      - Actual documentation of the user space ABI and a initial self test.
> 
> The RSEQ modifications on which this series is based can be found here:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
> 
> For your convenience all of it is also available as a conglomerate from
> git:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
> 
> Thanks,
> 
> 	tglx
> ---
>   Documentation/userspace-api/index.rst       |    1
>   Documentation/userspace-api/rseq.rst        |  129 ++++++++++++
>   arch/alpha/kernel/syscalls/syscall.tbl      |    1
>   arch/arm/tools/syscall.tbl                  |    1
>   arch/arm64/tools/syscall_32.tbl             |    1
>   arch/m68k/kernel/syscalls/syscall.tbl       |    1
>   arch/microblaze/kernel/syscalls/syscall.tbl |    1
>   arch/mips/kernel/syscalls/syscall_n32.tbl   |    1
>   arch/mips/kernel/syscalls/syscall_n64.tbl   |    1
>   arch/mips/kernel/syscalls/syscall_o32.tbl   |    1
>   arch/parisc/kernel/syscalls/syscall.tbl     |    1
>   arch/powerpc/kernel/syscalls/syscall.tbl    |    1
>   arch/s390/kernel/syscalls/syscall.tbl       |    1
>   arch/s390/mm/pfault.c                       |    3
>   arch/sh/kernel/syscalls/syscall.tbl         |    1
>   arch/sparc/kernel/syscalls/syscall.tbl      |    1
>   arch/x86/entry/syscalls/syscall_32.tbl      |    1
>   arch/x86/entry/syscalls/syscall_64.tbl      |    1
>   arch/xtensa/kernel/syscalls/syscall.tbl     |    1
>   include/linux/entry-common.h                |    2
>   include/linux/rseq.h                        |   11 +
>   include/linux/rseq_entry.h                  |  176 ++++++++++++++++
>   include/linux/rseq_types.h                  |   28 ++
>   include/linux/sched.h                       |    7
>   include/linux/syscalls.h                    |    1
>   include/linux/thread_info.h                 |   16 -
>   include/uapi/asm-generic/unistd.h           |    5
>   include/uapi/linux/prctl.h                  |   10
>   include/uapi/linux/rseq.h                   |   28 ++
>   init/Kconfig                                |   12 +
>   kernel/entry/common.c                       |   14 +
>   kernel/entry/syscall-common.c               |   11 -
>   kernel/rcu/tiny.c                           |    8
>   kernel/rcu/tree.c                           |   14 -
>   kernel/rcu/tree_exp.h                       |    3
>   kernel/rcu/tree_plugin.h                    |    9
>   kernel/rcu/tree_stall.h                     |    3
>   kernel/rseq.c                               |  293 ++++++++++++++++++++++++++++
>   kernel/sys.c                                |    6
>   kernel/sys_ni.c                             |    1
>   scripts/syscall.tbl                         |    1
>   tools/testing/selftests/rseq/.gitignore     |    1
>   tools/testing/selftests/rseq/Makefile       |    5
>   tools/testing/selftests/rseq/rseq-abi.h     |    2
>   tools/testing/selftests/rseq/slice_test.c   |  217 ++++++++++++++++++++
>   45 files changed, 991 insertions(+), 42 deletions(-)
> 
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
  2025-09-09  0:04   ` Randy Dunlap
@ 2025-09-11 15:41   ` Mathieu Desnoyers
  2025-09-11 15:49     ` Mathieu Desnoyers
  2025-09-22  5:28   ` Prakash Sangappa
  2 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:41 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch, Michael Jeanson

On 2025-09-08 18:59, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
> 
>     - Two flag bits for the rseq user space ABI, which allow user space to
>       query the availability and enablement without a syscall.
> 
>     - A new member to the user space ABI struct rseq, which is going to be
>       used to communicate request and grant between kernel and user space.
> 
>     - A rseq state struct to hold the kernel state of this
> 
>     - Documentation of the new mechanism
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>   Documentation/userspace-api/index.rst |    1
>   Documentation/userspace-api/rseq.rst  |  129 ++++++++++++++++++++++++++++++++++
>   include/linux/rseq_types.h            |   26 ++++++
>   include/uapi/linux/rseq.h             |   28 +++++++
>   init/Kconfig                          |   12 +++
>   kernel/rseq.c                         |    8 ++
>   6 files changed, 204 insertions(+)
> 
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -21,6 +21,7 @@ System calls
>      ebpf/index
>      ioctl/index
>      mseal
> +   rseq
>   
>   Security-related interfaces
>   ===========================
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:
> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space

Also reading the "concurrency ID" (mm_cid).

> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
> +ABI is unfortunately only available in the code and selftests.

Note that I've made a man page available here:

https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2

which describes the ABI.

> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(

At what level should we document this here ? Would it be OK to show examples
that rely on librseq helpers ?

> +
> +Scheduler time slice extensions
> +-------------------------------
> +

Note: I suspect we'll also want to add this section to the rseq(2) man page.

> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> +    * Enabled in Kconfig
> +
> +    * Enabled at boot time (default is enabled)
> +
> +    * A rseq user space pointer has been registered for the thread
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> +    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> +          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg4 and arg5 must be zero
> +ENOTSUPP  Functionality was disabled on the kernel command line
> +ENXIO	  Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> +  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
> +space and only for informational purposes.

Do those flags have a meaning within the struct rseq_cs @flags field as
well, or just within the struct rseq flags field ?

> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.
> +
> +If the request bit is still set when the leaving the critical section, user
> +space can clear it and continue.
> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()
> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from
> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> +    rseq->slice_ctrl = REQUEST;
> +    critical_section();
> +    if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> +        if (rseq->slice_ctrl & GRANTED)
> +                rseq_slice_yield();
> +    }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> +    if (rseq->slice_ctrl & GRANTED)
> +      -> Interrupt results in schedule and grant revocation
> +        rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.

See my cover letter comments about the algorithm above.

Thanks,

Mathieu

> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -71,12 +71,35 @@ struct rseq_ids {
>   };
>   
>   /**
> + * union rseq_slice_state - Status information for rseq time slice extension
> + * @state:	Compound to access the overall state
> + * @enabled:	Time slice extension is enabled for the task
> + * @granted:	Time slice extension was granted to the task
> + */
> +union rseq_slice_state {
> +	u16			state;
> +	struct {
> +		u8		enabled;
> +		u8		granted;
> +	};
> +};
> +
> +/**
> + * struct rseq_slice - Status information for rseq time slice extension
> + * @state:	Time slice extension state
> + */
> +struct rseq_slice {
> +	union rseq_slice_state	state;
> +};
> +
> +/**
>    * struct rseq_data - Storage for all rseq related data
>    * @usrptr:	Pointer to the registered user space RSEQ memory
>    * @len:	Length of the RSEQ region
>    * @sig:	Signature of critial section abort IPs
>    * @event:	Storage for event management
>    * @ids:	Storage for cached CPU ID and MM CID
> + * @slice:	Storage for time slice extension data
>    */
>   struct rseq_data {
>   	struct rseq __user		*usrptr;
> @@ -84,6 +107,9 @@ struct rseq_data {
>   	u32				sig;
>   	struct rseq_event		event;
>   	struct rseq_ids			ids;
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +	struct rseq_slice		slice;
> +#endif
>   };
>   
>   #else /* CONFIG_RSEQ */
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
>   };
>   
>   enum rseq_cs_flags_bit {
> +	/* Historical and unsupported bits */
>   	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
>   	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
>   	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
> +	/* (3) Intentional gap to put new bits into a seperate byte */
> +
> +	/* User read only feature flags */
> +	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
> +	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
>   };
>   
>   enum rseq_cs_flags {
> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
>   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
>   	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
>   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
> +
> +	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	=
> +		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
> +	RSEQ_CS_FLAG_SLICE_EXT_ENABLED		=
> +		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
> +};
> +
> +enum rseq_slice_bits {
> +	/* Time slice extension ABI bits */
> +	RSEQ_SLICE_EXT_REQUEST_BIT		= 0,
> +	RSEQ_SLICE_EXT_GRANTED_BIT		= 1,
> +};
> +
> +enum rseq_slice_masks {
> +	RSEQ_SLICE_EXT_REQUEST	= (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> +	RSEQ_SLICE_EXT_GRANTED	= (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>   };
>   
>   /*
> @@ -142,6 +164,12 @@ struct rseq {
>   	__u32 mm_cid;
>   
>   	/*
> +	 * Time slice extension control word. CPU local atomic updates from
> +	 * kernel and user space.
> +	 */
> +	__u32 slice_ctrl;
> +
> +	/*
>   	 * Flexible array member at end of structure, after last feature field.
>   	 */
>   	char end[];
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>   
>   	  If unsure, say N.
>   
> +config RSEQ_SLICE_EXTENSION
> +	bool "Enable rseq based time slice extension mechanism"
> +	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> +	help
> +          Allows userspace to request a limited time slice extension when
> +	  returning from an interrupt to user space via the RSEQ shared
> +	  data ABI. If granted, that allows to complete a critical section,
> +	  so that other threads are not stuck on a conflicted resource,
> +	  while the task is scheduled out.
> +
> +	  If unsure, say N.
> +
>   config DEBUG_RSEQ
>   	default n
>   	bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
>    */
>   SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
>   {
> +	u32 rseqfl = 0;
> +
>   	if (flags & RSEQ_FLAG_UNREGISTER) {
>   		if (flags & ~RSEQ_FLAG_UNREGISTER)
>   			return -EINVAL;
> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>   		return -EFAULT;
>   
> +	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> +		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> +
> +	if (put_user_masked_u32(rseqfl, &rseq->flags))
> +		return -EFAULT;
> +
>   	/*
>   	 * Activate the registration by setting the rseq area address, length
>   	 * and signature in the task struct.
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 03/12] rseq: Provide static branch for time slice extensions
  2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
  2025-09-09  3:10   ` K Prateek Nayak
@ 2025-09-11 15:42   ` Mathieu Desnoyers
  1 sibling, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:42 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On 2025-09-08 18:59, Thomas Gleixner wrote:
> Guard the time slice extension functionality with a static key, which can
> be disabled on the kernel command line.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>   include/linux/rseq_entry.h |   11 +++++++++++
>   kernel/rseq.c              |   17 +++++++++++++++++
>   2 files changed, 28 insertions(+)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -77,6 +77,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
>   #define rseq_inline __always_inline
>   #endif
>   
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static __always_inline bool rseq_slice_extension_enabled(void)
> +{
> +	return static_branch_likely(&rseq_slice_extension_key);
> +}
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline bool rseq_slice_extension_enabled(void) { return false; }
> +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
> +
>   bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
>   bool rseq_debug_validate_ids(struct task_struct *t);
>   
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -474,3 +474,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   
>   	return 0;
>   }
> +
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static int __init rseq_slice_cmdline(char *str)
> +{
> +	bool on;
> +
> +	if (kstrtobool(str, &on))
> +		return -EINVAL;
> +
> +	if (!on)
> +		static_branch_disable(&rseq_slice_extension_key);
> +	return 0;

as pointed out elsewhere, this should be return 1.

Other than that:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> +}
> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 04/12] rseq: Add statistics for time slice extensions
  2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-09-11 15:43   ` Mathieu Desnoyers
  0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:43 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On 2025-09-08 18:59, Thomas Gleixner wrote:
> Extend the quick statistics with time slice specific fields.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   include/linux/rseq_entry.h |    4 ++++
>   kernel/rseq.c              |   12 ++++++++++++
>   2 files changed, 16 insertions(+)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -15,6 +15,10 @@ struct rseq_stats {
>   	unsigned long	cs;
>   	unsigned long	clear;
>   	unsigned long	fixup;
> +	unsigned long	s_granted;
> +	unsigned long	s_expired;
> +	unsigned long	s_revoked;
> +	unsigned long	s_yielded;
>   };
>   
>   DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi
>   		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
>   		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
>   		stats.fixup	+= data_race(per_cpu(rseq_stats.fixup, cpu));
> +		if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> +			stats.s_granted	+= data_race(per_cpu(rseq_stats.s_granted, cpu));
> +			stats.s_expired	+= data_race(per_cpu(rseq_stats.s_expired, cpu));
> +			stats.s_revoked	+= data_race(per_cpu(rseq_stats.s_revoked, cpu));
> +			stats.s_yielded	+= data_race(per_cpu(rseq_stats.s_yielded, cpu));
> +		}
>   	}
>   
>   	seq_printf(m, "exit:   %16lu\n", stats.exit);
> @@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi
>   	seq_printf(m, "cs:     %16lu\n", stats.cs);
>   	seq_printf(m, "clear:  %16lu\n", stats.clear);
>   	seq_printf(m, "fixup:  %16lu\n", stats.fixup);
> +	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> +		seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
> +		seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
> +		seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
> +		seq_printf(m, "syield: %16lu\n", stats.s_yielded);
> +	}
>   	return 0;
>   }
>   
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-11 15:41   ` Mathieu Desnoyers
@ 2025-09-11 15:49     ` Mathieu Desnoyers
  0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:49 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch, Michael Jeanson

On 2025-09-11 11:41, Mathieu Desnoyers wrote:
> On 2025-09-08 18:59, Thomas Gleixner wrote:
[...]

> 
>> +
>> +The kernel enforces flag consistency and terminates the thread with 
>> SIGSEGV
>> +if it detects a violation.
>> --- a/include/linux/rseq_types.h
>> +++ b/include/linux/rseq_types.h
>> @@ -71,12 +71,35 @@ struct rseq_ids {
>>   };
>>   /**
>> + * union rseq_slice_state - Status information for rseq time slice 
>> extension
>> + * @state:    Compound to access the overall state
>> + * @enabled:    Time slice extension is enabled for the task
>> + * @granted:    Time slice extension was granted to the task
>> + */
>> +union rseq_slice_state {
>> +    u16            state;
>> +    struct {
>> +        u8        enabled;
>> +        u8        granted;
>> +    };
>> +};
>> +
>> +/**
>> + * struct rseq_slice - Status information for rseq time slice extension
>> + * @state:    Time slice extension state
>> + */
>> +struct rseq_slice {
>> +    union rseq_slice_state    state;
>> +};
>> +
>> +/**
>>    * struct rseq_data - Storage for all rseq related data
>>    * @usrptr:    Pointer to the registered user space RSEQ memory
>>    * @len:    Length of the RSEQ region
>>    * @sig:    Signature of critial section abort IPs
>>    * @event:    Storage for event management
>>    * @ids:    Storage for cached CPU ID and MM CID
>> + * @slice:    Storage for time slice extension data
>>    */
>>   struct rseq_data {
>>       struct rseq __user        *usrptr;
>> @@ -84,6 +107,9 @@ struct rseq_data {
>>       u32                sig;
>>       struct rseq_event        event;
>>       struct rseq_ids            ids;
>> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
>> +    struct rseq_slice        slice;
>> +#endif

Note: we could move this #ifdef to surround the definition
of both union rseq_slice_state and struct rseq_slice,
and emit an empty structure in the #else case rather than
do the ifdef here.
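
Roughly like this, as a sketch of that alternative (illustration only,
not the posted patch):

	#ifdef CONFIG_RSEQ_SLICE_EXTENSION
	union rseq_slice_state {
		u16			state;
		struct {
			u8		enabled;
			u8		granted;
		};
	};

	struct rseq_slice {
		union rseq_slice_state	state;
	};
	#else /* CONFIG_RSEQ_SLICE_EXTENSION */
	/* Empty placeholder so struct rseq_data needs no #ifdef */
	struct rseq_slice { };
	#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */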

Thanks,

Mathieu

>>   };
>>   #else /* CONFIG_RSEQ */
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -23,9 +23,15 @@ enum rseq_flags {
>>   };
>>   enum rseq_cs_flags_bit {
>> +    /* Historical and unsupported bits */
>>       RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT    = 0,
>>       RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT    = 1,
>>       RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT    = 2,
>> +    /* (3) Intentional gap to put new bits into a seperate byte */
>> +
>> +    /* User read only feature flags */
>> +    RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT    = 4,
>> +    RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT    = 5,
>>   };
>>   enum rseq_cs_flags {
>> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
>>           (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
>>       RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE    =
>>           (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>> +
>> +    RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE    =
>> +        (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
>> +    RSEQ_CS_FLAG_SLICE_EXT_ENABLED        =
>> +        (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
>> +};
>> +
>> +enum rseq_slice_bits {
>> +    /* Time slice extension ABI bits */
>> +    RSEQ_SLICE_EXT_REQUEST_BIT        = 0,
>> +    RSEQ_SLICE_EXT_GRANTED_BIT        = 1,
>> +};
>> +
>> +enum rseq_slice_masks {
>> +    RSEQ_SLICE_EXT_REQUEST    = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>> +    RSEQ_SLICE_EXT_GRANTED    = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>>   };
>>   /*
>> @@ -142,6 +164,12 @@ struct rseq {
>>       __u32 mm_cid;
>>       /*
>> +     * Time slice extension control word. CPU local atomic updates from
>> +     * kernel and user space.
>> +     */
>> +    __u32 slice_ctrl;
>> +
>> +    /*
>>        * Flexible array member at end of structure, after last feature 
>> field.
>>        */
>>       char end[];
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>>         If unsure, say N.
>> +config RSEQ_SLICE_EXTENSION
>> +    bool "Enable rseq based time slice extension mechanism"
>> +    depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && 
>> HAVE_GENERIC_TIF_BITS
>> +    help
>> +          Allows userspace to request a limited time slice extension 
>> when
>> +      returning from an interrupt to user space via the RSEQ shared
>> +      data ABI. If granted, that allows to complete a critical section,
>> +      so that other threads are not stuck on a conflicted resource,
>> +      while the task is scheduled out.
>> +
>> +      If unsure, say N.
>> +
>>   config DEBUG_RSEQ
>>       default n
>>       bool "Enable debugging of rseq() system call" if EXPERT
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
>>    */
>>   SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, 
>> int, flags, u32, sig)
>>   {
>> +    u32 rseqfl = 0;
>> +
>>       if (flags & RSEQ_FLAG_UNREGISTER) {
>>           if (flags & ~RSEQ_FLAG_UNREGISTER)
>>               return -EINVAL;
>> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>>       if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>>           return -EFAULT;
>> +    if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
>> +        rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> +
>> +    if (put_user_masked_u32(rseqfl, &rseq->flags))
>> +        return -EFAULT;
>> +
>>       /*
>>        * Activate the registration by setting the rseq area address, 
>> length
>>        * and signature in the task struct.
>>
> 
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 05/12] rseq: Add prctl() to enable time slice extensions
  2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-09-11 15:50   ` Mathieu Desnoyers
  2025-09-11 16:52     ` K Prateek Nayak
  0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 15:50 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On 2025-09-08 18:59, Thomas Gleixner wrote:
> Implement a prctl() so that tasks can enable the time slice extension
> mechanism. This fails, when time slice extensions are disabled at compile
> time or on the kernel command line and when no rseq pointer is registered
> in the kernel.
> 
> That allows to implement a single trivial check in the exit to user mode
> hotpath, to decide whether the whole mechanism needs to be invoked.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>   include/linux/rseq.h       |    9 +++++++
>   include/uapi/linux/prctl.h |   10 ++++++++
>   kernel/rseq.c              |   52 +++++++++++++++++++++++++++++++++++++++++++++
>   kernel/sys.c               |    6 +++++
>   4 files changed, 77 insertions(+)
> 
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -190,4 +190,13 @@ void rseq_syscall(struct pt_regs *regs);
>   static inline void rseq_syscall(struct pt_regs *regs) { }
>   #endif /* !CONFIG_DEBUG_RSEQ */
>   
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
> +{
> +	return -EINVAL;
> +}
> +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
> +
>   #endif /* _LINUX_RSEQ_H */
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -376,4 +376,14 @@ struct prctl_mm_map {
>   # define PR_FUTEX_HASH_SET_SLOTS	1
>   # define PR_FUTEX_HASH_GET_SLOTS	2
>   
> +/* RSEQ time slice extensions */
> +#define PR_RSEQ_SLICE_EXTENSION			79
> +# define PR_RSEQ_SLICE_EXTENSION_GET		1
> +# define PR_RSEQ_SLICE_EXTENSION_SET		2
> +/*
> + * Bits for RSEQ_SLICE_EXTENSION_GET/SET
> + * PR_RSEQ_SLICE_EXT_ENABLE:	Enable
> + */
> +# define PR_RSEQ_SLICE_EXT_ENABLE		0x01
> +
>   #endif /* _LINUX_PRCTL_H */
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -71,6 +71,7 @@
>   #define RSEQ_BUILD_SLOW_PATH
>   
>   #include <linux/debugfs.h>
> +#include <linux/prctl.h>
>   #include <linux/ratelimit.h>
>   #include <linux/rseq_entry.h>
>   #include <linux/sched.h>
> @@ -490,6 +491,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   #ifdef CONFIG_RSEQ_SLICE_EXTENSION
>   DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>   
> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
> +{
> +	switch (arg2) {
> +	case PR_RSEQ_SLICE_EXTENSION_GET:
> +		if (arg3)
> +			return -EINVAL;
> +		return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
> +
> +	case PR_RSEQ_SLICE_EXTENSION_SET: {
> +		u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> +		bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
> +
> +		if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
> +			return -EINVAL;
> +		if (!rseq_slice_extension_enabled())
> +			return -ENOTSUPP;
> +		if (!current->rseq.usrptr)
> +			return -ENXIO;
> +
> +		/* No change? */
> +		if (enable == !!current->rseq.slice.state.enabled)
> +			return 0;
> +
> +		if (get_user(rflags, &current->rseq.usrptr->flags))
> +			goto die;
> +
> +		if (current->rseq.slice.state.enabled)
> +			valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
> +
> +		if ((rflags & valid) != valid)
> +			goto die;
> +
> +		rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
> +		rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> +		if (enable)
> +			rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
> +
> +		if (put_user(rflags, &current->rseq.usrptr->flags))
> +			goto die;
> +
> +		current->rseq.slice.state.enabled = enable;

What should happen to this enabled state if rseq is unregistered
after this prctl ?

Thanks,

Mathieu

> +		return 0;
> +	}
> +	default:
> +		return -EINVAL;
> +	}
> +die:
> +	force_sig(SIGSEGV);
> +	return -EFAULT;
> +}
> +
>   static int __init rseq_slice_cmdline(char *str)
>   {
>   	bool on;
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -53,6 +53,7 @@
>   #include <linux/time_namespace.h>
>   #include <linux/binfmts.h>
>   #include <linux/futex.h>
> +#include <linux/rseq.h>
>   
>   #include <linux/sched.h>
>   #include <linux/sched/autogroup.h>
> @@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
>   	case PR_FUTEX_HASH:
>   		error = futex_hash_prctl(arg2, arg3, arg4);
>   		break;
> +	case PR_RSEQ_SLICE_EXTENSION:
> +		if (arg4 || arg5)
> +			return -EINVAL;
> +		error = rseq_slice_extension_prctl(arg2, arg3);
> +		break;
>   	default:
>   		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
>   		error = -EINVAL;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 05/12] rseq: Add prctl() to enable time slice extensions
  2025-09-11 15:50   ` Mathieu Desnoyers
@ 2025-09-11 16:52     ` K Prateek Nayak
  2025-09-11 17:18       ` Mathieu Desnoyers
  0 siblings, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-11 16:52 UTC (permalink / raw)
  To: Mathieu Desnoyers, Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Hello Mathieu,

On 9/11/2025 9:20 PM, Mathieu Desnoyers wrote:
>>   +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
>> +{
>> +    switch (arg2) {
>> +    case PR_RSEQ_SLICE_EXTENSION_GET:
>> +        if (arg3)
>> +            return -EINVAL;
>> +        return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
>> +
>> +    case PR_RSEQ_SLICE_EXTENSION_SET: {
>> +        u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> +        bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
>> +
>> +        if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
>> +            return -EINVAL;
>> +        if (!rseq_slice_extension_enabled())
>> +            return -ENOTSUPP;
>> +        if (!current->rseq.usrptr)
>> +            return -ENXIO;
>> +
>> +        /* No change? */
>> +        if (enable == !!current->rseq.slice.state.enabled)
>> +            return 0;
>> +
>> +        if (get_user(rflags, &current->rseq.usrptr->flags))
>> +            goto die;
>> +
>> +        if (current->rseq.slice.state.enabled)
>> +            valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>> +
>> +        if ((rflags & valid) != valid)
>> +            goto die;
>> +
>> +        rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>> +        rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> +        if (enable)
>> +            rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>> +
>> +        if (put_user(rflags, &current->rseq.usrptr->flags))
>> +            goto die;
>> +
>> +        current->rseq.slice.state.enabled = enable;
> 
> What should happen to this enabled state if rseq is unregistered
> after this prctl ?

Wouldn't rseq_reset() deal with it since it does a:

    memset(&t->rseq, 0, sizeof(t->rseq));

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 05/12] rseq: Add prctl() to enable time slice extensions
  2025-09-11 16:52     ` K Prateek Nayak
@ 2025-09-11 17:18       ` Mathieu Desnoyers
  0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-11 17:18 UTC (permalink / raw)
  To: K Prateek Nayak, Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

On 2025-09-11 12:52, K Prateek Nayak wrote:
> Hello Mathieu,
> 
> On 9/11/2025 9:20 PM, Mathieu Desnoyers wrote:
>>>    +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
>>> +{
>>> +    switch (arg2) {
>>> +    case PR_RSEQ_SLICE_EXTENSION_GET:
>>> +        if (arg3)
>>> +            return -EINVAL;
>>> +        return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
>>> +
>>> +    case PR_RSEQ_SLICE_EXTENSION_SET: {
>>> +        u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>>> +        bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
>>> +
>>> +        if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
>>> +            return -EINVAL;
>>> +        if (!rseq_slice_extension_enabled())
>>> +            return -ENOTSUPP;
>>> +        if (!current->rseq.usrptr)
>>> +            return -ENXIO;
>>> +
>>> +        /* No change? */
>>> +        if (enable == !!current->rseq.slice.state.enabled)
>>> +            return 0;
>>> +
>>> +        if (get_user(rflags, &current->rseq.usrptr->flags))
>>> +            goto die;
>>> +
>>> +        if (current->rseq.slice.state.enabled)
>>> +            valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>>> +
>>> +        if ((rflags & valid) != valid)
>>> +            goto die;
>>> +
>>> +        rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>>> +        rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>>> +        if (enable)
>>> +            rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
>>> +
>>> +        if (put_user(rflags, &current->rseq.usrptr->flags))
>>> +            goto die;
>>> +
>>> +        current->rseq.slice.state.enabled = enable;
>>
>> What should happen to this enabled state if rseq is unregistered
>> after this prctl ?
> 
> Wouldn't rseq_reset() deal with it since it does a:
> 
>      memset(&t->rseq, 0, sizeof(t->rseq));
> 

Good point, thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-11 15:27 ` Mathieu Desnoyers
@ 2025-09-11 20:18   ` Thomas Gleixner
  2025-09-12 12:33     ` Mathieu Desnoyers
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-11 20:18 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On Thu, Sep 11 2025 at 11:27, Mathieu Desnoyers wrote:
> On 2025-09-08 18:59, Thomas Gleixner wrote:
>> If it is interrupted and the interrupt return path in the kernel observes a
>> rescheduling request, then the kernel can grant a time slice extension. The
>> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
>> non-atomic store operation. If it does not grant the extension only the
>> REQUEST bit is cleared.
>> 
>> If user space observes the REQUEST bit cleared, when it finished the
>> critical section, then it has to check the GRANTED bit. If that is set,
>> then it has to invoke the rseq_slice_yield() syscall to terminate the
>
> Does it "have" to ? What is the consequence of misbehaving ?

It receives SIGSEGV because that means that it did not follow the rules
and stuck an arbitrary syscall into the critical section.

> I wonder if we could achieve this without the cpu-local atomic, and
> just rely on simple relaxed-atomic or volatile loads/stores and compiler
> barriers in userspace. Let's say we have:
>
> union {
> 	u16 slice_ctrl;
> 	struct {
> 		u8 rseq->slice_request;
> 		u8 rseq->slice_grant;

Interesting way to define a struct member :)

> 	};
> };
>
> With userspace doing:
>
> rseq->slice_request = true;  /* WRITE_ONCE() */
> barrier();
> critical_section();
> barrier();
> rseq->slice_request = false; /* WRITE_ONCE() */
> if (rseq->slice_grant)       /* READ_ONCE() */
>    rseq_slice_yield();

That should work as it's strictly CPU local. Good point, now that you
said it, it's obvious :)

Let me rework it accordingly.

> In the kernel interrupt return path, if the kernel observes
> "rseq->slice_request" set and "rseq->slice_grant" cleared,
> it grants the extension and sets "rseq->slice_grant".

They can't be both set. If they are then user space fiddled with the
bits.

>>      - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
>>        model which is utilized by PREEMPT_RT.
>
> Can you clarify why this attempt is "futile" ?

Because on RT, interrupts usually end up with TIF_PREEMPT set, either due
to softirqs or interrupt threads. And no, we don't want to
overcomplicate things right now to make it "work" for real-time tasks in
the first place, as that's just going to result in either endless
discussions or subtle latency problems or both. For now, allowing it for
the 'LAZY' case is good enough.

With the non-RT LAZY model that's not really a good idea either, because
when TIF_PREEMPT is set, then either the preempting task is in an RT
class or the to-be-preempted task has already overrun the LAZY granted
computation time and the scheduler sets TIF_PREEMPT to whack it over the
head.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-11 20:18   ` Thomas Gleixner
@ 2025-09-12 12:33     ` Mathieu Desnoyers
  2025-09-12 16:31       ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-12 12:33 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On 2025-09-11 16:18, Thomas Gleixner wrote:
> On Thu, Sep 11 2025 at 11:27, Mathieu Desnoyers wrote:
>> On 2025-09-08 18:59, Thomas Gleixner wrote:
[...]
>> Does it "have" to ? What is the consequence of misbehaving ?
> 
> It receives SIGSEGV because that means that it did not follow the rules
> and stuck an arbitrary syscall into the critical section.

Not following the rules could also be done by just looping for a long
time in userspace within or after the critical section, in which case
the timer should catch it.

> 
>> I wonder if we could achieve this without the cpu-local atomic, and
>> just rely on simple relaxed-atomic or volatile loads/stores and compiler
>> barriers in userspace. Let's say we have:
>>
>> union {
>> 	u16 slice_ctrl;
>> 	struct {
>> 		u8 rseq->slice_request;
>> 		u8 rseq->slice_grant;
> 
> Interesting way to define a struct member :)

This goes with the usual warning "this code has never even been
remotely close to a compiler, so handle with care" ;-)

> 
>> 	};
>> };
>>
>> With userspace doing:
>>
>> rseq->slice_request = true;  /* WRITE_ONCE() */
>> barrier();
>> critical_section();
>> barrier();
>> rseq->slice_request = false; /* WRITE_ONCE() */
>> if (rseq->slice_grant)       /* READ_ONCE() */
>>     rseq_slice_yield();
> 
> That should work as it's strictly CPU local. Good point, now that you
> said it it's obvious :)
> 
> Let me rework it accordingly.

I have two questions wrt ABI here:

1) Do we expect the slice requests to be done from C and higher level
    languages or only from assembly ?

2) Slice requests are a good fit for locking. Locking typically
    has nesting ability.

    We should consider making the slice request ABI a 8-bit
    or 16-bit nesting counter to allow nesting of its users.

3) Slice requests are also a good fit for rseq critical sections.
    Of course someone could explicitly increment/decrement the
    slice request counter before/after the rseq critical sections, but
    I think we could do better there and integrate this directly within
    the struct rseq_cs as a new critical section flag. Basically, a
    critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
    better name) set within its descriptor flags would behave as if
    the slice request counter is non-zero when preempted without
    requiring any extra instruction on the fast path. The only
    added overhead would be a check of the rseq->slice_grant flag
    when exiting the critical section to conditionally issue
    rseq_slice_yield().

    This point (3) is an optimization that could come as a future step
    if the overhead of incrementing the slice_request proves to be a
    bottleneck for rseq critical sections.

> 
>> In the kernel interrupt return path, if the kernel observes
>> "rseq->slice_request" set and "rseq->slice_grant" cleared,
>> it grants the extension and sets "rseq->slice_grant".
> 
> They can't be both set. If they are then user space fiddled with the
> bits.

Ah, yes, that's true if the kernel clears the slice_request when setting
the slice_grant.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-12 12:33     ` Mathieu Desnoyers
@ 2025-09-12 16:31       ` Thomas Gleixner
  2025-09-12 19:26         ` Mathieu Desnoyers
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-12 16:31 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch

On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
> On 2025-09-11 16:18, Thomas Gleixner wrote:
>> It receives SIGSEGV because that means that it did not follow the rules
>> and stuck an arbitrary syscall into the critical section.
>
> Not following the rules could also be done by just looping for a long
> time in userspace within or after the critical section, in which case
> the timer should catch it.

It's pretty much impossible for the kernel to tell, without more
overhead, whether that's actually a violation of the rules or not.

The operation after the grant can be interrupted (without resulting in
scheduling), which is out of control of the task which got the extension
granted.

The timer is there to ensure that there is an upper bound to the grant
independent of the actual reason.

Going through a different syscall is an obvious deviation from the rule.

As far as I understood the earlier discussions, scheduler folks want to
enforce that because of PREEMPT_NONE semantics, where a randomly chosen
syscall might not result in an immediate reschedule because the work
which needs to be done takes an arbitrary time to complete.

Though that's arguably not much different from

       syscall()
                -> tick -> NEED_RESCHED
        do_tons_of_work();
       exit_to_user()
          schedule();

except that in the slice extension case, the latency increases by the
slice extension time.

If we allow arbitrary syscalls to terminate the grant, then we need to
stick an immediate schedule() into the syscall entry work function. We'd
still need the separate yield() syscall to provide a side effect free
way of termination.
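
Conceptually something like this in the syscall entry work (pure
sketch; the helpers and the granted state field are made up and not
part of this series):

static void rseq_slice_syscall_entry_work(long syscall_nr)
{
        if (!current->rseq.slice.granted)
                return;

        /* Terminate the grant, i.e. clear the GRANTED bit in user space. */
        rseq_slice_revoke_grant(current);

        /* The dedicated yield syscall is the side effect free termination. */
        if (syscall_nr == __NR_rseq_slice_yield)
                return;

        /* Any other syscall ends the grant and schedules right away. */
        schedule();
}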

I have no strong opinions either way. Peter?

>>> rseq->slice_request = true;  /* WRITE_ONCE() */
>>> barrier();
>>> critical_section();
>>> barrier();
>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>> if (rseq->slice_grant)       /* READ_ONCE() */
>>>     rseq_slice_yield();
>> 
>> That should work as it's strictly CPU local. Good point, now that you
>> said it it's obvious :)
>> 
>> Let me rework it accordingly.
>
> I have two questions wrt ABI here:
>
> 1) Do we expect the slice requests to be done from C and higher level
>     languages or only from assembly ?

It doesn't matter as long as the ordering is guaranteed.

> 2) Slice requests are a good fit for locking. Locking typically
>     has nesting ability.
>
>     We should consider making the slice request ABI a 8-bit
>     or 16-bit nesting counter to allow nesting of its users.

Making request a counter requires keeping request set when the
extension is granted. So the states would be:

     request    granted
     0          0               Neutral
     >0         0               Requested
     >=0        1               Granted

That should work.

Though I'm not really convinced that unconditionally embedding it into
random locking primitives is the right thing to do.

The extension only makes sense when the actual critical section is
small and likely to complete within the extension time, which is usually
only true for highly optimized code and not for general usage, where the
lock held section is arbitrarily long and might even result in syscalls
even if the critical section itself does not have an obvious explicit
syscall embedded:

     lock(a)
        lock(b) <- Contention results in syscall

Same applies for library functions within a critical section.

That then immediately conflicts with the yield mechanism rules, because
the extension could have been granted _before_ the syscall happens, so
we'd have to remove that restriction too.

That said, we can make the ABI a counter and split the slice control
word into two u16. So the decision function would be:

     get_usr(ctrl);
     if (!ctrl.request)
     	return;
     ....
     ctrl.granted = 1;
     put_usr(ctrl);

Along with documentation why this should only be used nested when you
know what you are doing.
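
Spelled out a bit more (sketch, never compiled):

union rseq_slice_ctrl {
        u32     word;
        struct {
                u16     request;        /* nesting count, user space writable */
                u16     granted;        /* 0/1, kernel writable */
        };
};

static bool rseq_grant_slice(struct rseq __user *rs)
{
        union rseq_slice_ctrl ctrl;

        if (get_user(ctrl.word, &rs->slice_ctrl))
                return false;
        if (!ctrl.request || ctrl.granted)
                return false;
        ctrl.granted = 1;                       /* request count is left alone */
        return !put_user(ctrl.word, &rs->slice_ctrl);
}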

> 3) Slice requests are also a good fit for rseq critical sections.
>     Of course someone could explicitly increment/decrement the
>     slice request counter before/after the rseq critical sections, but
>     I think we could do better there and integrate this directly within
>     the struct rseq_cs as a new critical section flag. Basically, a
>     critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>     better name) set within its descriptor flags would behave as if
>     the slice request counter is non-zero when preempted without
>     requiring any extra instruction on the fast path. The only
>     added overhead would be a check of the rseq->slice_grant flag
>     when exiting the critical section to conditionally issue
>     rseq_slice_yield().

Plus checking first whether rseq->slice.request is actually zero,
i.e. whether the rseq critical section was the outermost one. If not,
you cannot invoke the yield even if granted is true, right?

But mixing state spaces is not really a good idea at all. Let's not go
there.

Also you'd make checking of rseq_cs unconditional, which means extra
work in the grant decision function as it would then have to do:

         if (!usr->slice.ctrl.request) {
            if (!usr->rseq_cs)
               return;
            if (!valid_ptr(usr->rseq_cs))
               goto die;
            if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
               return;
            if (!(usr->rseq_cs.flags & REQUEST))
               return;
         }

IOW, we'd copy half of the rseq cs handling into that code.

Can we please keep it independent and simple?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-12 16:31       ` Thomas Gleixner
@ 2025-09-12 19:26         ` Mathieu Desnoyers
  2025-09-13 13:02           ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-12 19:26 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch, Florian Weimer, carlos@redhat.com, libc-coord

[ For those just CC'd on this thread, the discussion is about time slice
   extension for userspace critical sections. We are specifically
   discussing the kernel ABI we plan to expose to userspace. ]

On 2025-09-12 12:31, Thomas Gleixner wrote:
> On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
>> On 2025-09-11 16:18, Thomas Gleixner wrote:
>>> It receives SIGSEGV because that means that it did not follow the rules
>>> and stuck an arbitrary syscall into the critical section.
>>
>> Not following the rules could also be done by just looping for a long
>> time in userspace within or after the critical section, in which case
>> the timer should catch it.
> 
> It's pretty much impossible to tell for the kernel without more
> overhead, whether that's actually a violation of the rules or not.
> 
> The operation after the grant can be interrupted (without resulting in
> scheduling), which is out of control of the task which got the extension
> granted.
> 
> The timer is there to ensure that there is an upper bound to the grant
> independent of the actual reason.

If the worst side-effect of this feature is that the slice extension
is not granted when users misbehave, IMHO this would increase the
likelihood of adoption compared to failure modes that end up killing the
offending processes.

> 
> Going through a different syscall is an obvious deviation from the rule.

AFAIU, the grant is cleared when a signal handler is delivered, which
makes it OK for signals to issue system calls even if they are nested
on top of a granted extension critical section.

> 
> As far I understood the earlier discussions, scheduler folks want to
> enforce that because of PREEMPT_NONE semantics, where a randomly chosen
> syscall might not result in an immediate reschedule because the work,
> which needs to be done takes arbitrary time to complete.
> 
> Though that's arguably not much different from
> 
>         syscall()
>                  -> tick -> NEED_RESCHED
>          do_tons_of_work();
>         exit_to_user()
>            schedule();
> 
> except that in the slice extension case, the latency increases by the
> slice extension time.
> 
> If we allow arbitrary syscalls to terminate the grant, then we need to
> stick an immediate schedule() into the syscall entry work function. We'd
> still need the separate yield() syscall to provide a side effect free
> way of termination.
> 
> I have no strong opinions either way. Peter?

If it happens to not be too bothersome to allow arbitrary system calls
to act as implicit rseq_slice_yield() rather than result in a
segmentation fault, I think it would make this feature more widely
adopted.

Another scenario I have in mind is a userspace critical section that
would typically benefit from slice extension, but seldomly requires
to issue a system call. In C and higher level languages, that could be
very much outside of the user control, such as accessing a
global-dynamic TLS variable located within a global-dynamic shared
object, which can trigger memory allocation under the hood on first
access.

Handling a syscall within a granted extension by killing the process
will likely reserve this feature to niche use-cases.

> 
>>>> rseq->slice_request = true;  /* WRITE_ONCE() */
>>>> barrier();
>>>> critical_section();
>>>> barrier();
>>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>>> if (rseq->slice_grant)       /* READ_ONCE() */
>>>>      rseq_slice_yield();
>>>
>>> That should work as it's strictly CPU local. Good point, now that you
>>> said it it's obvious :)
>>>
>>> Let me rework it accordingly.
>>
>> I have two questions wrt ABI here:
>>
>> 1) Do we expect the slice requests to be done from C and higher level
>>      languages or only from assembly ?
> 
> It doesn't matter as long as the ordering is guaranteed.

OK, so I understand that you intend to target higher level languages
as well, which makes my second point (nesting) relevant.

> 
>> 2) Slice requests are a good fit for locking. Locking typically
>>      has nesting ability.
>>
>>      We should consider making the slice request ABI a 8-bit
>>      or 16-bit nesting counter to allow nesting of its users.
> 
> Making request a counter requires to keep request set when the
> extension is granted. So the states would be:
> 
>       request    granted
>       0          0               Neutral
>       >0         0               Requested
>       >=0        1               Granted

Yes.

> 
> That should work.
> 
> Though I'm not really convinced that unconditionally embeddeding it into
> random locking primitives is the right thing to do.

Me neither. I wonder what would be a good approach to integrate this
with locking APIs. Here are a few ideas, some worse than others:

- Extend pthread_mutexattr_t to set whether the mutex should be
   slice-extended. Downside: if a mutex has some long and some
   short critical sections, it's really a one-size fits all decision
   for all critical sections for that mutex.

- Extend the pthread_mutex_lock/trylock with new APIs to allow
   specifying whether slice-extension is needed for the upcoming critical
   section.

- Just let the pthread_mutex_lock caller explicitly request the
   slice extension *after* grabbing the lock. Downside: this opens
   a window of a few instructions where preemption can happen
   and slice extension would have been useful. Should we care ?
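
For the last variant the caller side would be as simple as this
(sketch; the rseq_slice_*() helpers are placeholders, not an existing
API):

#include <pthread.h>

extern void rseq_slice_request(void);   /* set the REQUEST bit */
extern void rseq_slice_release(void);   /* clear REQUEST, yield if GRANTED */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void update_shared_state(void)
{
        pthread_mutex_lock(&lock);
        rseq_slice_request();           /* small preemption window before this point */
        /* ... short critical section ... */
        pthread_mutex_unlock(&lock);
        rseq_slice_release();           /* outside the lock, so a yield never happens while holding it */
}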

> 
> The extension makes only sense, when the actual critical section is
> small and likely to complete within the extension time, which is usually
> only true for highly optimized code and not for general usage, where the
> lock held section is arbitrary long and might even result in syscalls
> even if the critical section itself does not have an obvious explicit
> syscall embedded:
> 
>       lock(a)
>          lock(b) <- Contention results in syscall

Nested locking is another scenario where _typically_ we'd want the
slice extension for the outer lock if it is expected to be a short
critical section; we may sometimes hit futex while the extension is
granted, and in that case the grant should be cleared without killing
the process.

> 
> Same applies for library functions within a critical section.

Yes.

> 
> That then immediately conflicts with the yield mechanism rules, because
> the extension could have been granted _before_ the syscall happens, so
> we'd have remove that restriction too.

Yes.

> 
> That said, we can make the ABI a counter and split the slice control
> word into two u16. So the decision function would be:
> 
>       get_usr(ctrl);
>       if (!ctrl.request)
>       	return;
>       ....
>       ctrl.granted = 1;
>       put_usr(ctrl);
> 
> Along with documentation why this should only be used nested when you
> know what you are doing.

Yes.

This would turn the end of critical section into a
decrement-and-test-for-zero. It's only when the request counter
decrements back to zero that userspace should handle the granted
flag and yield.
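
On the user side that would look roughly like this (sketch, assuming a
16-bit request counter next to the granted flag; names are not final):

struct rseq_slice {
        unsigned short request;         /* nesting count, user space only */
        unsigned short granted;         /* 0/1, kernel only */
};

#define barrier()       __asm__ __volatile__("" ::: "memory")
extern void rseq_slice_yield(void);

static inline void slice_req_get(volatile struct rseq_slice *s)
{
        s->request++;                   /* outermost user goes 0 -> 1 */
        barrier();
}

static inline void slice_req_put(volatile struct rseq_slice *s)
{
        barrier();
        if (--s->request == 0 && s->granted)
                rseq_slice_yield();     /* only the outermost exit yields */
}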

> 
>> 3) Slice requests are also a good fit for rseq critical sections.
>>      Of course someone could explicitly increment/decrement the
>>      slice request counter before/after the rseq critical sections, but
>>      I think we could do better there and integrate this directly within
>>      the struct rseq_cs as a new critical section flag. Basically, a
>>      critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>>      better name) set within its descriptor flags would behave as if
>>      the slice request counter is non-zero when preempted without
>>      requiring any extra instruction on the fast path. The only
>>      added overhead would be a check of the rseq->slice_grant flag
>>      when exiting the critical section to conditionally issue
>>      rseq_slice_yield().
> 
> Plus checking first whether rseq->slice.request is actually zero,
> i.e. whether the rseq critical section was the outermost one. If not,
> you cannot invoke the yield even if granted is true, right?

Right.

> 
> But mixing state spaces is not really a good idea at all. Let's not go
> there.

I agree, let's keep this (3) for later if there is a strong use-case
justifying the complexity.

What is important for right now though is to figure out the behavior
with respect to an ongoing rseq critical section when a time slice
extension is granted: is the rseq critical section aborted or does
it keep going on return to userspace ?

> 
> Also you'd make checking of rseq_cs unconditional, which means extra
> work in the grant decision function as it would then have to do:
> 
>           if (!usr->slice.ctrl.request) {
>              if (!usr->rseq_cs)
>                 return;
>              if (!valid_ptr(usr->rseq_cs))
>                 goto die;
>              if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
>                 return;
>              if (!(use->rseq_cs.flags & REQUEST))
>                 return;
>           }
> 
> IOW, we'd copy half of the rseq cs handling into that code.
> 
> Can we please keep it independent and simple?

Of course.

So in summary, here is my current understanding:

- It would be good to support nested slice-extension requests,

- It would be preferable to allow arbitrary system calls to
   cancel an ongoing slice-extension grant rather than kill the
   process if we want the slice-extension feature to be useful
   outside of niche use-cases.

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> 
>          tglx


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-12 19:26         ` Mathieu Desnoyers
@ 2025-09-13 13:02           ` Thomas Gleixner
  2025-09-19 17:30             ` Prakash Sangappa
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2025-09-13 13:02 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Peter Zilstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch, Florian Weimer, carlos@redhat.com, libc-coord

On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>> 2) Slice requests are a good fit for locking. Locking typically
>>>      has nesting ability.
>>>
>>>      We should consider making the slice request ABI a 8-bit
>>>      or 16-bit nesting counter to allow nesting of its users.
>> 
>> Making request a counter requires to keep request set when the
>> extension is granted. So the states would be:
>> 
>>       request    granted
>>       0          0               Neutral
>>       >0         0               Requested
>>       >=0        1               Granted
>

Second thoughts on this.

Such a scheme means that slice_ctrl.request must be read-only for the
kernel, because otherwise the user space decrement would need to be an
atomic dec_if_not_zero(). We just argued the one atomic operation away. :)
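
For reference, that would be the usual CAS loop in user space (sketch),
i.e. exactly the kind of atomic which was just argued away:

#include <stdatomic.h>
#include <stdbool.h>

static bool slice_req_dec_if_not_zero(_Atomic unsigned short *request)
{
        unsigned short old = atomic_load_explicit(request, memory_order_relaxed);

        do {
                if (old == 0)
                        return false;   /* already cleared, nothing to decrement */
        } while (!atomic_compare_exchange_weak_explicit(request, &old, old - 1,
                                                        memory_order_relaxed,
                                                        memory_order_relaxed));
        return true;
}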

That means, the kernel can only set and clear Granted. That in turn
loses the information whether a slice extension was denied or revoked,
which was something the Oracle people wanted to have. I'm not sure
whether that was a functional or more a instrumentation feature.

But what's worse: this is a recipe for disaster, as it obviously creates
subtle and hard-to-debug ways to leak an increment, which means the
request would stay active forever, defeating the whole purpose.

And no, the kernel cannot keep track of the counter and observe whether
it became zero at some point or not. You surely could come up with a
convoluted scheme to work around that in form of sequence counters or
whatever, but that just creates extra complexity for a very dubious
value.

The point is that the time slice extension is just providing an
opportunistic priority ceiling mechanism with low overhead and without
guarantees.

Once a request is not granted or revoked, the performance of that
particular operation goes south no matter what. Nesting does not help
there at all, which is a strong argument for using KISS as the primary
engineering principle here.

The boolean request/granted pair is simple and very well
defined. It does not suffer from any of those problems.

If user space wants nesting, then it can do so on its own without
creating an ill defined and fragile kernel/user ABI. We created enough
of them in the past and all of them resulted in long term headaches.

> Handling syscall within granted extension by killing the process

I'm absolutely not opposed to lift the syscall restriction to make
things easier, but this is the wrong argument for it:

> will likely reserve this feature to the niche use-cases.

Having this used only by people who actually know what they are doing is
actually the preferred outcome.

We've seen over and over that supposedly "easy" features result in
mindless overutilization because everyone and his dog thinks they need
them, just because, and for the very wrong reasons. The unconditional
usage of the most power-hungry floating point extensions just because
they are available is only one example of many.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-13 13:02           ` Thomas Gleixner
@ 2025-09-19 17:30             ` Prakash Sangappa
  2025-09-22 14:09               ` Mathieu Desnoyers
  0 siblings, 1 reply; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-19 17:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, Peter Zilstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch@vger.kernel.org, Florian Weimer,
	carlos@redhat.com, libc-coord@lists.openwall.com



> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>     has nesting ability.
>>>> 
>>>>     We should consider making the slice request ABI a 8-bit
>>>>     or 16-bit nesting counter to allow nesting of its users.
>>> 
>>> Making request a counter requires to keep request set when the
>>> extension is granted. So the states would be:
>>> 
>>>      request    granted
>>>      0          0               Neutral
>>>      >0         0               Requested
>>>      >=0        1               Granted
>> 
> 
> Second thoughts on this.
> 
> Such a scheme means that slice_ctrl.request must be read only for the
> kernel because otherwise the user space decrement would need to be an
> atomic dec_if_not_zero(). We just argued the one atomic operation away. :)
> 
> That means, the kernel can only set and clear Granted. That in turn
> loses the information whether a slice extension was denied or revoked,
> which was something the Oracle people wanted to have. I'm not sure
> whether that was a functional or more a instrumentation feature.

The denied indication was mainly for instrumentation/observability, to see
if a user application would attempt to set 'REQUEST' again without yielding.

> 
> But what's worse: this is a receipe for disaster as it creates obviously
> subtle and hard to debug ways to leak an increment, which means the
> request would stay active forever defeating the whole purpose.
> 
> And no, the kernel cannot keep track of the counter and observe whether
> it became zero at some point or not. You surely could come up with a
> convoluted scheme to work around that in form of sequence counters or
> whatever, but that just creates extra complexity for a very dubious
> value.
> 
> The point is that the time slice extension is just providing an
> opportunistic priority ceiling mechanism with low overhead and without
> guarantees.
> 
> Once a request is not granted or revoked, the performance of that
> particular operation goes south no matter what. Nesting does not help
> there at all, which is a strong argument for using KISS as the primary
> engineering principle here.
> 
> The simple boolean request/granted pair is simple and very well
> defined. It does not suffer from any of those problems.

Agree, I think keeping the API simple will be preferable. The request/granted
sequence makes sense. 


> 
> If user space wants nesting, then it can do so on its own without
> creating an ill defined and fragile kernel/user ABI. We created enough
> of them in the past and all of them resulted in long term headaches.

I guess user space should be able to handle nesting, possibly without the need for a counter?

AFAICS, can't the nested request to extend the slice be handled by checking
if both the 'REQUEST' and 'GRANTED' bits are zero? If so, attempt to request a
slice extension. Otherwise, if either the REQUEST or GRANTED bit is set, then a slice
extension has already been requested or granted.

> 
>> Handling syscall within granted extension by killing the process
> 
> I'm absolutely not opposed to lift the syscall restriction to make
> things easier, but this is the wrong argument for it:

Killing the process seems drastic, and could deter use of this feature.
Can the consequence of making a system call be handled by calling schedule()
in the syscall entry path if an extension was granted, as you were implying?

Thanks
-Prakash

> 
>> will likely reserve this feature to the niche use-cases.
> 
> Having this used only by people who actually know what they are doing is
> actually the preferred outcome.
> 
> We've seen it over and over that supposedly "easy" features result in
> mindless overutilization because everyone and his dog thinks they need
> them just because and for the very wrong reasons. The unconditional
> usage of the most power hungry floating point extensions just because
> they are available, is only one example of many.
> 
> Thanks,
> 
>        tglx


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
  2025-09-09  0:04   ` Randy Dunlap
  2025-09-11 15:41   ` Mathieu Desnoyers
@ 2025-09-22  5:28   ` Prakash Sangappa
  2025-09-22  5:57     ` K Prateek Nayak
  2025-09-22 13:55     ` Mathieu Desnoyers
  2 siblings, 2 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-22  5:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch@vger.kernel.org



> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
..
> +enum rseq_slice_masks {
> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
> };
> 
> /*
> @@ -142,6 +164,12 @@ struct rseq {
> __u32 mm_cid;
> 
> /*
> + * Time slice extension control word. CPU local atomic updates from
> + * kernel and user space.
> + */
> + __u32 slice_ctrl;

We intend to backport the slice extension feature to older kernel versions.  

With the use of a new structure member for slice control, could there be a discrepancy
with the rseq structure size (older version) registered by libc?  In that case the application
may not be able to use the slice extension feature unless libc's use of rseq is disabled.

The application would have to verify the structure size, so should that be mentioned in the
documentation? Also, perhaps make the prctl() enable call return an error if the structure
size does not match?

With regards to the application determining the address and size of the rseq structure
registered by libc, what are your thoughts on getting that through the rseq(2)
system call or a prctl() call instead of dealing with the __weak symbols, as was discussed here:

https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/

Thanks,
-Prakash

> +
> + /*
> * Flexible array member at end of structure, after last feature field.
> */
> char end[];


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-22  5:28   ` Prakash Sangappa
@ 2025-09-22  5:57     ` K Prateek Nayak
  2025-09-22 13:57       ` Mathieu Desnoyers
  2025-09-22 13:55     ` Mathieu Desnoyers
  1 sibling, 1 reply; 54+ messages in thread
From: K Prateek Nayak @ 2025-09-22  5:57 UTC (permalink / raw)
  To: Prakash Sangappa, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch@vger.kernel.org

Hello Prakash,

On 9/22/2025 10:58 AM, Prakash Sangappa wrote:
> With use of a new structure member for slice control, could there be discrepancy 
> with rseq structure size(older version) registered by libc?  In that case the application 
> may  not be able to use slice extension feature unless Libc’s use of rseq is disabled.

In this case, wouldn't glibc's rseq registration fail if the presumed
__rseq_size is smaller than the "struct rseq" size?

And if it has allocated a large enough area, then the prctl() should
help to query the slice extension feature's availability.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-22  5:28   ` Prakash Sangappa
  2025-09-22  5:57     ` K Prateek Nayak
@ 2025-09-22 13:55     ` Mathieu Desnoyers
  2025-09-23  0:57       ` Prakash Sangappa
  1 sibling, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 13:55 UTC (permalink / raw)
  To: Prakash Sangappa, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch@vger.kernel.org, Michael Jeanson

On 2025-09-22 01:28, Prakash Sangappa wrote:
> 
> 
>> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
> ..
>> +enum rseq_slice_masks {
>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>> };
>>
>> /*
>> @@ -142,6 +164,12 @@ struct rseq {
>> __u32 mm_cid;
>>
>> /*
>> + * Time slice extension control word. CPU local atomic updates from
>> + * kernel and user space.
>> + */
>> + __u32 slice_ctrl;
> 
> We intend to backport the slice extension feature to older kernel versions.
> 
> With use of a new structure member for slice control, could there be discrepancy
> with rseq structure size(older version) registered by libc?  In that case the application
> may  not be able to use slice extension feature unless Libc’s use of rseq is disabled.

The rseq extension scheme allows this to seamlessly work.

You will need a glibc 2.41+, which uses the getauxval(3)
AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN to query the feature size
supported by the Linux kernel. It allocates a per-thread memory
area which is large enough to support that feature set, and
registers it to the kernel through rseq(2) on thread creation.

Note that before we had the extensible rseq scheme, glibc registered
a 32-byte structure (including padding at the end), which is considered
as the rseq "original" registration size.

The "mm_cid" field ends at 28 bytes, which leaves 4 bytes of padding at
the end of the original rseq structure. Considering that the time slice
extension fields will likely fit within those 4 bytes, I expect that
applications linked against glibc [2.35, 2.40] will also be able to use
those fields. Those applications should use getauxval(3)
AT_RSEQ_FEATURE_SIZE to validate whether the kernel populates this field
or if it's just padding.

Note that this all works even if you backport the feature to an older kernel:
the rseq extension scheme does not depend on querying the kernel version at
all. You will however be required to backport the support for additional
rseq fields that come before the time slice, such as node_id and mm_cid,
if they are not implemented in your older kernel.

> 
> Application would have to verify structure size, so should it be mentioned  in the
> documentation.

Yes, applications should check that the glibc's __rseq_size is large enough to fit
the new slice field(s), *and* for the original rseq size special case
(32 bytes including padding), those would need to query getauxval(3)
AT_RSEQ_FEATURE_SIZE to make sure the field is indeed supported.
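
Something along these lines (sketch; the 32-byte feature end offset of
the slice field is an assumption about the final layout):

#include <sys/auxv.h>
#include <stdbool.h>

#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE    27      /* from linux/auxvec.h */
#endif

extern unsigned int __rseq_size;        /* exported by glibc 2.35+ */

#define RSEQ_SLICE_FEATURE_END  32      /* assumed end offset of slice_ctrl */

static bool rseq_slice_field_usable(void)
{
        /* The kernel must actually populate the field ... */
        if (getauxval(AT_RSEQ_FEATURE_SIZE) < RSEQ_SLICE_FEATURE_END)
                return false;
        /* ... and the registered area must be large enough to hold it. */
        return __rseq_size >= RSEQ_SLICE_FEATURE_END;
}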

> Also, perhaps make the prctl() enable call return error, if structure size
> does not match?

That's not how the extensible scheme works.

Either glibc registers a 32-byte area (in which the time slice feature would
fit), or it registers an area large enough to fit all kernel supported features,
or it fails registration. And prctl() is per-process, whereas the rseq registration
is per-thread, so it's kind of weird to make prctl() fail if the current
thread's rseq is not registered.

> 
> With regards to application determining the address and size of rseq structure
> registered by libc, what are you thoughts on getting that thru the rseq(2)
> system call or a prctl() call instead of dealing with the __week symbols as was discussed here.
> 
> https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/

I think that the other leg of that email thread got to a resolution of both static and
dynamic use-cases through use of an extern __weak symbol, no [1] ? Not that I am against
adding a rseq(2) query for rseq address, size, and signature, but I just want to double
check that it would be there for convenience and is not actually needed in the typical
use-cases.

Thanks,

Mathieu

[1] https://lore.kernel.org/all/aKPFIQwg5zxSS5oS@google.com/

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-22  5:57     ` K Prateek Nayak
@ 2025-09-22 13:57       ` Mathieu Desnoyers
  0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 13:57 UTC (permalink / raw)
  To: K Prateek Nayak, Prakash Sangappa, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Madadi Vineeth Reddy, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch@vger.kernel.org

On 2025-09-22 01:57, K Prateek Nayak wrote:
> Hello Prakash,
> 
> On 9/22/2025 10:58 AM, Prakash Sangappa wrote:
>> With use of a new structure member for slice control, could there be discrepancy
>> with rseq structure size(older version) registered by libc?  In that case the application
>> may  not be able to use slice extension feature unless Libc’s use of rseq is disabled.
> 
> In this case, wouldn't GLIBC's rseq registration fail if presumed
> __rseq_size is smaller than the "struct rseq" size?

The registered rseq size cannot be smaller than 32 bytes, else
registration is refused by the system call (-EINVAL).

The new slice extension fields would fit within those 32 bytes,
so it should always work.

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-19 17:30             ` Prakash Sangappa
@ 2025-09-22 14:09               ` Mathieu Desnoyers
  2025-09-23  1:01                 ` Prakash Sangappa
  0 siblings, 1 reply; 54+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 14:09 UTC (permalink / raw)
  To: Prakash Sangappa, Thomas Gleixner
  Cc: LKML, Peter Zilstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
	linux-arch@vger.kernel.org, Florian Weimer, carlos@redhat.com,
	libc-coord@lists.openwall.com

On 2025-09-19 13:30, Prakash Sangappa wrote:
> 
> 
>> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>>      has nesting ability.
>>>>>
>>>>>      We should consider making the slice request ABI a 8-bit
>>>>>      or 16-bit nesting counter to allow nesting of its users.
>>>>
>>>> Making request a counter requires to keep request set when the
>>>> extension is granted. So the states would be:
>>>>
>>>>       request    granted
>>>>       0          0               Neutral
>>>>       >0         0               Requested
>>>>       >=0        1               Granted
>>>
>>
>> Second thoughts on this.
>>
[...]
> 
>>
>> If user space wants nesting, then it can do so on its own without
>> creating an ill defined and fragile kernel/user ABI. We created enough
>> of them in the past and all of them resulted in long term headaches.
> 
> Guess user space should be able to handle nesting, possibly without the need of a counter?
> 
> AFAICS can’t the nested request, to extend the slice, be handled by checking
> if both ‘REQUEST’ & ‘GRANTED’ bits are zero?  If so,  attempt to request
> slice extension.  Otherwise If either REQUEST or GRANTED bit Is set, then a slice
> extension has been already requested or granted.

I think you are onto something here. If we want independent pieces of
software (e.g. libc and application) to allow nesting of time slice
extension requests, without having to deal with a counter and the
inevitable imbalance bugs (leak and underflow), we could require
userspace to check the value of the request and granted flags. If both
are zero, then it can set the request.

Then when userspace exits its critical section, it needs to remember
whether it has set a request or not, so it does not clear a request
too early if the request was set by an outer context. This requires
handing over additional state (one bit) from "lock" to "unlock" though.
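
A minimal sketch of that on the user side (helper names and the owner
bit are purely illustrative):

#include <stdbool.h>

struct rseq_slice {                     /* hypothetical boolean pair */
        unsigned char request;
        unsigned char granted;
};

#define barrier()       __asm__ __volatile__("" ::: "memory")
extern void rseq_slice_yield(void);

/* Returns true if this context set the request and owns the cleanup. */
static inline bool slice_req_enter(volatile struct rseq_slice *s)
{
        if (s->request || s->granted)
                return false;           /* outer context already requested/granted */
        s->request = 1;
        barrier();
        return true;
}

static inline void slice_req_exit(volatile struct rseq_slice *s, bool owner)
{
        if (!owner)
                return;                 /* the outer context cleans up */
        barrier();
        s->request = 0;
        if (s->granted)
                rseq_slice_yield();
}

The bool returned by slice_req_enter() is the extra bit of state handed
from "lock" to "unlock".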

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 02/12] rseq: Add fields and constants for time slice extension
  2025-09-22 13:55     ` Mathieu Desnoyers
@ 2025-09-23  0:57       ` Prakash Sangappa
  0 siblings, 0 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-23  0:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch@vger.kernel.org, Michael Jeanson



> On Sep 22, 2025, at 6:55 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2025-09-22 01:28, Prakash Sangappa wrote:
>>> On Sep 8, 2025, at 3:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>> ..
>>> +enum rseq_slice_masks {
>>> + RSEQ_SLICE_EXT_REQUEST = (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
>>> + RSEQ_SLICE_EXT_GRANTED = (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>>> };
>>> 
>>> /*
>>> @@ -142,6 +164,12 @@ struct rseq {
>>> __u32 mm_cid;
>>> 
>>> /*
>>> + * Time slice extension control word. CPU local atomic updates from
>>> + * kernel and user space.
>>> + */
>>> + __u32 slice_ctrl;
>> We intend to backport the slice extension feature to older kernel versions.
>> With use of a new structure member for slice control, could there be discrepancy
>> with rseq structure size(older version) registered by libc?  In that case the application
>> may  not be able to use slice extension feature unless Libc’s use of rseq is disabled.
> 
> The rseq extension scheme allows this to seamlessly work.
> 
> You will need a glibc 2.41+, which uses the getauxval(3)
> AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN to query the feature size
> supported by the Linux kernel. It allocates a per-thread memory
> area which is large enough to support that feature set, and
> registers it to the kernel through rseq(2) on thread creation.

Ok, 

> 
> Note that before we had the extensible rseq scheme, glibc registered
> a 32-byte structure (including padding at the end), which is considered
> as the rseq "original" registration size.
> 
> The "mm_cid" field ends at 28 bytes, which leaves 4 bytes of padding at
> the end of the original rseq structure. Considering that the time slice
> extension fields will likely fit within those 4 bytes, I expect that
> applications linked against glibc [2.35, 2.40] will also be able to use
> those fields. Those applications should use getauxval(3)
> AT_RSEQ_FEATURE_SIZE to validate whether the kernel populates this field
> or if it's just padding.

The question was about the size of the rseq structure registered by glibc. If it is using
AT_RSEQ_FEATURE_SIZE to allocate the per-thread area for rseq, I suppose that
should be fine. However, the application would have to verify that __rseq_size is large
enough.

As for the kernel supporting slice extension, I expect the prctl(.., PR_RSEQ_SLICE_EXT_ENABLE)
call would return an error if it is not supported. Won't that be sufficient, or should the
application also check AT_RSEQ_FEATURE_SIZE?


> 
> Note that this all works even if you backport the feature to an older kernel:
> the rseq extension scheme does not depend on querying the kernel version at
> all. You will however be required to backport the support for additional
> rseq fields that come before the time slice, such as node_id and mm_cid,
> if they are not implemented in your older kernel.

Yes, we need to look at the changes that need to be backported, including the dependent
'rseq: Optimize exit to user space' changes from the other patch series.

> 
>> Application would have to verify structure size, so should it be mentioned  in the
>> documentation.
> 
> Yes, applications should check that the glibc's __rseq_size is large enough to fit
> the new slice field(s), *and* for the original rseq size special case
> (32 bytes including padding), those would need to query getauxval(3)
> AT_RSEQ_FEATURE_SIZE to make sure the field is indeed supported.
> 
> Also, perhaps make the prctl() enable call return error, if structure size
>> does not match?
> 
> That's not how the extensible scheme works.
> 
> Either glibc registers a 32-byte area (in which the time slice feature would
> fit), or it registers an area large enough to fit all kernel supported features,
> or it fails registration. And prctl() is per-process, whereas the rseq registration
> is per-thread, so it's kind of weird to make prctl() fail if the current
> thread's rseq is not registered.

I meant that the prctl(.., PR_RSEQ_SLICE_EXT_ENABLE) call is per-thread and
sets the enabled bit in the per-thread rseq. Could this fail if the rseq struct size is not large enough?

> 
>> With regards to application determining the address and size of rseq structure
>> registered by libc, what are you thoughts on getting that thru the rseq(2)
>> system call or a prctl() call instead of dealing with the __week symbols as was discussed here.
>> https://lore.kernel.org/all/F9DBABAD-ABF0-49AA-9A38-BD4D2BE78B94@oracle.com/
> 
> I think that the other leg of that email thread got to a resolution of both static and
> dynamic use-cases through use of an extern __weak symbol, no [1] ? Not that I am against
> adding a rseq(2) query for rseq address, size, and signature, but I just want to double
> check that it would be there for convenience and is not actually needed in the typical
> use-cases.

Yes, mainly for convenience. 

Thanks,
-Prakash

> 
> Thanks,
> 
> Mathieu
> 
> [1] https://lore.kernel.org/all/aKPFIQwg5zxSS5oS@google.com/
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 00/12] rseq: Implement time slice extension mechanism
  2025-09-22 14:09               ` Mathieu Desnoyers
@ 2025-09-23  1:01                 ` Prakash Sangappa
  0 siblings, 0 replies; 54+ messages in thread
From: Prakash Sangappa @ 2025-09-23  1:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Peter Zilstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch@vger.kernel.org, Florian Weimer,
	carlos@redhat.com, libc-coord@lists.openwall.com



> On Sep 22, 2025, at 7:09 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2025-09-19 13:30, Prakash Sangappa wrote:
>>> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>>>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>>>     has nesting ability.
>>>>>> 
>>>>>>     We should consider making the slice request ABI a 8-bit
>>>>>>     or 16-bit nesting counter to allow nesting of its users.
>>>>> 
>>>>> Making request a counter requires to keep request set when the
>>>>> extension is granted. So the states would be:
>>>>> 
>>>>>      request    granted
>>>>>      0          0               Neutral
>>>>>      >0         0               Requested
>>>>>      >=0        1               Granted
>>>> 
>>> 
>>> Second thoughts on this.
>>> 
> [...]
>>> 
>>> If user space wants nesting, then it can do so on its own without
>>> creating an ill defined and fragile kernel/user ABI. We created enough
>>> of them in the past and all of them resulted in long term headaches.
>> Guess user space should be able to handle nesting, possibly without the need of a counter?
>> AFAICS can’t the nested request, to extend the slice, be handled by checking
>> if both ‘REQUEST’ & ‘GRANTED’ bits are zero?  If so,  attempt to request
>> slice extension.  Otherwise If either REQUEST or GRANTED bit Is set, then a slice
>> extension has been already requested or granted.
> 
> I think you are onto something here. If we want independent pieces of
> software (e.g. libc and application) to allow nesting of time slice
> extension requests, without having to deal with a counter and the
> inevitable unbalance bugs (leak and underflow), we could require
> userspace to check the value of the request and granted flags. If both
> are zero, then it can set the request.
> 
> Then when userspace exits its critical section, it needs to remember
> whether it has set a request or not, so it does not clear a request
> too early if the request was set by an outer context. This requires
> handing over additional state (one bit) from "lock" to "unlock" though.

Yes, that is correct. Additional state will be required to track whether a slice extension
was requested in that context.

-Prakash

> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com


^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2025-09-23  1:02 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-09-08 22:59 ` [patch 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-09-09  0:04   ` Randy Dunlap
2025-09-11 15:41   ` Mathieu Desnoyers
2025-09-11 15:49     ` Mathieu Desnoyers
2025-09-22  5:28   ` Prakash Sangappa
2025-09-22  5:57     ` K Prateek Nayak
2025-09-22 13:57       ` Mathieu Desnoyers
2025-09-22 13:55     ` Mathieu Desnoyers
2025-09-23  0:57       ` Prakash Sangappa
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-09-09  3:10   ` K Prateek Nayak
2025-09-09  4:11     ` Randy Dunlap
2025-09-09 12:12       ` Thomas Gleixner
2025-09-09 16:01         ` Randy Dunlap
2025-09-11 15:42   ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
2025-09-11 15:43   ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
2025-09-11 15:50   ` Mathieu Desnoyers
2025-09-11 16:52     ` K Prateek Nayak
2025-09-11 17:18       ` Mathieu Desnoyers
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-09-09  9:52   ` K Prateek Nayak
2025-09-09 12:23     ` Thomas Gleixner
2025-09-10 11:15   ` K Prateek Nayak
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-09-10  5:22   ` K Prateek Nayak
2025-09-10  7:49     ` Thomas Gleixner
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-09-10 11:20   ` K Prateek Nayak
2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-09-09  8:14   ` K Prateek Nayak
2025-09-09 12:16     ` Thomas Gleixner
2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-09-10 11:23   ` K Prateek Nayak
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-10  4:42   ` K Prateek Nayak
2025-09-10 11:28 ` K Prateek Nayak
2025-09-10 14:50   ` Thomas Gleixner
2025-09-11  3:03     ` K Prateek Nayak
2025-09-11  7:36       ` Prakash Sangappa
2025-09-11 15:27 ` Mathieu Desnoyers
2025-09-11 20:18   ` Thomas Gleixner
2025-09-12 12:33     ` Mathieu Desnoyers
2025-09-12 16:31       ` Thomas Gleixner
2025-09-12 19:26         ` Mathieu Desnoyers
2025-09-13 13:02           ` Thomas Gleixner
2025-09-19 17:30             ` Prakash Sangappa
2025-09-22 14:09               ` Mathieu Desnoyers
2025-09-23  1:01                 ` Prakash Sangappa
