* [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance
@ 2015-09-11 18:37 Waiman Long
  2015-09-11 18:37 ` [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

v5->v6:
 - Added a new patch 1 to relax the cmpxchg and xchg operations in
   the native code path to reduce performance overhead on non-x86
   architectures.
 - Updated the unconditional PV kick patch as suggested by PeterZ.
 - Added a new patch to allow one lock stealing attempt at slowpath
   entry point to reduce performance penalty due to lock waiter
   preemption.
 - Removed the pending bit and kick-ahead patches as they didn't show
   any noticeable performance improvement on top of the lock stealing
   patch.
 - Simplified the adaptive spinning patch as the lock stealing patch
   allows more aggressive pv_wait() without much performance penalty
   in non-overcommitted VMs.

v4->v5:
 - Rebased the patch to the latest tip tree.
 - Corrected the comments and commit log for patch 1.
 - Removed the v4 patch 5 as PV kick deferment is no longer needed with
   the new tip tree.
 - Simplified the adaptive spinning patch (patch 6) & improved its
   performance a bit further.
 - Re-ran the benchmark test with the new patch.

v3->v4:
 - Patch 1: added a comment about a possible race condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Moved the PV unlock optimization patch forward to patch 4 & reran
   the performance test.

v2->v3:
 - Moved the deferred kicking enablement patch forward & moved back
   the kick-ahead patch to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted to using state as a tri-state variable instead of
   adding an additional bi-state variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Added a new patch to optimize PV unlock code path performance.

v1->v2:
 - Take out the queued unfair lock patches
 - Add a patch to simplify the PV unlock code
 - Move pending bit and statistics collection patches to the front
 - Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
   when appropriate.
 - Change the wait-early patch to use adaptive spinning to better
   balance the difference effect on normal and over-committed guests.
 - Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both regular and
over-committed VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.

Patch 1 relaxes the memory order restriction of atomic operations by
using less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This will reduce performance overhead when ported to other
non-x86 architectures.
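
As a condensed illustration (the full diff is in patch 1 below), the
generic fast-path acquisition switches from a fully ordered cmpxchg to
the acquire variant, which is all that taking a lock requires, while
the tail update only needs release ordering:

	/* before: fully ordered */
	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);

	/* after: acquire ordering is sufficient for lock acquisition */
	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);

	/* the tail code update only needs release ordering */
	old = atomic_cmpxchg_release(&lock->val, val, new);

On x86 these variants compile to the same instructions, but weakly
ordered architectures can avoid the extra barriers.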

Patch 2 simplifies the unlock code by removing an unnecessary vCPU
state check.

Patch 3 optimizes the PV unlock code path performance for the x86-64
architecture.

Patch 4 allows the collection of various event counts that are useful
for seeing what is happening in the system. They do add a bit of
overhead when enabled, slowing performance a tiny bit.

Patch 5 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.
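
In essence, the slowpath entry point makes a single unfair trylock
before queuing; condensed from the pv_queued_spin_trylock_unfair()
added in patch 5 below:

	static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
	{
		struct __qspinlock *l = (void *)lock;

		if (READ_ONCE(l->locked))
			return false;
		/*
		 * Wait a bit so that an actively spinning vCPU has a fair
		 * chance of getting the lock first.
		 */
		cpu_relax();

		return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
	}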

Patch 6 enables adaptive spinning in the queue nodes. This patch
leads to a further performance improvement in over-committed guests,
though the gain is not as big as that of the previous patch.

Waiman Long (6):
  locking/qspinlock: relaxes cmpxchg & xchg ops in native code
  locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock: Allow 1 lock stealing attempt
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    7 +
 arch/x86/include/asm/qspinlock.h          |    2 +-
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++++
 include/asm-generic/qspinlock.h           |    6 +-
 kernel/locking/qspinlock.c                |   45 +++--
 kernel/locking/qspinlock_paravirt.h       |  378 +++++++++++++++++++++++++----
 6 files changed, 431 insertions(+), 66 deletions(-)



* [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
  2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
@ 2015-09-11 18:37 ` Waiman Long
  2015-09-11 22:27   ` Davidlohr Bueso
  2015-09-11 18:37 ` [PATCH v6 2/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with more relaxed versions of those calls to enable
other architectures to adopt queued spinlocks with less performance
overhead.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/qspinlock.h |    2 +-
 include/asm-generic/qspinlock.h  |    6 +++---
 kernel/locking/qspinlock.c       |   21 +++++++++++++++++----
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 9d51fae..053e70d 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -46,7 +46,7 @@ static inline bool virt_queued_spin_lock(struct qspinlock *lock)
 	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
 		return false;
 
-	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+	while (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) != 0)
 		cpu_relax();
 
 	return true;
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 83bfb87..efbd1fd 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -62,7 +62,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
 	if (!atomic_read(&lock->val) &&
-	   (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
 		return 1;
 	return 0;
 }
@@ -77,7 +77,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
 	u32 val;
 
-	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
 	if (likely(val == 0))
 		return;
 	queued_spin_lock_slowpath(lock, val);
@@ -93,7 +93,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * smp_mb__before_atomic() in order to guarantee release semantics
 	 */
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_sub(_Q_LOCKED_VAL, &lock->val);
 }
 #endif
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 337c881..28a15c7 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -176,7 +176,12 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	struct __qspinlock *l = (void *)lock;
 
-	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+	/*
+	 * Use release semantics to make sure that the MCS node is properly
+	 * initialized before changing the tail code.
+	 */
+	return (u32)xchg_release(&l->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
 #else /* _Q_PENDING_BITS == 8 */
@@ -208,7 +213,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Use release semantics to make sure that the MCS node is
+		 * properly initialized before changing the tail code.
+		 */
+		old = atomic_cmpxchg_release(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -319,7 +328,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		if (val == new)
 			new |= _Q_PENDING_VAL;
 
-		old = atomic_cmpxchg(&lock->val, val, new);
+		old = atomic_cmpxchg_acquire(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -426,7 +435,11 @@ queue:
 			set_locked(lock);
 			break;
 		}
-		old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+		/*
+		 * The smp_load_acquire() call above has provided the necessary
+		 * acquire semantics required for locking.
+		 */
+		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
 			goto release;	/* No contention */
 
-- 
1.7.1



* [PATCH v6 2/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
  2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
  2015-09-11 18:37 ` [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
@ 2015-09-11 18:37 ` Waiman Long
  2015-09-18  8:50   ` [tip:locking/core] locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL tip-bot for Waiman Long
  2015-09-11 18:37 ` [PATCH v6 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
The extra check at the end of __pv_queued_spin_unlock() is unnecessary
and so is removed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
---
 kernel/locking/qspinlock_paravirt.h |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index c8e6e9a..f0450ff 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -267,7 +267,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 		}
 
 		if (!lp) { /* ONCE */
-			WRITE_ONCE(pn->state, vcpu_hashed);
 			lp = pv_hash(lock, pn);
 
 			/*
@@ -275,11 +274,9 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
 			 * we'll be sure to be able to observe our hash entry.
 			 *
-			 *   [S] pn->state
 			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
 			 *       MB                           RMB
 			 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
-			 *                                [L] pn->state
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
@@ -364,8 +361,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 * vCPU is harmless other than the additional latency in completing
 	 * the unlock.
 	 */
-	if (READ_ONCE(node->state) == vcpu_hashed)
-		pv_kick(node->cpu);
+	pv_kick(node->cpu);
 }
 /*
  * Include the architecture specific callee-save thunk of the
-- 
1.7.1



* [PATCH v6 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path
  2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
  2015-09-11 18:37 ` [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
  2015-09-11 18:37 ` [PATCH v6 2/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL Waiman Long
@ 2015-09-11 18:37 ` Waiman Long
  2015-09-11 18:37 ` [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:
 1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath
    as simple as possible.
 2) Hand-coding, for x86-64, an assembly function that combines the
    register saving thunk code with the fastpath code. Only registers
    that are used in the fastpath will be saved and restored. If the
    fastpath fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores
    only one 32-bit register.

With a microbenchmark of 5M lock-unlock loop, the table below shows
the execution times before and after the patch with different number
of threads in a VM running on a 32-core Westmere-EX box with x86-64
4.2-rc1 based kernels:

  Threads	Before patch	After patch	% Change
  -------	------------	-----------	--------
     1		   134.1 ms	  119.3 ms	  -11%
     2		   1286  ms	   953  ms	  -26%
     3		   3715  ms	  3480  ms	  -6.3%
     4		   4092  ms	  3764  ms	  -8.0%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++++++++++++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h       |   43 +++++++++++++--------
 2 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
 #ifndef __ASM_QSPINLOCK_PARAVIRT_H
 #define __ASM_QSPINLOCK_PARAVIRT_H
 
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock	__pv_queued_spin_unlock
+#define PV_UNLOCK		"__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH	"__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the register saving thunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ *	struct __qspinlock *l = (void *)lock;
+ *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ *	if (likely(lockval == _Q_LOCKED_VAL))
+ *		return;
+ *	pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ *   rdi = lock 	(first argument)
+ *   rsi = lockval	(second argument)
+ *   rdx = internal variable (set to 0)
+ */
+asm(".pushsection .text;"
+    ".globl " PV_UNLOCK ";"
+    ".align 4,0x90;"
+    PV_UNLOCK ": "
+    "push  %rdx;"
+    "mov   $0x1,%eax;"
+    "xor   %edx,%edx;"
+    "lock cmpxchg %dl,(%rdi);"
+    "cmp   $0x1,%al;"
+    "jne   .slowpath;"
+    "pop   %rdx;"
+    "ret;"
+    ".slowpath: "
+    "push   %rsi;"
+    "movzbl %al,%esi;"
+    "call " PV_UNLOCK_SLOWPATH ";"
+    "pop    %rsi;"
+    "pop    %rdx;"
+    "ret;"
+    ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+    ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
 PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
 
+#endif /* CONFIG_64BIT */
 #endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f0450ff..4bd323d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -308,23 +308,14 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
 	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
-	u8 locked;
-
-	/*
-	 * We must not unlock if SLOW, because in that case we must first
-	 * unhash. Otherwise it would be possible to have multiple @lock
-	 * entries, which would be BAD.
-	 */
-	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
-	if (likely(locked == _Q_LOCKED_VAL))
-		return;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
 		WARN(!debug_locks_silent,
@@ -363,12 +354,32 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 */
 	pv_kick(node->cpu);
 }
+
 /*
  * Include the architecture specific callee-save thunk of the
  * __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
  */
 #include <asm/qspinlock_paravirt.h>
 
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+	u8 locked;
+
+	/*
+	 * We must not unlock if SLOW, because in that case we must first
+	 * unhash. Otherwise it would be possible to have multiple @lock
+	 * entries, which would be BAD.
+	 */
+	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+	if (likely(locked == _Q_LOCKED_VAL))
+		return;
+
+	__pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif /* __pv_queued_spin_unlock */
-- 
1.7.1



* [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
                   ` (2 preceding siblings ...)
  2015-09-11 18:37 ` [PATCH v6 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
@ 2015-09-11 18:37 ` Waiman Long
  2015-09-11 23:13   ` Davidlohr Bueso
  2015-09-11 18:37 ` [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
  2015-09-11 18:37 ` [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  5 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of kicking and
wakeup latencies which have a heavy dependency on the CPUs being used.

The measured latencies for different CPUs are:

	CPU		Wakeup		Kicking
	---		------		-------
	Haswell-EX	63.6us		 7.4us
	Westmere-EX	67.6us		 9.3us

The measured latencies varied a bit from run-to-run. The wakeup
latency is much higher than the kicking latency.

A sample of statistics counts after system bootup (with vCPU
overcommit) was:

hash_hops_count=9001
kick_latencies=138047878
kick_unlock_count=9001
kick_wait_count=9000
spurious_wakeup=3
wait_again_count=2
wait_head_count=10
wait_node_count=8994
wake_latencies=713195944

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/Kconfig                    |    7 ++
 kernel/locking/qspinlock_paravirt.h |  171 ++++++++++++++++++++++++++++++++++-
 2 files changed, 173 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f37010f..d08828f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -719,6 +719,13 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config QUEUED_LOCK_STAT
+	bool "Paravirt queued lock statistics"
+	depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+	---help---
+	  Enable the collection of statistical data on the behavior of
+	  paravirtualized queued spinlocks and report them on debugfs.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4bd323d..2d71768 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,147 @@ struct pv_node {
 };
 
 /*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+	pvstat_wait_head,
+	pvstat_wait_node,
+	pvstat_wait_again,
+	pvstat_kick_wait,
+	pvstat_kick_unlock,
+	pvstat_spurious,
+	pvstat_hops,
+	pvstat_num	/* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+	[pvstat_wait_head]   = "wait_head_count",
+	[pvstat_wait_node]   = "wait_node_count",
+	[pvstat_wait_again]  = "wait_again_count",
+	[pvstat_kick_wait]   = "kick_wait_count",
+	[pvstat_kick_unlock] = "kick_unlock_count",
+	[pvstat_spurious]    = "spurious_wakeup",
+	[pvstat_hops]	     = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency   = pv_kick_latencies/kick_unlock_count
+ * Avg wake latency   = pv_wake_latencies/kick_wait_count
+ * Avg # of hops/hash = hash_hops_count/kick_unlock_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+	struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+	int i;
+
+	if (!d_pvqlock)
+		pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+	for (i = 0; i < pvstat_num; i++)
+		debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+				  (u32 *)&pvstats[i]);
+	debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+			   (u64 *)&pv_kick_latencies);
+	debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+			   (u64 *)&pv_wake_latencies);
+	debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+	return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+	int i;
+
+	for (i = 0; i < pvstat_num; i++)
+		atomic_set(&pvstats[i], 0);
+	atomic64_set(&pv_kick_latencies, 0);
+	atomic64_set(&pv_wake_latencies, 0);
+	reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+	atomic_inc(&pvstats[stat]);
+	if (unlikely(reset_cnts))
+		pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+	atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+	u64 start = sched_clock();
+
+	*per_cpu_ptr(&pv_kick_time, cpu) = start;
+	pv_kick(cpu);
+	atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+	*pkick_time = 0;
+	pv_wait(ptr, val);
+	if (*pkick_time) {
+		atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+		pvstat_inc(pvstat_kick_wait);
+	}
+}
+
+#define pv_kick(c)	__pv_kick(c)
+#define pv_wait(p, v)	__pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat)	{ }
+static inline void pvstat_hop(int hopcnt)		{ }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
  * Lock and MCS node addresses hash table for fast lookup
  *
  * Hashing is done on a per-cacheline basis to minimize the need to access
@@ -100,10 +241,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
 	unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
 	struct pv_hash_entry *he;
+	int hopcnt = 0;
 
 	for_each_hash_entry(he, offset, hash) {
+		hopcnt++;
 		if (!cmpxchg(&he->lock, NULL, lock)) {
 			WRITE_ONCE(he->node, node);
+			pvstat_hop(hopcnt);
 			return &he->lock;
 		}
 	}
@@ -164,9 +308,10 @@ static void pv_init_node(struct mcs_spinlock *node)
 static void pv_wait_node(struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	int waitcnt = 0;
 	int loop;
 
-	for (;;) {
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
@@ -184,15 +329,22 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 */
 		smp_store_mb(pn->state, vcpu_halted);
 
-		if (!READ_ONCE(node->locked))
+		if (!READ_ONCE(node->locked)) {
+			pvstat_inc(pvstat_wait_node);
+			if (waitcnt)
+				pvstat_inc(pvstat_wait_again);
 			pv_wait(&pn->state, vcpu_halted);
+		}
 
 		/*
-		 * If pv_kick_node() changed us to vcpu_hashed, retain that value
-		 * so that pv_wait_head() knows to not also try to hash this lock.
+		 * If pv_kick_node() changed us to vcpu_hashed, retain that
+		 * value so that pv_wait_head() knows to not also try to hash
+		 * this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
+		if (READ_ONCE(node->locked))
+			break;
 		/*
 		 * If the locked flag is still not set after wakeup, it is a
 		 * spurious wakeup and the vCPU should wait again. However,
@@ -200,6 +352,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 * So it is better to spin for a while in the hope that the
 		 * MCS lock will be released soon.
 		 */
+		pvstat_inc(pvstat_spurious);
 	}
 
 	/*
@@ -250,6 +403,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
+	int waitcnt = 0;
 	int loop;
 
 	/*
@@ -259,7 +413,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	if (READ_ONCE(pn->state) == vcpu_hashed)
 		lp = (struct qspinlock **)1;
 
-	for (;;) {
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (!READ_ONCE(l->locked))
 				return;
@@ -290,14 +444,20 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 				return;
 			}
 		}
+		pvstat_inc(pvstat_wait_head);
+		if (waitcnt)
+			pvstat_inc(pvstat_wait_again);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
+		if (!READ_ONCE(l->locked))
+			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
 		 * wakeup and so the vCPU should wait again after spinning for
 		 * a while.
 		 */
+		pvstat_inc(pvstat_spurious);
 	}
 
 	/*
@@ -352,6 +512,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * vCPU is harmless other than the additional latency in completing
 	 * the unlock.
 	 */
+	pvstat_inc(pvstat_kick_unlock);
 	pv_kick(node->cpu);
 }
 
-- 
1.7.1



* [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
                   ` (3 preceding siblings ...)
  2015-09-11 18:37 ` [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-09-11 18:37 ` Waiman Long
  2015-09-14 13:57   ` Peter Zijlstra
                     ` (2 more replies)
  2015-09-11 18:37 ` [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  5 siblings, 3 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath.  This helps to reduce the performance
penalty caused by lock waiter preemption while not having much of
the downsides of a real unfair lock.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                    Haswell
  Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
  -----         --------    --------    --------    --------
  Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
  After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s

For the over-committed case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |   19 +++---
 kernel/locking/qspinlock_paravirt.h |  116 ++++++++++++++++++++++++++++-------
 2 files changed, 102 insertions(+), 33 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 28a15c7..1be1aab 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,17 +248,15 @@ static __always_inline void set_locked(struct qspinlock *lock)
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct qspinlock *lock,
-					   struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_head(struct qspinlock *lock,
-					   struct mcs_spinlock *node) { }
-
+static __always_inline bool __pv_wait_head_and_lock(struct qspinlock *lock,
+						    struct mcs_spinlock *node,
+						    u32 tail)
+						    { return false; }
 #define pv_enabled()		false
 
 #define pv_init_node		__pv_init_node
 #define pv_wait_node		__pv_wait_node
-#define pv_kick_node		__pv_kick_node
-#define pv_wait_head		__pv_wait_head
+#define pv_wait_head_and_lock	__pv_wait_head_and_lock
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
@@ -416,7 +414,8 @@ queue:
 	 * does not imply a full barrier.
 	 *
 	 */
-	pv_wait_head(lock, node);
+	if (pv_wait_head_and_lock(lock, node, tail))
+		goto release;
 	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
 		cpu_relax();
 
@@ -453,7 +452,6 @@ queue:
 		cpu_relax();
 
 	arch_mcs_spin_unlock_contended(&next->locked);
-	pv_kick_node(lock, next);
 
 release:
 	/*
@@ -474,8 +472,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
 #undef pv_init_node
 #undef pv_wait_node
-#undef pv_kick_node
-#undef pv_wait_head
+#undef pv_wait_head_and_lock
 
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2d71768..9fd49a2 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,30 @@ struct pv_node {
 };
 
 /*
+ * Allow one unfair trylock when entering the PV slowpath to reduce the
+ * performance impact of lock waiter preemption (either explicitly via
+ * pv_wait or implicitly via PLE).
+ *
+ * A little bit of unfairness here can improve performance without many
+ * of the downsides of a real unfair lock.
+ */
+#define queued_spin_trylock(l)	pv_queued_spin_trylock_unfair(l)
+static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	if (READ_ONCE(l->locked))
+		return 0;
+	/*
+	 * Wait a bit here to ensure that an actively spinning vCPU has a fair
+	 * chance of getting the lock.
+	 */
+	cpu_relax();
+
+	return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
+}
+
+/*
  * PV qspinlock statistics
  */
 enum pv_qlock_stat {
@@ -51,6 +75,7 @@ enum pv_qlock_stat {
 	pvstat_kick_unlock,
 	pvstat_spurious,
 	pvstat_hops,
+	pvstat_utrylock,
 	pvstat_num	/* Total number of statistics counts */
 };
 
@@ -69,6 +94,7 @@ static const char * const stat_fsnames[pvstat_num] = {
 	[pvstat_kick_unlock] = "kick_unlock_count",
 	[pvstat_spurious]    = "spurious_wakeup",
 	[pvstat_hops]	     = "hash_hops_count",
+	[pvstat_utrylock]    = "utrylock_count",
 };
 
 static atomic_t pvstats[pvstat_num];
@@ -145,6 +171,20 @@ static inline void pvstat_hop(int hopcnt)
 }
 
 /*
+ * PV unfair trylock count
+ */
+static inline int pvstat_trylock_unfair(struct qspinlock *lock)
+{
+	int ret = pv_queued_spin_trylock_unfair(lock);
+
+	if (ret)
+		pvstat_inc(pvstat_utrylock);
+	return ret;
+}
+#undef  queued_spin_trylock
+#define queued_spin_trylock(l)	pvstat_trylock_unfair(l)
+
+/*
  * Replacement function for pv_kick()
  */
 static inline void __pv_kick(int cpu)
@@ -338,8 +378,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
 
 		/*
 		 * If pv_kick_node() changed us to vcpu_hashed, retain that
-		 * value so that pv_wait_head() knows to not also try to hash
-		 * this lock.
+		 * value so that pv_wait_head_and_lock() knows to not also
+		 * try to hash this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -365,8 +405,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
 /*
  * Called after setting next->locked = 1 when we're the lock owner.
  *
- * Instead of waking the waiters stuck in pv_wait_node() advance their state such
- * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_and_lock(), this avoids a
+ * wake/sleep cycle.
  */
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
@@ -398,13 +439,15 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
  * Wait for l->locked to become clear; halt the vcpu after a short spin.
  * __pv_queued_spin_unlock() will wake us.
  */
-static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+static int pv_wait_head_and_lock(struct qspinlock *lock,
+				 struct mcs_spinlock *node, u32 tail)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	struct mcs_spinlock *next;
 	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
-	int loop;
+	int loop, old;
 
 	/*
 	 * If pv_kick_node() already advanced our state, we don't need to
@@ -415,8 +458,12 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 
 	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
-			if (!READ_ONCE(l->locked))
-				return;
+			/*
+			 * Try to acquire the lock when it is free.
+			 */
+			if (!READ_ONCE(l->locked) &&
+			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
+				goto gotlock;
 			cpu_relax();
 		}
 
@@ -434,14 +481,15 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
 				/*
-				 * The lock is free and _Q_SLOW_VAL has never
-				 * been set. Therefore we need to unhash before
-				 * getting the lock.
+				 * The lock was free and now we own the lock.
+				 * Change the lock value back to _Q_LOCKED_VAL
+				 * and unhash the table.
 				 */
+				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
-				return;
+				goto gotlock;
 			}
 		}
 		pvstat_inc(pvstat_wait_head);
@@ -449,22 +497,46 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			pvstat_inc(pvstat_wait_again);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
-		if (!READ_ONCE(l->locked))
-			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup and so the vCPU should wait again after spinning for
-		 * a while.
+		 * wakeup or another vCPU has stolen the lock. The current
+		 * vCPU should spin again.
 		 */
-		pvstat_inc(pvstat_spurious);
+		if (READ_ONCE(l->locked))
+			pvstat_inc(pvstat_spurious);
 	}
 
+gotlock:
 	/*
-	 * Lock is unlocked now; the caller will acquire it without waiting.
-	 * As with pv_wait_node() we rely on the caller to do a load-acquire
-	 * for us.
+	 * We now have the lock. We need to either clear the tail code or
+	 * notify the next one in queue as the new queue head.
 	 */
+	old = atomic_read(&lock->val);
+	while ((old & _Q_TAIL_MASK) == tail) {
+		int val;
+		int new = old & ~_Q_TAIL_MASK;
+
+		/*
+		 * We are the only one in the queue, so clear the tail code
+		 * and return.
+		 */
+		val = atomic_cmpxchg(&lock->val, old, new);
+		if (old == val)
+			goto done;
+		old = val;
+	}
+
+	/*
+	 * contended path; wait for next, release.
+	 */
+	while (!(next = READ_ONCE(node->next)))
+		cpu_relax();
+
+	arch_mcs_spin_unlock_contended(&next->locked);
+	pv_kick_node(lock, next);
+done:
+	return 1;
 }
 
 /*
@@ -489,7 +561,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * so we need a barrier to order the read of the node data in
 	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
 	 *
-	 * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
+	 * Matches the cmpxchg() in pv_wait_head_and_lock() setting _Q_SLOW_VAL.
 	 */
 	smp_rmb();
 
-- 
1.7.1



* [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
                   ` (4 preceding siblings ...)
  2015-09-11 18:37 ` [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-09-11 18:37 ` Waiman Long
  2015-09-14 14:10   ` Peter Zijlstra
  5 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-11 18:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock any time soon, as the earlier ones have to be woken up
and take their turn to get the lock.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the following conditions are true:

 1) the vCPU has not been halted before;
 2) the previous vCPU is not running.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

		    Westmere			Haswell
  Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
  -----		--------    --------    --------    --------
  Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
  After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |    5 ++-
 kernel/locking/qspinlock_paravirt.h |   52 +++++++++++++++++++++++++++++++++-
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1be1aab..319e823 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -247,7 +247,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+					   struct mcs_spinlock *prev) { }
 static __always_inline bool __pv_wait_head_and_lock(struct qspinlock *lock,
 						    struct mcs_spinlock *node,
 						    u32 tail)
@@ -398,7 +399,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node);
+		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 	}
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 9fd49a2..57ed38b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,6 +23,22 @@
 #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
 
 /*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will stop spinning if the following conditions are true:
+ * 1) vCPU in the previous node is not running
+ * 2) current vCPU has not been halted before
+ *
+ * The one lock stealing attempt allowed at slowpath entry mitigates the
+ * slight slowdown that this aggressive wait-early mechanism would otherwise
+ * cause for non-overcommitted guests.
+ *
+ * The status of the previous node will be checked at fixed interval
+ * controlled by PV_PREV_CHECK_MASK.
+ */
+#define PV_PREV_CHECK_MASK	0xff
+
+/*
  * Queue node uses: vcpu_running & vcpu_halted.
  * Queue head uses: vcpu_running & vcpu_hashed.
  */
@@ -71,6 +87,7 @@ enum pv_qlock_stat {
 	pvstat_wait_head,
 	pvstat_wait_node,
 	pvstat_wait_again,
+	pvstat_wait_early,
 	pvstat_kick_wait,
 	pvstat_kick_unlock,
 	pvstat_spurious,
@@ -90,6 +107,7 @@ static const char * const stat_fsnames[pvstat_num] = {
 	[pvstat_wait_head]   = "wait_head_count",
 	[pvstat_wait_node]   = "wait_node_count",
 	[pvstat_wait_again]  = "wait_again_count",
+	[pvstat_wait_early]  = "wait_early_count",
 	[pvstat_kick_wait]   = "kick_wait_count",
 	[pvstat_kick_unlock] = "kick_unlock_count",
 	[pvstat_spurious]    = "spurious_wakeup",
@@ -328,6 +346,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 }
 
 /*
+ * Return true if it is time to check the previous node and that node is
+ * not in a running state.
+ */
+static inline bool
+pv_wait_early(struct pv_node *node, struct pv_node *prev, int loop)
+{
+
+	if ((loop & PV_PREV_CHECK_MASK) != 0)
+		return false;
+
+	return READ_ONCE(prev->state) != vcpu_running;
+}
+
+/*
  * Initialize the PV part of the mcs_spinlock node.
  */
 static void pv_init_node(struct mcs_spinlock *node)
@@ -345,16 +377,25 @@ static void pv_init_node(struct mcs_spinlock *node)
  * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
  * behalf.
  */
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	struct pv_node *pp = (struct pv_node *)prev;
 	int waitcnt = 0;
 	int loop;
+	bool wait_early;
 
 	for (;; waitcnt++) {
-		for (loop = SPIN_THRESHOLD; loop; loop--) {
+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
+			/*
+			 * Wait early only if it has not been halted before.
+			 */
+			if (!waitcnt && pv_wait_early(pn, pp, loop)) {
+				wait_early = true;
+				break;
+			}
 			cpu_relax();
 		}
 
@@ -457,6 +498,12 @@ static int pv_wait_head_and_lock(struct qspinlock *lock,
 		lp = (struct qspinlock **)1;
 
 	for (;; waitcnt++) {
+		/*
+		 * Set correct vCPU state to be used by queue node wait-early
+		 * mechanism.
+		 */
+		WRITE_ONCE(pn->state, vcpu_running);
+
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			/*
 			 * Try to acquire the lock when it is free.
@@ -492,6 +539,7 @@ static int pv_wait_head_and_lock(struct qspinlock *lock,
 				goto gotlock;
 			}
 		}
+		WRITE_ONCE(pn->state, vcpu_halted);
 		pvstat_inc(pvstat_wait_head);
 		if (waitcnt)
 			pvstat_inc(pvstat_wait_again);
-- 
1.7.1



* Re: [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
  2015-09-11 18:37 ` [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
@ 2015-09-11 22:27   ` Davidlohr Bueso
  2015-09-14 12:06     ` Peter Zijlstra
  2015-09-14 15:16     ` Waiman Long
  0 siblings, 2 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2015-09-11 22:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On Fri, 11 Sep 2015, Waiman Long wrote:

>@@ -46,7 +46,7 @@ static inline bool virt_queued_spin_lock(struct qspinlock *lock)
> 	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> 		return false;
>
>-	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
>+	while (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) != 0)
> 		cpu_relax();

This code has changed with Peter's recent ccas fix. And the whole virt_queued_spin_lock()
thing will now be under pv configs. So this doesn't apply to native code anymore, so it
looks like it should be dropped altogether.

Thanks,
Davidlohr


* Re: [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-09-11 18:37 ` [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-09-11 23:13   ` Davidlohr Bueso
  2015-09-14 15:25     ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2015-09-11 23:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On Fri, 11 Sep 2015, Waiman Long wrote:

>A sample of statistics counts after system bootup (with vCPU
>overcommit) was:
>
>hash_hops_count=9001
>kick_latencies=138047878
>kick_unlock_count=9001
>kick_wait_count=9000
>spurious_wakeup=3
>wait_again_count=2
>wait_head_count=10
>wait_node_count=8994
>wake_latencies=713195944

Any reason you chose not to make the stats per-cpu? The locking
numbers don't have to be exact, so you can easily get away with
it and suffer from much less overhead than resorting to atomics.
Obviously assuming that reading/collecting the stats is done
infrequently, such as between workloads or at bootup as you did.
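
Something along these lines, reusing the names from patch 4 (just a
sketch, not even compile tested):

	/* per-cpu counters instead of global atomics */
	static DEFINE_PER_CPU(unsigned long, pvstats[pvstat_num]);

	static inline void pvstat_inc(enum pv_qlock_stat stat)
	{
		this_cpu_inc(pvstats[stat]);
	}

	/* only summed up when somebody actually reads the stats */
	static unsigned long pvstat_read(enum pv_qlock_stat stat)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu(pvstats[stat], cpu);
		return sum;
	}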

Thanks,
Davidlohr


* Re: [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
  2015-09-11 22:27   ` Davidlohr Bueso
@ 2015-09-14 12:06     ` Peter Zijlstra
  2015-09-14 18:40       ` Waiman Long
  2015-09-14 15:16     ` Waiman Long
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-14 12:06 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On Fri, Sep 11, 2015 at 03:27:44PM -0700, Davidlohr Bueso wrote:
> On Fri, 11 Sep 2015, Waiman Long wrote:
> 
> >@@ -46,7 +46,7 @@ static inline bool virt_queued_spin_lock(struct qspinlock *lock)
> >	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> >		return false;
> >
> >-	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
> >+	while (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) != 0)
> >		cpu_relax();
> 
> This code has changed with Peter's recent ccas fix. And the whole virt_queued_spin_lock()
> thing will now be under pv configs. So this doesn't apply to native code anymore, so it
> looks like it should be dropped altogether.

Yeah, it also doesn't make sense, this is x86 arch code, and x86 cannot do
cmpxchg_acquire. Then again, I suppose we could argue it's of
documentation value.
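
(Without an arch-specific definition, the generic wrappers simply map
the _acquire form back onto the fully ordered op, roughly:

	/* include/linux/atomic.h fallback, roughly */
	#ifndef atomic_cmpxchg_acquire
	#define atomic_cmpxchg_acquire	atomic_cmpxchg
	#endif

so on x86 the change is purely cosmetic.)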


* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-11 18:37 ` [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-09-14 13:57   ` Peter Zijlstra
  2015-09-14 19:02     ` Waiman Long
  2015-09-14 14:00   ` Peter Zijlstra
  2015-09-14 14:04   ` Peter Zijlstra
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-14 13:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
> +#define queued_spin_trylock(l)	pv_queued_spin_trylock_unfair(l)
> +static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
> +{
> +	struct __qspinlock *l = (void *)lock;
> +
> +	if (READ_ONCE(l->locked))
> +		return 0;
> +	/*
> +	 * Wait a bit here to ensure that an actively spinning vCPU has a fair
> +	 * chance of getting the lock.
> +	 */
> +	cpu_relax();
> +
> +	return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
> +}

> +static inline int pvstat_trylock_unfair(struct qspinlock *lock)
> +{
> +	int ret = pv_queued_spin_trylock_unfair(lock);
> +
> +	if (ret)
> +		pvstat_inc(pvstat_utrylock);
> +	return ret;
> +}
> +#undef  queued_spin_trylock
> +#define queued_spin_trylock(l)	pvstat_trylock_unfair(l)

These aren't actually ever used...


* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-11 18:37 ` [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
  2015-09-14 13:57   ` Peter Zijlstra
@ 2015-09-14 14:00   ` Peter Zijlstra
  2015-09-14 19:15     ` Waiman Long
  2015-09-14 14:04   ` Peter Zijlstra
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-14 14:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
> This patch allows one attempt for the lock waiter to steal the lock
> when entering the PV slowpath.  This helps to reduce the performance
> penalty caused by lock waiter preemption while not having much of
> the downsides of a real unfair lock.

> @@ -415,8 +458,12 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>  
>  	for (;; waitcnt++) {
>  		for (loop = SPIN_THRESHOLD; loop; loop--) {
> -			if (!READ_ONCE(l->locked))
> -				return;
> +			/*
> +			 * Try to acquire the lock when it is free.
> +			 */
> +			if (!READ_ONCE(l->locked) &&
> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
> +				goto gotlock;
>  			cpu_relax();
>  		}
>  

This isn't _once_, this is once per 'wakeup'. And note that interrupts
unrelated to the kick can equally wake the vCPU up.



* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-11 18:37 ` [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
  2015-09-14 13:57   ` Peter Zijlstra
  2015-09-14 14:00   ` Peter Zijlstra
@ 2015-09-14 14:04   ` Peter Zijlstra
  2015-09-14 19:19     ` Waiman Long
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-14 14:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
> This patch allows one attempt for the lock waiter to steal the lock
> when entering the PV slowpath.  This helps to reduce the performance
> penalty caused by lock waiter preemption while not having much of
> the downsides of a real unfair lock.


> @@ -416,7 +414,8 @@ queue:
>  	 * does not imply a full barrier.
>  	 *
>  	 */

If it really were once, like the Changelog says it is, then you could
have simply added:

	if (pv_try_steal_lock(...))
		goto release;

here, and not wrecked pv_wait_head() like you did. Note that if you do
it like this, you also do not need to play games with the hash, because
you'll never get into that situation.

> -	pv_wait_head(lock, node);
> +	if (pv_wait_head_and_lock(lock, node, tail))
> +		goto release;
>  	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
>  		cpu_relax();
>  


* Re: [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-09-11 18:37 ` [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
@ 2015-09-14 14:10   ` Peter Zijlstra
  2015-09-14 19:37     ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-14 14:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Sep 11, 2015 at 02:37:38PM -0400, Waiman Long wrote:
> In an overcommitted guest where some vCPUs have to be halted to make
> forward progress in other areas, it is highly likely that a vCPU later
> in the spinlock queue will be spinning while the ones earlier in the
> queue would have been halted. The spinning in the later vCPUs is then
> just a waste of precious CPU cycles because they are not going to
> get the lock soon as the earlier ones have to be woken up and take
> their turn to get the lock.
> 
> This patch implements an adaptive spinning mechanism where the vCPU
> will call pv_wait() if the following conditions are true:
> 
>  1) the vCPU has not been halted before;
>  2) the previous vCPU is not running.

Why 1? For the mutex adaptive stuff we only care about the lock holder
running, right?


* Re: [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
  2015-09-11 22:27   ` Davidlohr Bueso
  2015-09-14 12:06     ` Peter Zijlstra
@ 2015-09-14 15:16     ` Waiman Long
  1 sibling, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-14 15:16 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On 09/11/2015 06:27 PM, Davidlohr Bueso wrote:
> On Fri, 11 Sep 2015, Waiman Long wrote:
>
>> @@ -46,7 +46,7 @@ static inline bool virt_queued_spin_lock(struct 
>> qspinlock *lock)
>>     if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>         return false;
>>
>> -    while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
>> +    while (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) != 0)
>>         cpu_relax();
>
> This code has changed with Peter's recent ccas fix. And the whole 
> virt_queued_spin_lock()
> thing will now be under pv configs. So this doesn't apply to native 
> code anymore, so it
> looks like it should be dropped altogether.
>
> Thanks,
> Davidlohr

You are right. Patch 1 needs to be updated on top of PeterZ latest patch.

Cheers,
Longman


* Re: [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-09-11 23:13   ` Davidlohr Bueso
@ 2015-09-14 15:25     ` Waiman Long
  2015-09-14 21:41       ` Davidlohr Bueso
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-14 15:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On 09/11/2015 07:13 PM, Davidlohr Bueso wrote:
> On Fri, 11 Sep 2015, Waiman Long wrote:
>
>> A sample of statistics counts after system bootup (with vCPU
>> overcommit) was:
>>
>> hash_hops_count=9001
>> kick_latencies=138047878
>> kick_unlock_count=9001
>> kick_wait_count=9000
>> spurious_wakeup=3
>> wait_again_count=2
>> wait_head_count=10
>> wait_node_count=8994
>> wake_latencies=713195944
>
> Any reason you chose not to make the stats per-cpu? The locking
> numbers don't have to be exact, so you can easily get away with
> it and suffer from much less overhead than resorting to atomics.
> Obviously assuming that reading/collecting the stats is done
> infrequently, such as between workloads or at bootup as you did.
>
> Thanks,
> Davidlohr

We can't use debugfs if we want to have per-cpu stats; we will have to
use sysfs instead, which will require more code changes. It is certainly
doable, but we have to choose between simplicity and performance
overhead. Right now, I am assuming that the PV lockstat code is used
primarily for debugging purposes and won't be enabled on production
systems. If we want to have this capability in production systems, we
will certainly need to change it to per-cpu stats and use sysfs instead.

The original PV ticketlock code used debugfs and I was just following
in its footsteps. Do you think it is worthwhile to have this capability
available on production systems by default?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
  2015-09-14 12:06     ` Peter Zijlstra
@ 2015-09-14 18:40       ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-14 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	x86, linux-kernel, Scott J Norton, Douglas Hatch

On 09/14/2015 08:06 AM, Peter Zijlstra wrote:
> On Fri, Sep 11, 2015 at 03:27:44PM -0700, Davidlohr Bueso wrote:
>> On Fri, 11 Sep 2015, Waiman Long wrote:
>>
>>> @@ -46,7 +46,7 @@ static inline bool virt_queued_spin_lock(struct qspinlock *lock)
>>> 	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> 		return false;
>>>
>>> -	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
>>> +	while (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) != 0)
>>> 		cpu_relax();
>> This code has changed with Peter's recent ccas fix. And the whole virt_queued_spin_lock()
>> thing will now be under pv configs. So this doesn't apply to native code anymore, so it
>> looks like it should be dropped altogether.
> Yeah, it also doesn't make sense, this is x86 arch code, x86 cannot do
> cmpxchg_acquire. Then again, I suppose we could argue it's of
> documentation value..

Yes, it is there to be consistent with the change in the asm-generic
qspinlock.h. We can certainly skip that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-14 13:57   ` Peter Zijlstra
@ 2015-09-14 19:02     ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-14 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/14/2015 09:57 AM, Peter Zijlstra wrote:
> On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
>> +#define queued_spin_trylock(l)	pv_queued_spin_trylock_unfair(l)
>> +static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
>> +{
>> +	struct __qspinlock *l = (void *)lock;
>> +
>> +	if (READ_ONCE(l->locked))
>> +		return 0;
>> +	/*
>> +	 * Wait a bit here to ensure that an actively spinning vCPU has a fair
>> +	 * chance of getting the lock.
>> +	 */
>> +	cpu_relax();
>> +
>> +	return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
>> +}
>> +static inline int pvstat_trylock_unfair(struct qspinlock *lock)
>> +{
>> +	int ret = pv_queued_spin_trylock_unfair(lock);
>> +
>> +	if (ret)
>> +		pvstat_inc(pvstat_utrylock);
>> +	return ret;
>> +}
>> +#undef  queued_spin_trylock
>> +#define queued_spin_trylock(l)	pvstat_trylock_unfair(l)
> These aren't actually ever used...

The pvstat_trylock_unfair() function is within the CONFIG_QUEUED_LOCK_STAT
block. It will only be activated when that config option is set. Otherwise,
pv_queued_spin_trylock_unfair() will be used without any counting.

It is used to provide a count of how many unfair trylocks have
successfully taken the lock.
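
For anyone skimming the thread, a minimal user-space sketch of the wrapping
pattern being described (the names demo_lock, DEMO_LOCK_STAT and
utrylock_count are invented for illustration and are not the kernel's):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_lock { atomic_int locked; };

static atomic_long utrylock_count;	/* stands in for the pvstat_utrylock counter */

/* Always available: the unfair trylock itself, no statistics. */
static bool demo_trylock_unfair(struct demo_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

#ifdef DEMO_LOCK_STAT
/* Only layered on top when the stat config switch is enabled. */
static bool demo_stat_trylock_unfair(struct demo_lock *l)
{
	bool ret = demo_trylock_unfair(l);

	if (ret)
		atomic_fetch_add(&utrylock_count, 1);	/* count successful steals */
	return ret;
}
#define demo_trylock(l)	demo_stat_trylock_unfair(l)
#else
#define demo_trylock(l)	demo_trylock_unfair(l)	/* no counting overhead */
#endif

int main(void)
{
	struct demo_lock lock = { .locked = 0 };

	printf("first  trylock: %d\n", demo_trylock(&lock));	/* 1: lock taken */
	printf("second trylock: %d\n", demo_trylock(&lock));	/* 0: already held */
	return 0;
}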

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-14 14:00   ` Peter Zijlstra
@ 2015-09-14 19:15     ` Waiman Long
  2015-09-14 19:38       ` Waiman Long
  2015-09-15  8:24       ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-14 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/14/2015 10:00 AM, Peter Zijlstra wrote:
> On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
>> This patch allows one attempt for the lock waiter to steal the lock
>> when entering the PV slowpath.  This helps to reduce the performance
>> penalty caused by lock waiter preemption while not having much of
>> the downsides of a real unfair lock.
>> @@ -415,8 +458,12 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>
>>   	for (;; waitcnt++) {
>>   		for (loop = SPIN_THRESHOLD; loop; loop--) {
>> -			if (!READ_ONCE(l->locked))
>> -				return;
>> +			/*
>> +			 * Try to acquire the lock when it is free.
>> +			 */
>> +			if (!READ_ONCE(l->locked) &&
>> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
>> +				goto gotlock;
>>   			cpu_relax();
>>   		}
>>
> This isn't _once_, this is once per 'wakeup'. And note that interrupts
> unrelated to the kick can equally wake the vCPU up.
>

Oh! There is a minor bug: I shouldn't need a second READ_ONCE() call
here.

As this is the queue head, finding the lock free entitles the vCPU to
own the lock. However, because of lock stealing, I can't just write a 1
to the lock and assume everything is set. That is why I need to use
cmpxchg() to make sure that the queue head vCPU can actually get the
lock without it being stolen from underneath. I don't count that as lock
stealing, as the queue head is the rightful owner of the lock.

I am sorry, I should have added a comment to clarify that. Will do so
in the next update.
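
To illustrate that point (this is a simplified user-space analogue, not the
kernel code; demo_lock and demo_head_claim are invented names), the queue
head's claim step has to be a compare-and-swap rather than a plain store:

#include <stdatomic.h>
#include <stdbool.h>

struct demo_lock { atomic_int locked; };

/*
 * What the queue head does once it sees the lock free: it must still use
 * cmpxchg, because a stealer doing its one unfair trylock may have taken
 * the lock between our load and our store.  A blind store of 1 here could
 * leave two CPUs both believing they own the lock.
 */
static bool demo_head_claim(struct demo_lock *l)
{
	int expected = 0;

	if (atomic_load_explicit(&l->locked, memory_order_relaxed))
		return false;			/* still held, keep spinning */
	return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

int main(void)
{
	struct demo_lock lock = { .locked = 0 };

	return demo_head_claim(&lock) ? 0 : 1;	/* claims the free lock */
}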

 > void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 > {
 >     :
 >         /*
 >         * We touched a (possibly) cold cacheline in the per-cpu queue node;
 >         * attempt the trylock once more in the hope someone let go while we
 >         * weren't watching.
 >         */
 >        if (queued_spin_trylock(lock))
 >                goto release;

This is the only place where I consider lock stealing happens. Again, I 
should have a comment in pv_queued_spin_trylock_unfair() to say where it 
will be called.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-14 14:04   ` Peter Zijlstra
@ 2015-09-14 19:19     ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-14 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/14/2015 10:04 AM, Peter Zijlstra wrote:
> On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
>> This patch allows one attempt for the lock waiter to steal the lock
>> when entering the PV slowpath.  This helps to reduce the performance
>> penalty caused by lock waiter preemption while not having much of
>> the downsides of a real unfair lock.
>
>> @@ -416,7 +414,8 @@ queue:
>>   	 * does not imply a full barrier.
>>   	 *
>>   	 */
> If it really were once, like the Changelog says it is, then you could
> have simply added:
>
> 	if (pv_try_steal_lock(...))
> 		goto release;

My previous mail has clarified where the lock stealing happen. Will add 
the necessary comment to the patch.

> here, and not wrecked pv_wait_head() like you did. Note that if you do
> it like this, you also do not need to play games with the hash, because
> you'll never get into that situation.
>
>> -	pv_wait_head(lock, node);
>> +	if (pv_wait_head_and_lock(lock, node, tail))
>> +		goto release;
>>   	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
>>   		cpu_relax();
>>

Because we need to use an atomic op to get the lock, we can't use the
native logic to do the acquire. I know it is kind of hacky, but I don't
have a good alternative here.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-09-14 14:10   ` Peter Zijlstra
@ 2015-09-14 19:37     ` Waiman Long
  2015-09-15  8:38       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-14 19:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/14/2015 10:10 AM, Peter Zijlstra wrote:
> On Fri, Sep 11, 2015 at 02:37:38PM -0400, Waiman Long wrote:
>> In an overcommitted guest where some vCPUs have to be halted to make
>> forward progress in other areas, it is highly likely that a vCPU later
>> in the spinlock queue will be spinning while the ones earlier in the
>> queue would have been halted. The spinning in the later vCPUs is then
>> just a waste of precious CPU cycles because they are not going to
>> get the lock soon as the earlier ones have to be woken up and take
>> their turn to get the lock.
>>
>> This patch implements an adaptive spinning mechanism where the vCPU
>> will call pv_wait() if the following conditions are true:
>>
>>   1) the vCPU has not been halted before;
>>   2) the previous vCPU is not running.
> Why 1? For the mutex adaptive stuff we only care about the lock holder
> running, right?

The wait-early-once logic was there because of the kick-ahead patch, as
I didn't want a recently kicked vCPU near the head of the queue to go
back to sleep too early. However, without kick-ahead, a woken-up vCPU
should now be at the queue head. Indeed, we can remove that check and
simplify the logic.

BTW, the queue head vCPU at pv_wait_head_and_lock() doesn't wait early, 
it will spin the full threshold as there is no way for it to figure out 
if the lock holder is running or not.
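
For what it's worth, a rough user-space sketch of the queue-node side of
that decision (spin only while the previous vCPU appears to be running,
otherwise halt); the state values and the fake_pv_wait() hook are
placeholders invented for the example, not the patch's actual interface:

#include <stdatomic.h>

enum demo_vcpu_state { DEMO_VCPU_RUNNING, DEMO_VCPU_HALTED };

struct demo_node {
	atomic_int owner_state;	/* state of the vCPU owning this node */
	atomic_int locked;	/* set by our predecessor when it is our turn */
};

/* Placeholder for pv_wait(): a real implementation would hypercall out. */
static void fake_pv_wait(void) { }

/*
 * Spin on our own MCS node, but stop spinning early if the previous vCPU
 * in the queue is not running: we cannot get the lock before it does, so
 * further spinning only burns host CPU cycles.
 */
static void demo_wait_node(struct demo_node *self, struct demo_node *prev)
{
	int loop;

	for (loop = 0; loop < 1000; loop++) {		/* stands in for SPIN_THRESHOLD */
		if (atomic_load(&self->locked))
			return;				/* our turn arrived while spinning */
		if (atomic_load(&prev->owner_state) == DEMO_VCPU_HALTED)
			break;				/* wait early */
	}
	fake_pv_wait();					/* halt this vCPU */
}

int main(void)
{
	struct demo_node prev = { .owner_state = DEMO_VCPU_HALTED, .locked = 0 };
	struct demo_node self = { .owner_state = DEMO_VCPU_RUNNING, .locked = 0 };

	demo_wait_node(&self, &prev);	/* halts early, since prev is not running */
	return 0;
}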

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-14 19:15     ` Waiman Long
@ 2015-09-14 19:38       ` Waiman Long
  2015-09-15  8:24       ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-14 19:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/14/2015 03:15 PM, Waiman Long wrote:
> On 09/14/2015 10:00 AM, Peter Zijlstra wrote:
>> On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
>>> This patch allows one attempt for the lock waiter to steal the lock
>>> when entering the PV slowpath.  This helps to reduce the performance
>>> penalty caused by lock waiter preemption while not having much of
>>> the downsides of a real unfair lock.
>>> @@ -415,8 +458,12 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>>
>>>       for (;; waitcnt++) {
>>>           for (loop = SPIN_THRESHOLD; loop; loop--) {
>>> -            if (!READ_ONCE(l->locked))
>>> -                return;
>>> +            /*
>>> +             * Try to acquire the lock when it is free.
>>> +             */
>>> +            if (!READ_ONCE(l->locked) &&
>>> +               (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
>>> +                goto gotlock;
>>>               cpu_relax();
>>>           }
>>>
>> This isn't _once_, this is once per 'wakeup'. And note that interrupts
>> unrelated to the kick can equally wake the vCPU up.
>>
>
> Oh! There is a minor bug: I shouldn't need a second READ_ONCE() call
> here.

Oh! I misread the diff; the code was OK.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-09-14 15:25     ` Waiman Long
@ 2015-09-14 21:41       ` Davidlohr Bueso
  2015-09-15  3:47         ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2015-09-14 21:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On Mon, 14 Sep 2015, Waiman Long wrote:

>We can't use debugfs if we want to have per-cpu stats; we will have to
>use sysfs instead, which will require more code changes. It is certainly
>doable, but we have to choose between simplicity and performance
>overhead. Right now, I am assuming that the PV lockstat code is used
>primarily for debugging purposes and won't be enabled on production
>systems. If we want to have this capability in production systems, we
>will certainly need to change it to per-cpu stats and use sysfs instead.
>
>The original PV ticketlock code used debugfs and I was just following
>in its footsteps. Do you think it is worthwhile to have this capability
>available on production systems by default?

If we can prove that the overhead is small enough, and do it correctly
(ie see how we do vmstats), it would be _very_ useful data to have
enabled by default for debugging performance issues; methinks. But right
now we have nowhere near that kind of data, not even with this atomic
variant -- although I recall you did mention a workload in a previous
iteration (which would be good to have in the changelog).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-09-14 21:41       ` Davidlohr Bueso
@ 2015-09-15  3:47         ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-15  3:47 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch

On 09/14/2015 05:41 PM, Davidlohr Bueso wrote:
> On Mon, 14 Sep 2015, Waiman Long wrote:
>
>> We can't use debugfs if we want to have per-cpu stats; we will have to
>> use sysfs instead, which will require more code changes. It is certainly
>> doable, but we have to choose between simplicity and performance
>> overhead. Right now, I am assuming that the PV lockstat code is used
>> primarily for debugging purposes and won't be enabled on production
>> systems. If we want to have this capability in production systems, we
>> will certainly need to change it to per-cpu stats and use sysfs instead.
>>
>> The original PV ticketlock code used debugfs and I was just following
>> in its footsteps. Do you think it is worthwhile to have this capability
>> available on production systems by default?
>
> If we can prove that the overhead is small enough, and do it correctly
> (ie see how we do vmstats), it would be _very_ useful data to have
> enabled by default for debugging performance issues; methinks. But right
> now we have nowhere near that kind of data, not even with this atomic
> variant -- although I recall you did mention a workload in a previous
> iteration (which would be good to have in the changelog).

Using the per-cpu stats, the overhead should be pretty small as atomic 
instructions are not needed. I would probably need to encapsulate the 
stat code into another header file (e.g. qspinlock_pvstat.h) to avoid 
making the qspinlock_paravirt.h too complex. This will probably be a 
separate patch once this patch series can be merged.
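
To illustrate the per-cpu idea for anyone following along (per-CPU slots
with invented names here, not the eventual kernel patch): each CPU bumps
only its own plain counter on the fast path, and a rare reader sums the
slots, accepting a slightly stale total:

#include <stdio.h>

#define DEMO_NR_CPUS 4

/* One slot per CPU; each CPU only ever writes its own slot. */
static unsigned long demo_hash_hops[DEMO_NR_CPUS];

/* Fast path: a plain increment, no lock and no atomic instruction. */
static inline void demo_inc_hash_hops(int cpu)
{
	demo_hash_hops[cpu]++;
}

/* Slow, infrequent path (e.g. a debugfs/sysfs read): sum the slots. */
static unsigned long demo_read_hash_hops(void)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		sum += demo_hash_hops[cpu];
	return sum;		/* may be slightly stale, which is acceptable */
}

int main(void)
{
	demo_inc_hash_hops(0);
	demo_inc_hash_hops(2);
	printf("hash_hops_count=%lu\n", demo_read_hash_hops());
	return 0;
}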

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-14 19:15     ` Waiman Long
  2015-09-14 19:38       ` Waiman Long
@ 2015-09-15  8:24       ` Peter Zijlstra
  2015-09-15 15:29         ` Waiman Long
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-15  8:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Mon, Sep 14, 2015 at 03:15:20PM -0400, Waiman Long wrote:
> On 09/14/2015 10:00 AM, Peter Zijlstra wrote:
> >On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
> >>This patch allows one attempt for the lock waiter to steal the lock
                      ^^^

> >>when entering the PV slowpath.  This helps to reduce the performance
> >>penalty caused by lock waiter preemption while not having much of
> >>the downsides of a real unfair lock.

> >>@@ -415,8 +458,12 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> >>
> >>  	for (;; waitcnt++) {
> >>  		for (loop = SPIN_THRESHOLD; loop; loop--) {
> >>-			if (!READ_ONCE(l->locked))
> >>-				return;
> >>+			/*
> >>+			 * Try to acquire the lock when it is free.
> >>+			 */
> >>+			if (!READ_ONCE(l->locked) &&
> >>+			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
> >>+				goto gotlock;
> >>  			cpu_relax();
> >>  		}
> >>
> >This isn't _once_, this is once per 'wakeup'. And note that interrupts
> >unrelated to the kick can equally wake the vCPU up.

> > void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > {
> >     :
> >         /*
> >         * We touched a (possibly) cold cacheline in the per-cpu queue node;
> >         * attempt the trylock once more in the hope someone let go while we
> >         * weren't watching.
> >         */
> >        if (queued_spin_trylock(lock))
> >                goto release;
> 
> This is the only place where I consider lock stealing happens. Again, I
> should have a comment in pv_queued_spin_trylock_unfair() to say where it
> will be called.

But you're not adding that..

What you did add is a steal in pv_wait_head(), and it's not even once per
pv_wait_head, it's inside the spin loop (I read it wrong yesterday).

So that makes the entire Changelog complete crap. There isn't _one_
attempt, and there is absolutely no fairness left.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-09-14 19:37     ` Waiman Long
@ 2015-09-15  8:38       ` Peter Zijlstra
  2015-09-15 15:32         ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-15  8:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Mon, Sep 14, 2015 at 03:37:32PM -0400, Waiman Long wrote:
> BTW, the queue head vCPU at pv_wait_head_and_lock() doesn't wait early, it
> will spin the full threshold as there is no way for it to figure out if the
> lock holder is running or not.

We can know its cpu id, right? Surely we should then be able to figure
out if it's active or not, I'm thinking the KVM/Xen know these things
somewhere.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-15  8:24       ` Peter Zijlstra
@ 2015-09-15 15:29         ` Waiman Long
  2015-09-16 15:01           ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-15 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/15/2015 04:24 AM, Peter Zijlstra wrote:
> On Mon, Sep 14, 2015 at 03:15:20PM -0400, Waiman Long wrote:
>> On 09/14/2015 10:00 AM, Peter Zijlstra wrote:
>>> On Fri, Sep 11, 2015 at 02:37:37PM -0400, Waiman Long wrote:
>>>> This patch allows one attempt for the lock waiter to steal the lock
>                        ^^^
>
>>>> when entering the PV slowpath.  This helps to reduce the performance
>>>> penalty caused by lock waiter preemption while not having much of
>>>> the downsides of a real unfair lock.
>>>> @@ -415,8 +458,12 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>>>
>>>>   	for (;; waitcnt++) {
>>>>   		for (loop = SPIN_THRESHOLD; loop; loop--) {
>>>> -			if (!READ_ONCE(l->locked))
>>>> -				return;
>>>> +			/*
>>>> +			 * Try to acquire the lock when it is free.
>>>> +			 */
>>>> +			if (!READ_ONCE(l->locked) &&
>>>> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
>>>> +				goto gotlock;
>>>>   			cpu_relax();
>>>>   		}
>>>>
>>> This isn't _once_, this is once per 'wakeup'. And note that interrupts
>>> unrelated to the kick can equally wake the vCPU up.
>>> void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>> {
>>>      :
>>>          /*
>>>          * We touched a (possibly) cold cacheline in the per-cpu queue node;
>>>          * attempt the trylock once more in the hope someone let go while we
>>>          * weren't watching.
>>>          */
>>>         if (queued_spin_trylock(lock))
>>>                 goto release;
>> This is the only place where I consider lock stealing happens. Again, I
>> should have a comment in pv_queued_spin_trylock_unfair() to say where it
>> will be called.
> But you're not adding that..
>
> What you did add is a steal in pv_wait_head(), and it's not even once per
> pv_wait_head, it's inside the spin loop (I read it wrong yesterday).
>
> So that makes the entire Changelog complete crap. There isn't _one_
> attempt, and there is absolutely no fairness left.

Only the queue head vCPU will be in pv_wait_head() spinning to acquire 
the lock. The other vCPUs in the queue will still be spinning on their 
MCS nodes. The only competitors for the lock are those vCPUs that have
just entered the slowpath and execute the queued_spin_trylock() function
once before being queued. That is what I mean by each task having only
one chance of stealing the lock. Maybe the following code changes can
make this point clearer.

Cheers,
Longman

--------------------------------------------------------------------------------------------------------

--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -59,7 +59,8 @@ struct pv_node {
  /*
   * Allow one unfair trylock when entering the PV slowpath to reduce the
   * performance impact of lock waiter preemption (either explicitly via
- * pv_wait or implicitly via PLE).
+ * pv_wait or implicitly via PLE). This function will be called once when
+ * a lock waiter enter the slowpath before being queued.
   *
   * A little bit of unfairness here can improve performance without many
   * of the downsides of a real unfair lock.
@@ -72,8 +73,8 @@ static inline bool pv_queued_spin_trylock_unfair(struct qspinl
         if (READ_ONCE(l->locked))
                 return 0;
         /*
-        * Wait a bit here to ensure that an actively spinning vCPU has a fair
-        * chance of getting the lock.
+        * Wait a bit here to ensure that an actively spinning queue head vCPU
+        * has a fair chance of getting the lock.
          */
         cpu_relax();

@@ -504,14 +505,23 @@ static int pv_wait_head_and_lock(struct qspinlock *lock,
                  */
                 WRITE_ONCE(pn->state, vcpu_running);

-               for (loop = SPIN_THRESHOLD; loop; loop--) {
+               loop = SPIN_THRESHOLD;
+               while (loop) {
                         /*
-                        * Try to acquire the lock when it is free.
+                        * Spin until the lock is free
                          */
-                       if (!READ_ONCE(l->locked) &&
-                          (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0))
+                       for (; loop && READ_ONCE(l->locked); loop--)
+                               cpu_relax();
+                       /*
+                        * Seeing the lock is free, this queue head vCPU is
+                        * the rightful next owner of the lock. However, the
+                        * lock may have just been stolen by another task which
+                        * has entered the slowpath. So we need to use atomic
+                        * operation to make sure that we really get the lock.
+                        * Otherwise, we have to wait again.
+                        */
+                       if (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
                                 goto gotlock;
-                       cpu_relax();
                 }

                 if (!lp) { /* ONCE */



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-09-15  8:38       ` Peter Zijlstra
@ 2015-09-15 15:32         ` Waiman Long
  2015-09-16 15:03           ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2015-09-15 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/15/2015 04:38 AM, Peter Zijlstra wrote:
> On Mon, Sep 14, 2015 at 03:37:32PM -0400, Waiman Long wrote:
>> BTW, the queue head vCPU at pv_wait_head_and_lock() doesn't wait early, it
>> will spin the full threshold as there is no way for it to figure out if the
>> lock holder is running or not.
> We can know its cpu id, right? Surely we should then be able to figure
> out if it's active or not, I'm thinking the KVM/Xen know these things
> somewhere.

We can guess the lock holder's cpu id by peeking at the previous MCS
node. However, if the current lock holder got the lock without entering
the queue, or it got the lock by stealing, we won't have any information
about which CPU it is.
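
A small sketch of that "peek at the previous node" idea, purely for
illustration (the node layout and stored cpu field are simplified
stand-ins; as noted above, the guess is only available when the holder
actually went through the queue):

#include <stddef.h>

struct demo_pv_node {
	int cpu;			/* vCPU that owns this queue node */
	struct demo_pv_node *prev;	/* predecessor in the wait queue */
};

/*
 * Best-effort guess at the lock holder's cpu: the vCPU queued just ahead
 * of us is usually the one now holding (or about to hold) the lock.
 * Returns -1 when there is nothing to peek at, e.g. the lock was taken by
 * a stealer or by a fast-path acquirer that never queued.
 */
static int demo_guess_holder_cpu(const struct demo_pv_node *node)
{
	if (node == NULL || node->prev == NULL)
		return -1;
	return node->prev->cpu;
}

int main(void)
{
	struct demo_pv_node prev = { .cpu = 3, .prev = NULL };
	struct demo_pv_node self = { .cpu = 5, .prev = &prev };

	return demo_guess_holder_cpu(&self) == 3 ? 0 : 1;
}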

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-15 15:29         ` Waiman Long
@ 2015-09-16 15:01           ` Peter Zijlstra
  2015-09-17 15:08             ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-16 15:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Tue, Sep 15, 2015 at 11:29:14AM -0400, Waiman Long wrote:

> Only the queue head vCPU will be in pv_wait_head() spinning to acquire the
> lock.

But what will guarantee fwd progress for the lock waiter that is the head?

Suppose CPU0 becomes head and enters the /* claim the lock */ loop.

Then CPU1 comes in, steals it in pv_wait_head(). CPU1 releases, CPU1
re-acquires and _again_ steals in pv_wait_head(), etc..

All the while CPU0 doesn't go anywhere.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-09-15 15:32         ` Waiman Long
@ 2015-09-16 15:03           ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2015-09-16 15:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Tue, Sep 15, 2015 at 11:32:14AM -0400, Waiman Long wrote:
> On 09/15/2015 04:38 AM, Peter Zijlstra wrote:
> >On Mon, Sep 14, 2015 at 03:37:32PM -0400, Waiman Long wrote:
> >>BTW, the queue head vCPU at pv_wait_head_and_lock() doesn't wait early, it
> >>will spin the full threshold as there is no way for it to figure out if the
> >>lock holder is running or not.
> >We can know its cpu id, right? Surely we should then be able to figure
> >out if it's active or not, I'm thinking the KVM/Xen know these things
> >somewhere.
> 
> We can guess the lock holder's cpu id by peeking at the previous MCS
> node. However, if the current lock holder got the lock without entering
> the queue, or it got the lock by stealing, we won't have any information
> about which CPU it is.

Ah, yes I see, we do not have storage for the lock holder. Yes you're
right.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-09-16 15:01           ` Peter Zijlstra
@ 2015-09-17 15:08             ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2015-09-17 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 09/16/2015 11:01 AM, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 11:29:14AM -0400, Waiman Long wrote:
>
>> Only the queue head vCPU will be in pv_wait_head() spinning to acquire the
>> lock.
> But what will guarantee fwd progress for the lock waiter that is the head?
>
> Suppose CPU0 becomes head and enters the /* claim the lock */ loop.
>
> Then CPU1 comes in, steals it in pv_wait_head(). CPU1 releases, CPU1
> re-acquires and _again_ steals in pv_wait_head(), etc..
>
> All the while CPU0 doesn't go anywhere.
>

That can't happen. For a given lock, there can only be one queue head
spinning on the lock at any instant in time. If CPU0 was the head,
another CPU could not become the head until CPU0 got the lock and passed
the MCS lock bit to the next one in the queue. As I said in an earlier
mail, the only place where lock stealing can happen is in the
pv_queued_spin_trylock_unfair() function, where I purposely inserted a
cpu_relax() to give an actively spinning queue head CPU a better chance
of getting the lock. Once a CPU enters the queue, it won't try to
acquire the lock until it becomes the head, and there is one and only
one head.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [tip:locking/core] locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
  2015-09-11 18:37 ` [PATCH v6 2/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL Waiman Long
@ 2015-09-18  8:50   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: tip-bot for Waiman Long @ 2015-09-18  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, akpm, scott.norton, torvalds, paulmck, dave, peterz,
	doug.hatch, Waiman.Long, tglx, mingo, hpa

Commit-ID:  93edc8bd7750ff3cae088bfca453ea73dc9004a4
Gitweb:     http://git.kernel.org/tip/93edc8bd7750ff3cae088bfca453ea73dc9004a4
Author:     Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Fri, 11 Sep 2015 14:37:34 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 18 Sep 2015 09:27:29 +0200

locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL

If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
The extra check at the end of __pv_queued_spin_unlock() is unnecessary
and can be removed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441996658-62854-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock_paravirt.h | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index c8e6e9a..f0450ff 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -267,7 +267,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 		}
 
 		if (!lp) { /* ONCE */
-			WRITE_ONCE(pn->state, vcpu_hashed);
 			lp = pv_hash(lock, pn);
 
 			/*
@@ -275,11 +274,9 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
 			 * we'll be sure to be able to observe our hash entry.
 			 *
-			 *   [S] pn->state
 			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
 			 *       MB                           RMB
 			 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
-			 *                                [L] pn->state
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
@@ -364,8 +361,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 * vCPU is harmless other than the additional latency in completing
 	 * the unlock.
 	 */
-	if (READ_ONCE(node->state) == vcpu_hashed)
-		pv_kick(node->cpu);
+	pv_kick(node->cpu);
 }
 /*
  * Include the architecture specific callee-save thunk of the

^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-09-18  8:51 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-11 18:37 [PATCH v6 0/6] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
2015-09-11 18:37 ` [PATCH v6 1/6] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
2015-09-11 22:27   ` Davidlohr Bueso
2015-09-14 12:06     ` Peter Zijlstra
2015-09-14 18:40       ` Waiman Long
2015-09-14 15:16     ` Waiman Long
2015-09-11 18:37 ` [PATCH v6 2/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL Waiman Long
2015-09-18  8:50   ` [tip:locking/core] locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL tip-bot for Waiman Long
2015-09-11 18:37 ` [PATCH v6 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
2015-09-11 18:37 ` [PATCH v6 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
2015-09-11 23:13   ` Davidlohr Bueso
2015-09-14 15:25     ` Waiman Long
2015-09-14 21:41       ` Davidlohr Bueso
2015-09-15  3:47         ` Waiman Long
2015-09-11 18:37 ` [PATCH v6 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-09-14 13:57   ` Peter Zijlstra
2015-09-14 19:02     ` Waiman Long
2015-09-14 14:00   ` Peter Zijlstra
2015-09-14 19:15     ` Waiman Long
2015-09-14 19:38       ` Waiman Long
2015-09-15  8:24       ` Peter Zijlstra
2015-09-15 15:29         ` Waiman Long
2015-09-16 15:01           ` Peter Zijlstra
2015-09-17 15:08             ` Waiman Long
2015-09-14 14:04   ` Peter Zijlstra
2015-09-14 19:19     ` Waiman Long
2015-09-11 18:37 ` [PATCH v6 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
2015-09-14 14:10   ` Peter Zijlstra
2015-09-14 19:37     ` Waiman Long
2015-09-15  8:38       ` Peter Zijlstra
2015-09-15 15:32         ` Waiman Long
2015-09-16 15:03           ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.