Linux-ARM-Kernel Archive mirror
* [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64
@ 2015-07-07 17:24 Will Deacon
  2015-07-07 17:24 ` [PATCH 1/9] locking/qrwlock: include <linux/spinlock.h> for arch_spin_{lock, unlock} Will Deacon
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

Hi all,

This patch series gently massages the qrwlock code so as to make it
usable on arm64 as a replacement for our current implementation. The
last two patches actually do the arch conversion and are included here
for reference.

Most of the work is removing redundant memory barriers from the current
implementation and adding hooks to the contended paths to allow us to
avoid busy-waiting.
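
As a concrete example of what those hooks buy us, the arm64 override
added in patch 9 parks the CPU in WFE instead of busy-waiting (copied
here from that patch for reference):

#define arch_qrwlock_relax(lock)					\
do {									\
	asm volatile("ldxr	wzr, %0" :: "Q" (*(u32 *)&lock->cnts));	\
	wfe();								\
} while (0)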

Implemented on top of -tip (i.e. the patches queued there from Waiman),
but this is likely to conflict with his latest series.

All feedback welcome,

Will

--->8

Will Deacon (9):
  locking/qrwlock: include <linux/spinlock.h> for
    arch_spin_{lock,unlock}
  locking/qrwlock: avoid redundant atomic_add_return on
    read_lock_slowpath
  locking/qrwlock: tidy up rspin_until_writer_unlock
  locking/qrwlock: implement queue_write_unlock using smp_store_release
  locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path
  locking/qrwlock: allow architectures to hook in to contended paths
  locking/qrwlock: expose internal lock structure in qrwlock definition
  arm64: cmpxchg: implement cmpxchg_relaxed
  arm64: locking: replace read/write locks with generic qrwlock code

 arch/arm64/Kconfig                      |   1 +
 arch/arm64/include/asm/cmpxchg.h        |   6 +-
 arch/arm64/include/asm/qrwlock.h        |  30 ++++++++
 arch/arm64/include/asm/spinlock.h       | 122 +-------------------------------
 arch/arm64/include/asm/spinlock_types.h |  10 +--
 arch/x86/include/asm/qrwlock.h          |  10 ---
 include/asm-generic/qrwlock.h           |  11 ++-
 include/asm-generic/qrwlock_types.h     |  17 ++++-
 kernel/locking/qrwlock.c                |  60 +++++++---------
 9 files changed, 81 insertions(+), 186 deletions(-)
 create mode 100644 arch/arm64/include/asm/qrwlock.h

-- 
2.1.4

* [PATCH 1/9] locking/qrwlock: include <linux/spinlock.h> for arch_spin_{lock, unlock}
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-07 17:24 ` [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath Will Deacon
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

locking/qrwlock.c makes use of arch_spin_{lock,unlock}, so ensure that
we #include <linux/spinlock.h> to get their definitions.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qrwlock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index d9c36c5f5711..96b77d1e0545 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -20,6 +20,7 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/spinlock.h>
 #include <asm/qrwlock.h>
 
 /*
-- 
2.1.4

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
  2015-07-07 17:24 ` [PATCH 1/9] locking/qrwlock: include <linux/spinlock.h> for arch_spin_{lock, unlock} Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-07 17:51   ` Waiman Long
  2015-07-07 17:24 ` [PATCH 3/9] locking/qrwlock: tidy up rspin_until_writer_unlock Will Deacon
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

When a slow-path reader gets to the front of the wait queue outside of
interrupt context, it waits for any writers to drain, increments the
reader count and again waits for any additional writers that may have
snuck in between the initial check and the increment.

Given that this second check is performed with acquire semantics, there
is no need to perform the increment using atomic_add_return, which acts
as a full barrier.

This patch changes the slow-path code to use smp_load_acquire and
atomic_add instead of atomic_add_return. Since the check only involves
the writer count, we can perform the acquire after the add.
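
In other words, the queue-head reader path ends up looking like this
(annotated sketch of the change in the diff below):

	/* Bump the reader count; no ordering is required at this point. */
	atomic_add(_QR_BIAS, &lock->cnts);

	/*
	 * The acquire on the re-load (and on the loads inside
	 * rspin_until_writer_unlock()) ensures that the critical section
	 * cannot start before the writer is observed to have gone away.
	 */
	cnts = smp_load_acquire((u32 *)&lock->cnts);
	rspin_until_writer_unlock(lock, cnts);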

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qrwlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 96b77d1e0545..4e29bef688ac 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -98,7 +98,8 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	while (atomic_read(&lock->cnts) & _QW_WMASK)
 		cpu_relax_lowlatency();
 
-	cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
+	atomic_add(_QR_BIAS, &lock->cnts);
+	cnts = smp_load_acquire((u32 *)&lock->cnts);
 	rspin_until_writer_unlock(lock, cnts);
 
 	/*
-- 
2.1.4

* [PATCH 3/9] locking/qrwlock: tidy up rspin_until_writer_unlock
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
  2015-07-07 17:24 ` [PATCH 1/9] locking/qrwlock: include <linux/spinlock.h> for arch_spin_{lock, unlock} Will Deacon
  2015-07-07 17:24 ` [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-07 17:24 ` [PATCH 4/9] locking/qrwlock: implement queue_write_unlock using smp_store_release Will Deacon
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

rspin_until_writer_unlock is used by readers to spin on a qrwlock until
it is no longer held for write.

This patch tidies up the function so that (a) the comments match the
code and (b) we avoid passing the redundant initial lock value.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qrwlock.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 4e29bef688ac..a1b63df19a04 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -44,16 +44,17 @@ struct __qrwlock {
 };
 
 /**
- * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * rspin_until_writer_unlock - spin until writer is gone
  * @lock  : Pointer to queue rwlock structure
- * @writer: Current queue rwlock writer status byte
  *
  * In interrupt context or at the head of the queue, the reader will just
- * increment the reader count & wait until the writer releases the lock.
+ * wait until the writer releases the lock.
  */
 static __always_inline void
-rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
+rspin_until_writer_unlock(struct qrwlock *lock)
 {
+	u32 cnts = smp_load_acquire((u32 *)&lock->cnts);
+
 	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
 		cpu_relax_lowlatency();
 		cnts = smp_load_acquire((u32 *)&lock->cnts);
@@ -74,11 +75,11 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 		/*
 		 * Readers in interrupt context will get the lock immediately
 		 * if the writer is just waiting (not holding the lock yet).
-		 * The rspin_until_writer_unlock() function returns immediately
-		 * in this case. Otherwise, they will spin until the lock
-		 * is available without waiting in the queue.
+		 * Otherwise, they will spin until the lock is available
+		 * without waiting in the queue.
 		 */
-		rspin_until_writer_unlock(lock, cnts);
+		if ((cnts & _QW_WMASK) == _QW_LOCKED)
+			rspin_until_writer_unlock(lock);
 		return;
 	}
 	atomic_sub(_QR_BIAS, &lock->cnts);
@@ -99,8 +100,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 		cpu_relax_lowlatency();
 
 	atomic_add(_QR_BIAS, &lock->cnts);
-	cnts = smp_load_acquire((u32 *)&lock->cnts);
-	rspin_until_writer_unlock(lock, cnts);
+	rspin_until_writer_unlock(lock);
 
 	/*
 	 * Signal the next one in queue to become queue head
-- 
2.1.4

* [PATCH 4/9] locking/qrwlock: implement queue_write_unlock using smp_store_release
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
                   ` (2 preceding siblings ...)
  2015-07-07 17:24 ` [PATCH 3/9] locking/qrwlock: tidy up rspin_until_writer_unlock Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-08 10:00   ` Peter Zijlstra
  2015-07-07 17:24 ` [PATCH 5/9] locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path Will Deacon
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

smp_store_release supports byte accesses, so use that in writer unlock
and remove the conditional macro override.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qrwlock.h | 10 ----------
 include/asm-generic/qrwlock.h  |  9 +--------
 2 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/qrwlock.h b/arch/x86/include/asm/qrwlock.h
index a8810bf135ab..c537cbb038a7 100644
--- a/arch/x86/include/asm/qrwlock.h
+++ b/arch/x86/include/asm/qrwlock.h
@@ -2,16 +2,6 @@
 #define _ASM_X86_QRWLOCK_H
 
 #include <asm-generic/qrwlock_types.h>
-
-#ifndef CONFIG_X86_PPRO_FENCE
-#define queued_write_unlock queued_write_unlock
-static inline void queued_write_unlock(struct qrwlock *lock)
-{
-        barrier();
-        ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;
-}
-#endif
-
 #include <asm-generic/qrwlock.h>
 
 #endif /* _ASM_X86_QRWLOCK_H */
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index deb9e8b0eb9e..eb673dde8879 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -134,21 +134,14 @@ static inline void queued_read_unlock(struct qrwlock *lock)
 	atomic_sub(_QR_BIAS, &lock->cnts);
 }
 
-#ifndef queued_write_unlock
 /**
  * queued_write_unlock - release write lock of a queue rwlock
  * @lock : Pointer to queue rwlock structure
  */
 static inline void queued_write_unlock(struct qrwlock *lock)
 {
-	/*
-	 * If the writer field is atomic, it can be cleared directly.
-	 * Otherwise, an atomic subtraction will be used to clear it.
-	 */
-	smp_mb__before_atomic();
-	atomic_sub(_QW_LOCKED, &lock->cnts);
+	smp_store_release((u8 *)&lock->cnts, 0);
 }
-#endif
 
 /*
  * Remapping rwlock architecture specific functions to the corresponding
-- 
2.1.4

* [PATCH 5/9] locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
                   ` (3 preceding siblings ...)
  2015-07-07 17:24 ` [PATCH 4/9] locking/qrwlock: implement queue_write_unlock using smp_store_release Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-08 10:05   ` Peter Zijlstra
  2015-07-07 17:24 ` [PATCH 6/9] locking/qrwlock: allow architectures to hook in to contended paths Will Deacon
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

When transitioning from unlocked to locked (for write), we notify readers
of our presence by setting the _QW_WAITING flag using a cmpxchg.

Given that we use another cmpxchg to transition to the _QW_LOCKED state,
we can drop the barriers from the intermediate step by offering
architectures the chance to implement cmpxchg_relaxed, in a similar
manner to the cmpxchg64_relaxed macro in the lockref code.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qrwlock.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index a1b63df19a04..e3c51c4635d3 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -113,6 +113,10 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);
  * queued_write_lock_slowpath - acquire write lock of a queue rwlock
  * @lock : Pointer to queue rwlock structure
  */
+#ifndef cmpxchg_relaxed
+# define cmpxchg_relaxed	cmpxchg
+#endif
+
 void queued_write_lock_slowpath(struct qrwlock *lock)
 {
 	u32 cnts;
@@ -133,7 +137,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 		struct __qrwlock *l = (struct __qrwlock *)lock;
 
 		if (!READ_ONCE(l->wmode) &&
-		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
+		   (cmpxchg_relaxed(&l->wmode, 0, _QW_WAITING) == 0))
 			break;
 
 		cpu_relax_lowlatency();
-- 
2.1.4

* [PATCH 6/9] locking/qrwlock: allow architectures to hook in to contended paths
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
                   ` (4 preceding siblings ...)
  2015-07-07 17:24 ` [PATCH 5/9] locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-08 10:06   ` Peter Zijlstra
  2015-07-07 17:24 ` [PATCH 7/9] locking/qrwlock: expose internal lock structure in qrwlock definition Will Deacon
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

When contended, architectures may be able to reduce the polling overhead
in ways which aren't expressible using a simple relax() primitive.

This patch allows architectures to override the use of
cpu_relax_lowlatency() in the qrwlock code and also implement their own
unlock macros in case explicit signalling is required to wake up a
`relaxed' CPU spinning on an unlock event.
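
For example, an architecture whose relax hook drops the CPU into a
low-power wait state might pair it with an unlock that raises an explicit
wake-up event. A purely hypothetical sketch (arch_qrwlock_wake() is an
invented name; the arm64 port in patch 9 only needs to override the relax
hook):

#define queued_write_unlock queued_write_unlock
static inline void queued_write_unlock(struct qrwlock *lock)
{
	smp_store_release((u8 *)&lock->cnts, 0);
	arch_qrwlock_wake(lock);	/* hypothetical wake-up primitive */
}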

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/qrwlock.h |  4 ++++
 kernel/locking/qrwlock.c      | 12 ++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index eb673dde8879..dbaac5f2af8f 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -125,6 +125,7 @@ static inline void queued_write_lock(struct qrwlock *lock)
  * queued_read_unlock - release read lock of a queue rwlock
  * @lock : Pointer to queue rwlock structure
  */
+#ifndef queued_read_unlock
 static inline void queued_read_unlock(struct qrwlock *lock)
 {
 	/*
@@ -133,15 +134,18 @@ static inline void queued_read_unlock(struct qrwlock *lock)
 	smp_mb__before_atomic();
 	atomic_sub(_QR_BIAS, &lock->cnts);
 }
+#endif
 
 /**
  * queued_write_unlock - release write lock of a queue rwlock
  * @lock : Pointer to queue rwlock structure
  */
+#ifndef queued_write_unlock
 static inline void queued_write_unlock(struct qrwlock *lock)
 {
 	smp_store_release((u8 *)&lock->cnts, 0);
 }
+#endif
 
 /*
  * Remapping rwlock architecture specific functions to the corresponding
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index e3c51c4635d3..45fbb48f83f6 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -23,6 +23,10 @@
 #include <linux/spinlock.h>
 #include <asm/qrwlock.h>
 
+#ifndef arch_qrwlock_relax
+# define arch_qrwlock_relax(lock)	cpu_relax_lowlatency()
+#endif
+
 /*
  * This internal data structure is used for optimizing access to some of
  * the subfields within the atomic_t cnts.
@@ -56,7 +60,7 @@ rspin_until_writer_unlock(struct qrwlock *lock)
 	u32 cnts = smp_load_acquire((u32 *)&lock->cnts);
 
 	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
-		cpu_relax_lowlatency();
+		arch_qrwlock_relax(lock);
 		cnts = smp_load_acquire((u32 *)&lock->cnts);
 	}
 }
@@ -97,7 +101,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	 * to make sure that the write lock isn't taken.
 	 */
 	while (atomic_read(&lock->cnts) & _QW_WMASK)
-		cpu_relax_lowlatency();
+		arch_qrwlock_relax(lock);
 
 	atomic_add(_QR_BIAS, &lock->cnts);
 	rspin_until_writer_unlock(lock);
@@ -140,7 +144,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 		   (cmpxchg_relaxed(&l->wmode, 0, _QW_WAITING) == 0))
 			break;
 
-		cpu_relax_lowlatency();
+		arch_qrwlock_relax(lock);
 	}
 
 	/* When no more readers, set the locked flag */
@@ -151,7 +155,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 				    _QW_LOCKED) == _QW_WAITING))
 			break;
 
-		cpu_relax_lowlatency();
+		arch_qrwlock_relax(lock);
 	}
 unlock:
 	arch_spin_unlock(&lock->lock);
-- 
2.1.4

* [PATCH 7/9] locking/qrwlock: expose internal lock structure in qrwlock definition
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
                   ` (5 preceding siblings ...)
  2015-07-07 17:24 ` [PATCH 6/9] locking/qrwlock: allow architectures to hook in to contended paths Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-07 17:24 ` [PATCH 8/9] arm64: cmpxchg: implement cmpxchg_relaxed Will Deacon
  2015-07-07 17:24 ` [PATCH 9/9] arm64: locking: replace read/write locks with generic qrwlock code Will Deacon
  8 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

In order for a writer to unlock correctly on big-endian machines, we
need to expose the internal layout of the qrwlock structure so that
we also zero the correct sub-field.
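
Concretely, the current writer unlock zeroes the first byte of the lock
word, which on a big-endian machine holds part of the reader count rather
than the writer mode byte:

	smp_store_release((u8 *)&lock->cnts, 0);	/* first byte: wrong on big-endian */
	smp_store_release(&lock->wmode, 0);		/* named sub-field: correct everywhere */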

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/qrwlock.h       |  2 +-
 include/asm-generic/qrwlock_types.h | 17 ++++++++++++++++-
 kernel/locking/qrwlock.c            | 26 ++------------------------
 3 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index dbaac5f2af8f..25e97942726a 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -143,7 +143,7 @@ static inline void queued_read_unlock(struct qrwlock *lock)
 #ifndef queued_write_unlock
 static inline void queued_write_unlock(struct qrwlock *lock)
 {
-	smp_store_release((u8 *)&lock->cnts, 0);
+	smp_store_release(&lock->wmode, 0);
 }
 #endif
 
diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 4d76f24df518..4c4f87fcf720 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -2,14 +2,29 @@
 #define __ASM_GENERIC_QRWLOCK_TYPES_H
 
 #include <linux/types.h>
+#include <asm/byteorder.h>
 #include <asm/spinlock_types.h>
 
 /*
  * The queue read/write lock data structure
+ *
+ * The 32-bit count is divided into an 8-bit writer mode byte (least
+ * significant) and a 24-bit reader count.
  */
 
 typedef struct qrwlock {
-	atomic_t		cnts;
+	union {
+		atomic_t	cnts;
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u8	wmode;		/* Writer mode   */
+			u8	rcnts[3];	/* Reader counts */
+#else
+			u8	rcnts[3];	/* Reader counts */
+			u8	wmode;		/* Writer mode   */
+#endif
+		};
+	};
 	arch_spinlock_t		lock;
 } arch_rwlock_t;
 
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 45fbb48f83f6..6ae1ae321897 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -27,26 +27,6 @@
 # define arch_qrwlock_relax(lock)	cpu_relax_lowlatency()
 #endif
 
-/*
- * This internal data structure is used for optimizing access to some of
- * the subfields within the atomic_t cnts.
- */
-struct __qrwlock {
-	union {
-		atomic_t cnts;
-		struct {
-#ifdef __LITTLE_ENDIAN
-			u8 wmode;	/* Writer mode   */
-			u8 rcnts[3];	/* Reader counts */
-#else
-			u8 rcnts[3];	/* Reader counts */
-			u8 wmode;	/* Writer mode   */
-#endif
-		};
-	};
-	arch_spinlock_t	lock;
-};
-
 /**
  * rspin_until_writer_unlock - spin until writer is gone
  * @lock  : Pointer to queue rwlock structure
@@ -138,10 +118,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	 * or wait for a previous writer to go away.
 	 */
 	for (;;) {
-		struct __qrwlock *l = (struct __qrwlock *)lock;
-
-		if (!READ_ONCE(l->wmode) &&
-		   (cmpxchg_relaxed(&l->wmode, 0, _QW_WAITING) == 0))
+		if (!READ_ONCE(lock->wmode) &&
+		   (cmpxchg_relaxed(&lock->wmode, 0, _QW_WAITING) == 0))
 			break;
 
 		arch_qrwlock_relax(lock);
-- 
2.1.4

* [PATCH 8/9] arm64: cmpxchg: implement cmpxchg_relaxed
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
                   ` (6 preceding siblings ...)
  2015-07-07 17:24 ` [PATCH 7/9] locking/qrwlock: expose internal lock structure in qrwlock definition Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  2015-07-07 17:24 ` [PATCH 9/9] arm64: locking: replace read/write locks with generic qrwlock code Will Deacon
  8 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

cmpxchg_relaxed is an SMP-safe but barrier-less cmpxchg, which is exactly
what our cmpxchg_local implementation already provides on arm64, so the
two can share the same definition.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/cmpxchg.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index d8c25b7b18fb..317bb7f6f92a 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -219,7 +219,7 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
 	__ret; \
 })
 
-#define cmpxchg_local(ptr, o, n) \
+#define cmpxchg_relaxed(ptr, o, n) \
 ({ \
 	__typeof__(*(ptr)) __ret; \
 	__ret = (__typeof__(*(ptr))) \
@@ -228,6 +228,8 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
 	__ret; \
 })
 
+#define cmpxchg_local	cmpxchg_relaxed
+
 #define cmpxchg_double(ptr1, ptr2, o1, o2, n1, n2) \
 ({\
 	int __ret;\
@@ -274,6 +276,6 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
 #define cmpxchg64(ptr,o,n)		cmpxchg((ptr),(o),(n))
 #define cmpxchg64_local(ptr,o,n)	cmpxchg_local((ptr),(o),(n))
 
-#define cmpxchg64_relaxed(ptr,o,n)	cmpxchg_local((ptr),(o),(n))
+#define cmpxchg64_relaxed(ptr,o,n)	cmpxchg_relaxed((ptr),(o),(n))
 
 #endif	/* __ASM_CMPXCHG_H */
-- 
2.1.4

* [PATCH 9/9] arm64: locking: replace read/write locks with generic qrwlock code
  2015-07-07 17:24 [PATCH 0/9] locking/qrwlock: get qrwlocks up and running on arm64 Will Deacon
                   ` (7 preceding siblings ...)
  2015-07-07 17:24 ` [PATCH 8/9] arm64: cmpxchg: implement cmpxchg_relaxed Will Deacon
@ 2015-07-07 17:24 ` Will Deacon
  8 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-07 17:24 UTC
  To: linux-arm-kernel

The queued rwlock implementation in asm-generic is actually pretty
decent, so remove our homebrew (unfair) rwlock implementation in
favour of the generic code.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig                      |   1 +
 arch/arm64/include/asm/qrwlock.h        |  30 ++++++++
 arch/arm64/include/asm/spinlock.h       | 122 +-------------------------------
 arch/arm64/include/asm/spinlock_types.h |  10 +--
 4 files changed, 33 insertions(+), 130 deletions(-)
 create mode 100644 arch/arm64/include/asm/qrwlock.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0f6edb14b7e4..b4ce331a43c8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -9,6 +9,7 @@ config ARM64
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_WANT_OPTIONAL_GPIOLIB
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
diff --git a/arch/arm64/include/asm/qrwlock.h b/arch/arm64/include/asm/qrwlock.h
new file mode 100644
index 000000000000..4b4a2aae4be1
--- /dev/null
+++ b/arch/arm64/include/asm/qrwlock.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (C) 2015 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __ASM_QRWLOCK_H
+#define __ASM_QRWLOCK_H
+
+#define arch_read_lock_flags(lock, flags)	arch_read_lock(lock)
+#define arch_write_lock_flags(lock, flags)	arch_write_lock(lock)
+
+#define arch_qrwlock_relax(lock)					\
+do {									\
+	asm volatile("ldxr	wzr, %0" :: "Q" (*(u32 *)&lock->cnts));	\
+	wfe();								\
+} while (0)
+
+#include <asm-generic/qrwlock.h>
+
+#endif /* __ASM_QRWLOCK_H */
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index cee128732435..cb04e479470a 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -109,126 +109,6 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
-/*
- * Write lock implementation.
- *
- * Write locks set bit 31. Unlocking, is done by writing 0 since the lock is
- * exclusively held.
- *
- * The memory barriers are implicit with the load-acquire and store-release
- * instructions.
- */
-
-static inline void arch_write_lock(arch_rwlock_t *rw)
-{
-	unsigned int tmp;
-
-	asm volatile(
-	"	sevl\n"
-	"1:	wfe\n"
-	"2:	ldaxr	%w0, %1\n"
-	"	cbnz	%w0, 1b\n"
-	"	stxr	%w0, %w2, %1\n"
-	"	cbnz	%w0, 2b\n"
-	: "=&r" (tmp), "+Q" (rw->lock)
-	: "r" (0x80000000)
-	: "memory");
-}
-
-static inline int arch_write_trylock(arch_rwlock_t *rw)
-{
-	unsigned int tmp;
-
-	asm volatile(
-	"	ldaxr	%w0, %1\n"
-	"	cbnz	%w0, 1f\n"
-	"	stxr	%w0, %w2, %1\n"
-	"1:\n"
-	: "=&r" (tmp), "+Q" (rw->lock)
-	: "r" (0x80000000)
-	: "memory");
-
-	return !tmp;
-}
-
-static inline void arch_write_unlock(arch_rwlock_t *rw)
-{
-	asm volatile(
-	"	stlr	%w1, %0\n"
-	: "=Q" (rw->lock) : "r" (0) : "memory");
-}
-
-/* write_can_lock - would write_trylock() succeed? */
-#define arch_write_can_lock(x)		((x)->lock == 0)
-
-/*
- * Read lock implementation.
- *
- * It exclusively loads the lock value, increments it and stores the new value
- * back if positive and the CPU still exclusively owns the location. If the
- * value is negative, the lock is already held.
- *
- * During unlocking there may be multiple active read locks but no write lock.
- *
- * The memory barriers are implicit with the load-acquire and store-release
- * instructions.
- */
-static inline void arch_read_lock(arch_rwlock_t *rw)
-{
-	unsigned int tmp, tmp2;
-
-	asm volatile(
-	"	sevl\n"
-	"1:	wfe\n"
-	"2:	ldaxr	%w0, %2\n"
-	"	add	%w0, %w0, #1\n"
-	"	tbnz	%w0, #31, 1b\n"
-	"	stxr	%w1, %w0, %2\n"
-	"	cbnz	%w1, 2b\n"
-	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
-	:
-	: "memory");
-}
-
-static inline void arch_read_unlock(arch_rwlock_t *rw)
-{
-	unsigned int tmp, tmp2;
-
-	asm volatile(
-	"1:	ldxr	%w0, %2\n"
-	"	sub	%w0, %w0, #1\n"
-	"	stlxr	%w1, %w0, %2\n"
-	"	cbnz	%w1, 1b\n"
-	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
-	:
-	: "memory");
-}
-
-static inline int arch_read_trylock(arch_rwlock_t *rw)
-{
-	unsigned int tmp, tmp2 = 1;
-
-	asm volatile(
-	"	ldaxr	%w0, %2\n"
-	"	add	%w0, %w0, #1\n"
-	"	tbnz	%w0, #31, 1f\n"
-	"	stxr	%w1, %w0, %2\n"
-	"1:\n"
-	: "=&r" (tmp), "+r" (tmp2), "+Q" (rw->lock)
-	:
-	: "memory");
-
-	return !tmp2;
-}
-
-/* read_can_lock - would read_trylock() succeed? */
-#define arch_read_can_lock(x)		((x)->lock < 0x80000000)
-
-#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
-#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
-
-#define arch_spin_relax(lock)	cpu_relax()
-#define arch_read_relax(lock)	cpu_relax()
-#define arch_write_relax(lock)	cpu_relax()
+#include <asm/qrwlock.h>
 
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/arm64/include/asm/spinlock_types.h b/arch/arm64/include/asm/spinlock_types.h
index b8d383665f56..2d636be9c7ff 100644
--- a/arch/arm64/include/asm/spinlock_types.h
+++ b/arch/arm64/include/asm/spinlock_types.h
@@ -16,10 +16,6 @@
 #ifndef __ASM_SPINLOCK_TYPES_H
 #define __ASM_SPINLOCK_TYPES_H
 
-#if !defined(__LINUX_SPINLOCK_TYPES_H) && !defined(__ASM_SPINLOCK_H)
-# error "please don't include this file directly"
-#endif
-
 #define TICKET_SHIFT	16
 
 typedef struct {
@@ -34,10 +30,6 @@ typedef struct {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ 0 , 0 }
 
-typedef struct {
-	volatile unsigned int lock;
-} arch_rwlock_t;
-
-#define __ARCH_RW_LOCK_UNLOCKED		{ 0 }
+#include <asm-generic/qrwlock_types.h>
 
 #endif
-- 
2.1.4

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-07 17:24 ` [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath Will Deacon
@ 2015-07-07 17:51   ` Waiman Long
  2015-07-07 18:19     ` Will Deacon
  2015-07-07 21:30     ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Waiman Long @ 2015-07-07 17:51 UTC
  To: linux-arm-kernel

On 07/07/2015 01:24 PM, Will Deacon wrote:
> When a slow-path reader gets to the front of the wait queue outside of
> interrupt context, it waits for any writers to drain, increments the
> reader count and again waits for any additional writers that may have
> snuck in between the initial check and the increment.
>
> Given that this second check is performed with acquire semantics, there
> is no need to perform the increment using atomic_add_return, which acts
> as a full barrier.
>
> This patch changes the slow-path code to use smp_load_acquire and
> atomic_add instead of atomic_add_return. Since the check only involves
> the writer count, we can perform the acquire after the add.
>
> Signed-off-by: Will Deacon<will.deacon@arm.com>
> ---
>   kernel/locking/qrwlock.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 96b77d1e0545..4e29bef688ac 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -98,7 +98,8 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>   	while (atomic_read(&lock->cnts)&  _QW_WMASK)
>   		cpu_relax_lowlatency();
>
> -	cnts = atomic_add_return(_QR_BIAS,&lock->cnts) - _QR_BIAS;
> +	atomic_add(_QR_BIAS,&lock->cnts);
> +	cnts = smp_load_acquire((u32 *)&lock->cnts);
>   	rspin_until_writer_unlock(lock, cnts);
>
>   	/*

Atomic add in x86 is actually a full barrier too. The performance 
difference between "lock add" and "lock xadd" should be minor. The 
additional load, however, could potentially cause an additional 
cacheline load on a contended lock. So do you see actual performance 
benefit of this change in ARM?

Cheers,
Longman

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-07 17:51   ` Waiman Long
@ 2015-07-07 18:19     ` Will Deacon
  2015-07-07 19:28       ` Waiman Long
  2015-07-07 21:30     ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Will Deacon @ 2015-07-07 18:19 UTC
  To: linux-arm-kernel

On Tue, Jul 07, 2015 at 06:51:54PM +0100, Waiman Long wrote:
> On 07/07/2015 01:24 PM, Will Deacon wrote:
> > When a slow-path reader gets to the front of the wait queue outside of
> > interrupt context, it waits for any writers to drain, increments the
> > reader count and again waits for any additional writers that may have
> > snuck in between the initial check and the increment.
> >
> > Given that this second check is performed with acquire semantics, there
> > is no need to perform the increment using atomic_add_return, which acts
> > as a full barrier.
> >
> > This patch changes the slow-path code to use smp_load_acquire and
> > atomic_add instead of atomic_add_return. Since the check only involves
> > the writer count, we can perform the acquire after the add.
> >
> > Signed-off-by: Will Deacon<will.deacon@arm.com>
> > ---
> >   kernel/locking/qrwlock.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> > index 96b77d1e0545..4e29bef688ac 100644
> > --- a/kernel/locking/qrwlock.c
> > +++ b/kernel/locking/qrwlock.c
> > @@ -98,7 +98,8 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> >   	while (atomic_read(&lock->cnts)&  _QW_WMASK)
> >   		cpu_relax_lowlatency();
> >
> > -	cnts = atomic_add_return(_QR_BIAS,&lock->cnts) - _QR_BIAS;
> > +	atomic_add(_QR_BIAS,&lock->cnts);
> > +	cnts = smp_load_acquire((u32 *)&lock->cnts);
> >   	rspin_until_writer_unlock(lock, cnts);
> >
> >   	/*
> 
> Atomic add in x86 is actually a full barrier too. The performance 
> difference between "lock add" and "lock xadd" should be minor. The 
> additional load, however, could potentially cause an additional 
> cacheline load on a contended lock. So do you see actual performance 
> benefit of this change in ARM?

I'd need to re-run the numbers, but atomic_add is significantly less
work on ARM than atomic_add_return, which basically has two full memory
barriers compared to none for the former.

Will

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-07 18:19     ` Will Deacon
@ 2015-07-07 19:28       ` Waiman Long
  2015-07-08  9:59         ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2015-07-07 19:28 UTC
  To: linux-arm-kernel

On 07/07/2015 02:19 PM, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 06:51:54PM +0100, Waiman Long wrote:
>> On 07/07/2015 01:24 PM, Will Deacon wrote:
>>> When a slow-path reader gets to the front of the wait queue outside of
>>> interrupt context, it waits for any writers to drain, increments the
>>> reader count and again waits for any additional writers that may have
>>> snuck in between the initial check and the increment.
>>>
>>> Given that this second check is performed with acquire semantics, there
>>> is no need to perform the increment using atomic_add_return, which acts
>>> as a full barrier.
>>>
>>> This patch changes the slow-path code to use smp_load_acquire and
>>> atomic_add instead of atomic_add_return. Since the check only involves
>>> the writer count, we can perform the acquire after the add.
>>>
>>> Signed-off-by: Will Deacon<will.deacon@arm.com>
>>> ---
>>>    kernel/locking/qrwlock.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>>> index 96b77d1e0545..4e29bef688ac 100644
>>> --- a/kernel/locking/qrwlock.c
>>> +++ b/kernel/locking/qrwlock.c
>>> @@ -98,7 +98,8 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>>>    	while (atomic_read(&lock->cnts)&   _QW_WMASK)
>>>    		cpu_relax_lowlatency();
>>>
>>> -	cnts = atomic_add_return(_QR_BIAS,&lock->cnts) - _QR_BIAS;
>>> +	atomic_add(_QR_BIAS,&lock->cnts);
>>> +	cnts = smp_load_acquire((u32 *)&lock->cnts);
>>>    	rspin_until_writer_unlock(lock, cnts);
>>>
>>>    	/*
>> Atomic add in x86 is actually a full barrier too. The performance
>> difference between "lock add" and "lock xadd" should be minor. The
>> additional load, however, could potentially cause an additional
>> cacheline load on a contended lock. So do you see actual performance
>> benefit of this change in ARM?
> I'd need to re-run the numbers, but atomic_add is significantly less
> work on ARM than atomic_add_return, which basically has two full memory
> barriers compared to none for the former.
>
> Will

I think a compromise is to encapsulate that in an inlined function that 
can be overridden by architecture specific code. For example,

#ifndef queued_inc_reader_return
static inline u32 queued_inc_reader_return(struct qrwlock *lock)
{
     return atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
}
#endif

:

cnts = queued_inc_reader_return(lock);

This also means that you will need to keep the function prototype for 
rspin_until_writer_unlock() in the later patch. Other than that, I don't 
see any other issue in the patch series.

BTW, are you also planning to use qspinlock in the ARM64 code?

Cheers,
Longman

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-07 17:51   ` Waiman Long
  2015-07-07 18:19     ` Will Deacon
@ 2015-07-07 21:30     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-07-07 21:30 UTC
  To: linux-arm-kernel

On Tue, Jul 07, 2015 at 01:51:54PM -0400, Waiman Long wrote:
> >-	cnts = atomic_add_return(_QR_BIAS,&lock->cnts) - _QR_BIAS;
> >+	atomic_add(_QR_BIAS,&lock->cnts);
> >+	cnts = smp_load_acquire((u32 *)&lock->cnts);
> >  	rspin_until_writer_unlock(lock, cnts);
> >
> >  	/*
> 
> Atomic add in x86 is actually a full barrier too. The performance difference
> between "lock add" and "lock xadd" should be minor. The additional load,
> however, could potentially cause an additional cacheline load on a contended
> lock. So do you see actual performance benefit of this change in ARM?

Yes, atomic_add() does not imply (and does not have) any memory barriers
on ARM.

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-07 19:28       ` Waiman Long
@ 2015-07-08  9:59         ` Peter Zijlstra
  2015-07-08 13:37           ` Will Deacon
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2015-07-08  9:59 UTC
  To: linux-arm-kernel

On Tue, Jul 07, 2015 at 03:28:13PM -0400, Waiman Long wrote:
> #ifndef queued_inc_reader_return
> static inline u32 queued_inc_reader_return(struct qrwlock *lock)

That's actually a misleading name: we don't do an inc_return, we do a
fetch_inc.

> {
>     return atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> }
> #endif

With that you (Will) could even reuse the ll/sc load and avoid issuing a
second load. You basically need atomic_fetch_add_relaxed().
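
A sketch of what that might look like, assuming such a primitive were
available (the acquire ordering discussed in patch 2 would still have to
come from the subsequent spin):

#ifndef queued_inc_reader_return
static inline u32 queued_inc_reader_return(struct qrwlock *lock)
{
	/*
	 * A fetch-add returns the pre-increment value, so the ll/sc load
	 * doubles as the read of the lock word: no second load needed.
	 */
	return atomic_fetch_add_relaxed(_QR_BIAS, &lock->cnts);
}
#endif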

* [PATCH 4/9] locking/qrwlock: implement queue_write_unlock using smp_store_release
  2015-07-07 17:24 ` [PATCH 4/9] locking/qrwlock: implement queue_write_unlock using smp_store_release Will Deacon
@ 2015-07-08 10:00   ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-07-08 10:00 UTC
  To: linux-arm-kernel

On Tue, Jul 07, 2015 at 06:24:20PM +0100, Will Deacon wrote:
> smp_store_release supports byte accesses, so use that in writer unlock
> and remove the conditional macro override.
> 

Maybe mention that this is possible since:

536fa402221f ("compiler: Allow 1- and 2-byte smp_load_acquire() and
smp_store_release()")

* [PATCH 5/9] locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path
  2015-07-07 17:24 ` [PATCH 5/9] locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path Will Deacon
@ 2015-07-08 10:05   ` Peter Zijlstra
  2015-07-08 13:34     ` Will Deacon
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2015-07-08 10:05 UTC
  To: linux-arm-kernel

On Tue, Jul 07, 2015 at 06:24:21PM +0100, Will Deacon wrote:
> +#ifndef cmpxchg_relaxed
> +# define cmpxchg_relaxed	cmpxchg
> +#endif

Should we collate this _relaxed stuff and make it 'official' instead of
these ad-hoc/in-situ things?

There's more archs that can usefully implement them than seem to have
implemented them atm. Of course that means someone doing a full arch/*
sweep, but hey.. :-)

* [PATCH 6/9] locking/qrwlock: allow architectures to hook in to contended paths
  2015-07-07 17:24 ` [PATCH 6/9] locking/qrwlock: allow architectures to hook in to contended paths Will Deacon
@ 2015-07-08 10:06   ` Peter Zijlstra
  2015-07-08 13:35     ` Will Deacon
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2015-07-08 10:06 UTC
  To: linux-arm-kernel

On Tue, Jul 07, 2015 at 06:24:22PM +0100, Will Deacon wrote:
> When contended, architectures may be able to reduce the polling overhead
> in ways which aren't expressible using a simple relax() primitive.
> 
> This patch allows architectures to override the use of
> cpu_relax_lowlatency() in the qrwlock code and also implement their own
> unlock macros in case explicit signalling is required to wake up a
> `relaxed' CPU spinning on an unlock event.

No real objection, but could you do this _after_ you've converted
AARGH64 to use the normal qrwlock, such that you can show the benefit
with numbers?

* [PATCH 5/9] locking/qrwlock: remove redundant cmpxchg barriers on writer slow-path
  2015-07-08 10:05   ` Peter Zijlstra
@ 2015-07-08 13:34     ` Will Deacon
  0 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-08 13:34 UTC
  To: linux-arm-kernel

On Wed, Jul 08, 2015 at 11:05:26AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 06:24:21PM +0100, Will Deacon wrote:
> > +#ifndef cmpxchg_relaxed
> > +# define cmpxchg_relaxed	cmpxchg
> > +#endif
> 
> Should we collate this _relaxed stuff and make it 'official' instead of
> these ad-hoc/in-situ things?
> 
> There's more archs that can usefully implement them than seem to have
> implemented them atm. Of course that means someone doing a full arch/*
> sweep, but hey.. :-)

Well, in writing this series, I'm seeing a repeated need for:

  * acquire/release/relaxed variants of cmpxchg
  * acquire/relaxed atomic_add_return
  * acquire/release atomic_sub

I also suspect that if I look at getting qspinlock up and running, the
list above will grow.

So you're right, but it sounds like we need to extend the atomic APIs to
have acquire/release/relaxed variants. The easiest start would be to
extend the _return variants (+cmpxchg) to allow the new options, but
defaulting to the existing (full barrier) implementations if the arch
doesn't provide an alternative. Weird things like dec_if_positive could
be left alone (i.e. not implemented) for now.
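
Something along the lines of the ad-hoc fallback in patch 5, but done
centrally (sketch; names as per the list above):

#ifndef atomic_add_return_relaxed
#define atomic_add_return_relaxed	atomic_add_return
#endif

#ifndef cmpxchg_relaxed
#define cmpxchg_relaxed			cmpxchg
#endif

#ifndef cmpxchg_acquire
#define cmpxchg_acquire			cmpxchg
#endif

#ifndef cmpxchg_release
#define cmpxchg_release			cmpxchg
#endif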

The hard part is defining the semantics of these new flavours. Do we want
SC acquire/release (i.e. what we have on arm64) or PC acquire/release (i.e.
what we have in C11)? For architectures building these constructs out of
barrier instructions, the former requires an additional barrier following
a release operation so that it is ordered against a subsequent acquire.

Another potential problem of defining things this way is cmpxchg_acquire
potentially giving relaxed semantics if the comparison fails (different
to C11, iirc).

Anyway, clearly a separate series. Should keep me busy...

Will

* [PATCH 6/9] locking/qrwlock: allow architectures to hook in to contended paths
  2015-07-08 10:06   ` Peter Zijlstra
@ 2015-07-08 13:35     ` Will Deacon
  0 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-08 13:35 UTC
  To: linux-arm-kernel

On Wed, Jul 08, 2015 at 11:06:57AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 06:24:22PM +0100, Will Deacon wrote:
> > When contended, architectures may be able to reduce the polling overhead
> > in ways which aren't expressible using a simple relax() primitive.
> > 
> > This patch allows architectures to override the use of
> > cpu_relax_lowlatency() in the qrwlock code and also implement their own
> > unlock macros in case explicit signalling is required to wake up a
> > `relaxed' CPU spinning on an unlock event.
> 
> No real objection, but could you do this _after_ you've converted
> AARGH64 to use the normal qrwlock, such that you can show the benefit
> with numbers?

Sure, although the biggest gain may be in the form of reduced power
consumption, which I can't easily measure on my development platform.

Will

* [PATCH 2/9] locking/qrwlock: avoid redundant atomic_add_return on read_lock_slowpath
  2015-07-08  9:59         ` Peter Zijlstra
@ 2015-07-08 13:37           ` Will Deacon
  0 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-07-08 13:37 UTC
  To: linux-arm-kernel

On Wed, Jul 08, 2015 at 10:59:44AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 03:28:13PM -0400, Waiman Long wrote:
> > #ifndef queued_inc_reader_return
> > static inline u32 queued_inc_reader_return(struct qrwlock *lock)
> 
> That's actually a misleading name, we don't do an inc_return, we do a
> fetch_inc.
> 
> > {
> >     return atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> > }
> > #endif
> 
> With that you (Will) could even reuse the ll/sc load and avoid issuing a
> second load. You basically need atomic_fetch_add_relaxed().

ARMv8.1 has an instruction for that :) [ldadd].

Will
