From: Andreas Herrmann <aherrmann@suse.com>
To: Alex Kogan <alex.kogan@oracle.com>
Cc: linux@armlinux.org.uk, peterz@infradead.org, mingo@redhat.com,
	will.deacon@arm.com, arnd@arndb.de, longman@redhat.com,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com,
	jglauber@marvell.com, steven.sistare@oracle.com,
	daniel.m.jordan@oracle.com, dave.dice@oracle.com
Subject: Re: [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
Date: Wed, 14 Apr 2021 09:47:01 +0200
Message-ID: <YHad9S5ckj5IR1l6@suselix>
In-Reply-To: <20210401153156.1165900-7-alex.kogan@oracle.com>

On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
> This performance optimization chooses probabilistically to avoid moving
> threads from the main queue into the secondary one when the secondary queue
> is empty.
> 
> It is helpful when the lock is only lightly contended. In particular, it
> makes CNA less eager to create a secondary queue, but does not introduce
> any extra delays for threads waiting in that queue once it is created.
> 
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index 29c3abbd3d94..983c6a47a221 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -5,6 +5,7 @@
>  
>  #include <linux/topology.h>
>  #include <linux/sched/rt.h>
> +#include <linux/random.h>
>  
>  /*
>   * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
>  	return current_time - threshold > 0;
>  }
>  
> +/*
> + * Controls the probability for enabling the ordering of the main queue
> + * when the secondary queue is empty. The chosen value reduces the amount
> + * of unnecessary shuffling of threads between the two waiting queues
> + * when the contention is low, while responding fast enough and enabling
> + * the shuffling when the contention is high.
> + */
> +#define SHUFFLE_REDUCTION_PROB_ARG  (7)

Out of curiosity:

Have you tried other values and measured which value of
SHUFFLE_REDUCTION_PROB_ARG is most efficient?
Maybe I miscalculated, but if I understand it correctly this value
implies that probably() below returns true with probability
127/128 = 0.9921875.
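
For reference, a quick userspace sketch of that math (untested;
it assumes next_pseudo_random32() is the Numerical-Recipes-style
LCG from include/linux/random.h):

#include <stdio.h>
#include <stdint.h>

/* same LCG the kernel uses for next_pseudo_random32() */
static uint32_t next_pseudo_random32(uint32_t seed)
{
	return seed * 1664525 + 1013904223;
}

int main(void)
{
	uint32_t s = 1;
	unsigned long i, hits = 0, n = 10000000;

	for (i = 0; i < n; i++) {
		s = next_pseudo_random32(s);
		if (s & ((1u << 7) - 1))	/* probably(7) */
			hits++;
	}
	/* expect ~0.992188, i.e. 127/128 */
	printf("P(true) = %f\n", (double)hits / n);
	return 0;
}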

My question is probably answered by the following statement from the
referenced paper:

"In our experiments with the shuffle reduction optimization enabled,
we set THRESHOLD2 to 0xff." (page with figure 5)

> +
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Return false with probability 1 / 2^@num_bits.
> + * Intuitively, the larger @num_bits the less likely false is to be returned.
> + * @num_bits must be a number between 0 and 31.
> + */
> +static bool probably(unsigned int num_bits)
> +{
> +	u32 s;
> +
> +	s = this_cpu_read(seed);
> +	s = next_pseudo_random32(s);
> +	this_cpu_write(seed, s);
> +
> +	return s & ((1 << num_bits) - 1);
> +}
> +
>  static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>  {
>  	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>  {
>  	struct cna_node *cn = (struct cna_node *)node;
>  
> +	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {

Again, if I understand it correctly, with SHUFFLE_REDUCTION_PROB_ARG == 7
probably() returns false in roughly 1 out of 128 cases.

Why/when is this beneficial?

I assume it has to do with following statement in the referenced
paper:

"The superior performance over MCS at 4 threads is the result of the
shuffling that does take place once in a while, organizing threads’
arrivals to the lock in a way that reduces the inter-socket lock
migration without the need to continuously modify the main queue. ..."
(page with figure 9; the paper has no page numbers)

But OTOH, why this pseudo-randomness?

How about deterministically treating every 128th execution differently
(that also matches "once in a while"), thus removing the
pseudo-randomness entirely?
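
Something like this untested sketch (hypothetical, not from the patch;
it keeps probably()'s signature but replaces the PRNG with a plain
per-CPU counter):

static DEFINE_PER_CPU(u32, shuffle_cnt);

static bool probably(unsigned int num_bits)
{
	/* return false (i.e. do the shuffling) every 2^num_bits calls */
	return this_cpu_inc_return(shuffle_cnt) & ((1 << num_bits) - 1);
}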

Have you tried this? If so, why was it worse than pseudo-randomness?

(Or maybe I missed something and pseudo-randomness is required there
for other reasons.)

> +		/*
> +		 * When the secondary queue is empty, skip the call to
> +		 * cna_order_queue() below with high probability. This optimization
> +		 * reduces the overhead of unnecessary shuffling of threads
> +		 * between waiting queues when the lock is only lightly contended.
> +		 */
> +		return 0;
> +	}
> +
>  	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
>  		/*
>  		 * We are at the head of the wait queue, no need to use
> -- 
> 2.24.3 (Apple Git-128)
> 

-- 
Regards,
Andreas


