From: Alex Kogan <alex.kogan@oracle.com> To: linux@armlinux.org.uk, peterz@infradead.org, mingo@redhat.com, will.deacon@arm.com, arnd@arndb.de, longman@redhat.com, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de, hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com, jglauber@marvell.com Cc: steven.sistare@oracle.com, daniel.m.jordan@oracle.com, alex.kogan@oracle.com, dave.dice@oracle.com Subject: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Date: Thu, 1 Apr 2021 11:31:54 -0400 [thread overview] Message-ID: <20210401153156.1165900-5-alex.kogan@oracle.com> (raw) In-Reply-To: <20210401153156.1165900-1-alex.kogan@oracle.com> Keep track of the time the thread at the head of the secondary queue has been waiting, and force inter-node handoff once this time passes a preset threshold. The default value for the threshold (10ms) can be overridden with the new kernel boot command-line option "numa_spinlock_threshold". The ms value is translated internally to the nearest rounded-up jiffies. Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> --- .../admin-guide/kernel-parameters.txt | 9 ++ kernel/locking/qspinlock_cna.h | 96 ++++++++++++++++--- 2 files changed, 93 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ace55afd4441..5c959631a8c8 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3485,6 +3485,15 @@ Not specifying this option is equivalent to numa_spinlock=auto. + numa_spinlock_threshold= [NUMA, PV_OPS] + Set the time threshold in milliseconds for the + number of intra-node lock hand-offs before the + NUMA-aware spinlock is forced to be passed to + a thread on another NUMA node. Valid values + are in the [1..100] range. Smaller values result + in a more fair, but less performant spinlock, + and vice versa. The default value is 10. + numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. 'node', 'default' can be specified This can be set from sysctl after boot. diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h index d689861a7b3d..0513360c11fe 100644 --- a/kernel/locking/qspinlock_cna.h +++ b/kernel/locking/qspinlock_cna.h @@ -37,6 +37,12 @@ * gradually filter the primary queue, leaving only waiters running on the same * preferred NUMA node. * + * We change the NUMA node preference after a waiter at the head of the + * secondary queue spins for a certain amount of time (10ms, by default). + * We do that by flushing the secondary queue into the head of the primary queue, + * effectively changing the preference to the NUMA node of the waiter at the head + * of the secondary queue at the time of the flush. + * * For more details, see https://arxiv.org/abs/1810.05600. * * Authors: Alex Kogan <alex.kogan@oracle.com> @@ -49,13 +55,33 @@ struct cna_node { u16 real_numa_node; u32 encoded_tail; /* self */ u32 partial_order; /* enum val */ + s32 start_time; }; enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND, + FLUSH_SECONDARY_QUEUE }; +/* + * Controls the threshold time in ms (default = 10) for intra-node lock + * hand-offs before the NUMA-aware variant of spinlock is forced to be + * passed to a thread on another NUMA node. The default setting can be + * changed with the "numa_spinlock_threshold" boot option. + */ +#define MSECS_TO_JIFFIES(m) \ + (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)) +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10); + +static inline bool intra_node_threshold_reached(struct cna_node *cn) +{ + s32 current_time = (s32)jiffies; + s32 threshold = cn->start_time + intra_node_handoff_threshold; + + return current_time - threshold > 0; +} + static void __init cna_init_nodes_per_cpu(unsigned int cpu) { struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu); @@ -99,6 +125,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node) cn->numa_node = cn->real_numa_node; cn->partial_order = LOCAL_WAITER_FOUND; + cn->start_time = 0; } /* @@ -198,8 +225,15 @@ static void cna_splice_next(struct mcs_spinlock *node, /* stick `next` on the secondary queue tail */ if (node->locked <= 1) { /* if secondary queue is empty */ + struct cna_node *cn = (struct cna_node *)node; + /* create secondary queue */ next->next = next; + + cn->start_time = (s32)jiffies; + /* make sure start_time != 0 iff secondary queue is not empty */ + if (!cn->start_time) + cn->start_time = 1; } else { /* add to the tail of the secondary queue */ struct mcs_spinlock *tail_2nd = decode_tail(node->locked); @@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node) static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node) { - /* - * Try and put the time otherwise spent spin waiting on - * _Q_LOCKED_PENDING_MASK to use by sorting our lists. - */ - cna_order_queue(node); + struct cna_node *cn = (struct cna_node *)node; + + if (!cn->start_time || !intra_node_threshold_reached(cn)) { + /* + * Try and put the time otherwise spent spin waiting on + * _Q_LOCKED_PENDING_MASK to use by sorting our lists. + */ + cna_order_queue(node); + } else { + cn->partial_order = FLUSH_SECONDARY_QUEUE; + } return 0; /* we lied; we didn't wait, go do so now */ } @@ -270,13 +310,28 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node, if (partial_order == LOCAL_WAITER_NOT_FOUND) cna_order_queue(node); - /* - * We have a local waiter, either real or fake one; - * reload @next in case it was changed by cna_order_queue(). - */ - next = node->next; - if (node->locked > 1) - val = node->locked; /* preseve secondary queue */ + if (partial_order != FLUSH_SECONDARY_QUEUE) { + /* + * We have a local waiter, either real or fake one; + * reload @next in case it was changed by cna_order_queue(). + */ + next = node->next; + if (node->locked > 1) { + val = node->locked; /* preseve secondary queue */ + ((struct cna_node *)next)->start_time = cn->start_time; + } + } else { + /* + * We decided to flush the secondary queue; + * this can only happen if that queue is not empty. + */ + WARN_ON(node->locked <= 1); + /* + * Splice the secondary queue onto the primary queue and pass the lock + * to the longest waiting remote waiter. + */ + next = cna_splice_head(NULL, 0, node, next); + } arch_mcs_lock_handoff(&next->locked, val); } @@ -328,3 +383,20 @@ void __init cna_configure_spin_lock_slowpath(void) pr_info("Enabling CNA spinlock\n"); } + +static int __init numa_spinlock_threshold_setup(char *str) +{ + int param; + + if (get_option(&str, ¶m)) { + /* valid value is between 1 and 100 */ + if (param <= 0 || param > 100) + return 0; + + intra_node_handoff_threshold = msecs_to_jiffies(param); + return 1; + } + + return 0; +} +__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup); -- 2.24.3 (Apple Git-128) _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Alex Kogan <alex.kogan@oracle.com> To: linux@armlinux.org.uk, peterz@infradead.org, mingo@redhat.com, will.deacon@arm.com, arnd@arndb.de, longman@redhat.com, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de, hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com, jglauber@marvell.com Cc: steven.sistare@oracle.com, daniel.m.jordan@oracle.com, alex.kogan@oracle.com, dave.dice@oracle.com Subject: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Date: Thu, 1 Apr 2021 11:31:54 -0400 [thread overview] Message-ID: <20210401153156.1165900-5-alex.kogan@oracle.com> (raw) In-Reply-To: <20210401153156.1165900-1-alex.kogan@oracle.com> Keep track of the time the thread at the head of the secondary queue has been waiting, and force inter-node handoff once this time passes a preset threshold. The default value for the threshold (10ms) can be overridden with the new kernel boot command-line option "numa_spinlock_threshold". The ms value is translated internally to the nearest rounded-up jiffies. Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> --- .../admin-guide/kernel-parameters.txt | 9 ++ kernel/locking/qspinlock_cna.h | 96 ++++++++++++++++--- 2 files changed, 93 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ace55afd4441..5c959631a8c8 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3485,6 +3485,15 @@ Not specifying this option is equivalent to numa_spinlock=auto. + numa_spinlock_threshold= [NUMA, PV_OPS] + Set the time threshold in milliseconds for the + number of intra-node lock hand-offs before the + NUMA-aware spinlock is forced to be passed to + a thread on another NUMA node. Valid values + are in the [1..100] range. Smaller values result + in a more fair, but less performant spinlock, + and vice versa. The default value is 10. + numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. 'node', 'default' can be specified This can be set from sysctl after boot. diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h index d689861a7b3d..0513360c11fe 100644 --- a/kernel/locking/qspinlock_cna.h +++ b/kernel/locking/qspinlock_cna.h @@ -37,6 +37,12 @@ * gradually filter the primary queue, leaving only waiters running on the same * preferred NUMA node. * + * We change the NUMA node preference after a waiter at the head of the + * secondary queue spins for a certain amount of time (10ms, by default). + * We do that by flushing the secondary queue into the head of the primary queue, + * effectively changing the preference to the NUMA node of the waiter at the head + * of the secondary queue at the time of the flush. + * * For more details, see https://arxiv.org/abs/1810.05600. * * Authors: Alex Kogan <alex.kogan@oracle.com> @@ -49,13 +55,33 @@ struct cna_node { u16 real_numa_node; u32 encoded_tail; /* self */ u32 partial_order; /* enum val */ + s32 start_time; }; enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND, + FLUSH_SECONDARY_QUEUE }; +/* + * Controls the threshold time in ms (default = 10) for intra-node lock + * hand-offs before the NUMA-aware variant of spinlock is forced to be + * passed to a thread on another NUMA node. The default setting can be + * changed with the "numa_spinlock_threshold" boot option. + */ +#define MSECS_TO_JIFFIES(m) \ + (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)) +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10); + +static inline bool intra_node_threshold_reached(struct cna_node *cn) +{ + s32 current_time = (s32)jiffies; + s32 threshold = cn->start_time + intra_node_handoff_threshold; + + return current_time - threshold > 0; +} + static void __init cna_init_nodes_per_cpu(unsigned int cpu) { struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu); @@ -99,6 +125,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node) cn->numa_node = cn->real_numa_node; cn->partial_order = LOCAL_WAITER_FOUND; + cn->start_time = 0; } /* @@ -198,8 +225,15 @@ static void cna_splice_next(struct mcs_spinlock *node, /* stick `next` on the secondary queue tail */ if (node->locked <= 1) { /* if secondary queue is empty */ + struct cna_node *cn = (struct cna_node *)node; + /* create secondary queue */ next->next = next; + + cn->start_time = (s32)jiffies; + /* make sure start_time != 0 iff secondary queue is not empty */ + if (!cn->start_time) + cn->start_time = 1; } else { /* add to the tail of the secondary queue */ struct mcs_spinlock *tail_2nd = decode_tail(node->locked); @@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node) static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node) { - /* - * Try and put the time otherwise spent spin waiting on - * _Q_LOCKED_PENDING_MASK to use by sorting our lists. - */ - cna_order_queue(node); + struct cna_node *cn = (struct cna_node *)node; + + if (!cn->start_time || !intra_node_threshold_reached(cn)) { + /* + * Try and put the time otherwise spent spin waiting on + * _Q_LOCKED_PENDING_MASK to use by sorting our lists. + */ + cna_order_queue(node); + } else { + cn->partial_order = FLUSH_SECONDARY_QUEUE; + } return 0; /* we lied; we didn't wait, go do so now */ } @@ -270,13 +310,28 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node, if (partial_order == LOCAL_WAITER_NOT_FOUND) cna_order_queue(node); - /* - * We have a local waiter, either real or fake one; - * reload @next in case it was changed by cna_order_queue(). - */ - next = node->next; - if (node->locked > 1) - val = node->locked; /* preseve secondary queue */ + if (partial_order != FLUSH_SECONDARY_QUEUE) { + /* + * We have a local waiter, either real or fake one; + * reload @next in case it was changed by cna_order_queue(). + */ + next = node->next; + if (node->locked > 1) { + val = node->locked; /* preseve secondary queue */ + ((struct cna_node *)next)->start_time = cn->start_time; + } + } else { + /* + * We decided to flush the secondary queue; + * this can only happen if that queue is not empty. + */ + WARN_ON(node->locked <= 1); + /* + * Splice the secondary queue onto the primary queue and pass the lock + * to the longest waiting remote waiter. + */ + next = cna_splice_head(NULL, 0, node, next); + } arch_mcs_lock_handoff(&next->locked, val); } @@ -328,3 +383,20 @@ void __init cna_configure_spin_lock_slowpath(void) pr_info("Enabling CNA spinlock\n"); } + +static int __init numa_spinlock_threshold_setup(char *str) +{ + int param; + + if (get_option(&str, ¶m)) { + /* valid value is between 1 and 100 */ + if (param <= 0 || param > 100) + return 0; + + intra_node_handoff_threshold = msecs_to_jiffies(param); + return 1; + } + + return 0; +} +__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup); -- 2.24.3 (Apple Git-128)
next prev parent reply other threads:[~2021-04-01 15:37 UTC|newest] Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan 2021-04-01 15:31 ` Alex Kogan 2021-04-01 15:31 ` [PATCH v14 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan 2021-04-01 15:31 ` Alex Kogan 2021-04-01 15:31 ` [PATCH v14 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan 2021-04-01 15:31 ` Alex Kogan 2021-04-01 15:31 ` [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan 2021-04-01 15:31 ` Alex Kogan 2021-04-13 11:30 ` Peter Zijlstra 2021-04-13 11:30 ` Peter Zijlstra 2021-04-14 2:29 ` [External] : " Alex Kogan 2021-04-14 2:29 ` Alex Kogan 2021-04-01 15:31 ` Alex Kogan [this message] 2021-04-01 15:31 ` [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan 2021-04-13 6:03 ` Andi Kleen 2021-04-13 6:03 ` Andi Kleen 2021-04-13 6:12 ` Andi Kleen 2021-04-13 6:12 ` Andi Kleen 2021-04-13 21:01 ` [External] : " Alex Kogan 2021-04-13 21:01 ` Alex Kogan 2021-04-13 21:22 ` Andi Kleen 2021-04-13 21:22 ` Andi Kleen 2021-04-14 2:30 ` Alex Kogan 2021-04-14 2:30 ` Alex Kogan 2021-04-14 16:41 ` Waiman Long 2021-04-14 16:41 ` Waiman Long 2021-04-14 17:26 ` Andi Kleen 2021-04-14 17:26 ` Andi Kleen 2021-04-14 17:31 ` Waiman Long 2021-04-14 17:31 ` Waiman Long 2021-04-13 12:03 ` Peter Zijlstra 2021-04-13 12:03 ` Peter Zijlstra 2021-04-16 2:52 ` [External] : " Alex Kogan 2021-04-16 2:52 ` Alex Kogan 2021-04-01 15:31 ` [PATCH v14 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan 2021-04-01 15:31 ` Alex Kogan 2021-04-01 15:31 ` [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan 2021-04-01 15:31 ` Alex Kogan 2021-04-14 7:47 ` Andreas Herrmann 2021-04-14 7:47 ` Andreas Herrmann 2021-04-14 18:18 ` [External] : " Alex Kogan 2021-04-14 18:18 ` Alex Kogan
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20210401153156.1165900-5-alex.kogan@oracle.com \ --to=alex.kogan@oracle.com \ --cc=arnd@arndb.de \ --cc=bp@alien8.de \ --cc=daniel.m.jordan@oracle.com \ --cc=dave.dice@oracle.com \ --cc=guohanjun@huawei.com \ --cc=hpa@zytor.com \ --cc=jglauber@marvell.com \ --cc=linux-arch@vger.kernel.org \ --cc=linux-arm-kernel@lists.infradead.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux@armlinux.org.uk \ --cc=longman@redhat.com \ --cc=mingo@redhat.com \ --cc=peterz@infradead.org \ --cc=steven.sistare@oracle.com \ --cc=tglx@linutronix.de \ --cc=will.deacon@arm.com \ --cc=x86@kernel.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.