LKML Archive mirror
* [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
@ 2023-06-06 12:54 Yu Ma
  2023-06-06 19:21 ` Liam R. Howlett
  2023-06-07 14:50 ` [PATCH v2] " Yu Ma
  0 siblings, 2 replies; 12+ messages in thread
From: Yu Ma @ 2023-06-06 12:54 UTC (permalink / raw)
  To: akpm, tim.c.chen
  Cc: linux-mm, linux-kernel, dave.hansen, dan.j.williams, shakeelb,
	Liam.Howlett, pan.deng, tianyou.li, lipeng.zhu, tim.c.chen, yu.ma

When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts
are spawned frequently to run short jobs. Such a workload issues the
execl system call frequently, and execl calls mm_init to initialize
the mm_struct of the process. mm_init calls __percpu_counter_init to
initialize the percpu counters, which in turn calls pcpu_alloc.
pcpu_alloc reads the base_addr of a pcpu_chunk for memory allocation
and then calls pcpu_alloc_area to allocate memory from the specified
chunk. pcpu_alloc_area updates "free_bytes" and "chunk_md" to record
the remaining free bytes and other metadata of the chunk.
Correspondingly, pcpu_free_area also updates these two members when
memory is freed.
The call trace from perf is shown below:
+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
   - 53.54% 0x654278696e552f34
        main
        __execve
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_execve
        do_execveat_common.isra.47
        alloc_bprm
        mm_init
        __percpu_counter_init
        pcpu_alloc
      - __mutex_lock.isra.17
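
(Side note, not part of the original report: contention of this kind
can also be examined directly with perf's cache-to-cache analysis,
for example:

	perf c2c record -- <workload>
	perf c2c report

where <workload> is a placeholder for the benchmark command.)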

In the current pcpu_chunk layout, 'base_addr' is in the same cache
line as 'free_bytes' and 'chunk_md', occupying the last 8 bytes of
that cache line. This patch moves 'bound_map' up, ahead of
'base_addr', so that 'base_addr' starts on a new cache line.
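
For reference, a rough sketch of the relevant field placement before
and after this change (offsets taken from the pahole output posted
later in this thread; x86_64 with 64-byte cache lines):

	/* before: the hot fields and base_addr all share cacheline 0 */
	struct list_head	list;		/* offset  0            */
	int			free_bytes;	/* offset 16, hot write */
	struct pcpu_block_md	chunk_md;	/* offset 20, hot write */
	void			*base_addr;	/* offset 56, hot read  */

	/* after: bound_map occupies bytes 56..63 instead, pushing
	 * base_addr to offset 64, the start of a new cache line */
	unsigned long		*bound_map;	/* offset 56            */
	void			*base_addr;	/* offset 64, hot read  */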

With this change, on an Intel Sapphire Rapids 112C/224T platform with
a v6.4-rc4 base kernel, the score with 160 parallel copies improves
by 24%.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 mm/percpu-internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..981eeb2ad0a9 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,10 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
+	unsigned long		*bound_map;	/* boundary map */
 	void			*base_addr;	/* base address of this chunk */
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
-- 
2.39.3



* Re: [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-06 12:54 [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing Yu Ma
@ 2023-06-06 19:21 ` Liam R. Howlett
  2023-06-06 21:25   ` Dennis Zhou
  2023-06-07 14:50 ` [PATCH v2] " Yu Ma
  1 sibling, 1 reply; 12+ messages in thread
From: Liam R. Howlett @ 2023-06-06 19:21 UTC (permalink / raw)
  To: Yu Ma
  Cc: akpm, tim.c.chen, linux-mm, linux-kernel, dave.hansen,
	dan.j.williams, shakeelb, pan.deng, tianyou.li, lipeng.zhu,
	tim.c.chen

* Yu Ma <yu.ma@intel.com> [230606 08:27]:
> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
> 
> UnixBench/Execl represents a class of workload where bash scripts
> are spawned frequently to do some short jobs. It will do system call on
> execl frequently, and execl will call mm_init to initialize mm_struct
> of the process. mm_init will call __percpu_counter_init for
> percpu_counters initialization. Then pcpu_alloc is called to read
> the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> it will call pcpu_alloc_area  to allocate memory from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the
> rest free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
>    - 53.54% 0x654278696e552f34
>         main
>         __execve
>         entry_SYSCALL_64_after_hwframe
>         do_syscall_64
>         __x64_sys_execve
>         do_execveat_common.isra.47
>         alloc_bprm
>         mm_init
>         __percpu_counter_init
>         pcpu_alloc
>       - __mutex_lock.isra.17
> 
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the 
> last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> to let ‘base_addr’ locate in a new cacheline.
> 
> With this change, on Intel Sapphire Rapids 112C/224T platform,
> based on v6.4-rc4, the 160 parallel score improves by 24%.

Can we have a comment somewhere around this structure to avoid someone
reverting this change by accident?

> 
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  mm/percpu-internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index f9847c131998..981eeb2ad0a9 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,10 @@ struct pcpu_chunk {
>  	struct list_head	list;		/* linked to pcpu_slot lists */
>  	int			free_bytes;	/* free bytes in the chunk */
>  	struct pcpu_block_md	chunk_md;
> +	unsigned long		*bound_map;	/* boundary map */
>  	void			*base_addr;	/* base address of this chunk */
>  
>  	unsigned long		*alloc_map;	/* allocation map */
> -	unsigned long		*bound_map;	/* boundary map */
>  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
>  
>  	void			*data;		/* chunk data */
> -- 
> 2.39.3
> 


* Re: [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-06 19:21 ` Liam R. Howlett
@ 2023-06-06 21:25   ` Dennis Zhou
  2023-06-07 12:50     ` Ma, Yu
  0 siblings, 1 reply; 12+ messages in thread
From: Dennis Zhou @ 2023-06-06 21:25 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Yu Ma, akpm, tim.c.chen, linux-mm, linux-kernel, dave.hansen,
	dan.j.williams, shakeelb, pan.deng, tianyou.li, lipeng.zhu,
	tim.c.chen

Hello,

On Tue, Jun 06, 2023 at 03:21:27PM -0400, Liam R. Howlett wrote:
> * Yu Ma <yu.ma@intel.com> [230606 08:27]:
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> > 
> > UnixBench/Execl represents a class of workload where bash scripts
> > are spawned frequently to do some short jobs. It will do system call on
> > execl frequently, and execl will call mm_init to initialize mm_struct
> > of the process. mm_init will call __percpu_counter_init for
> > percpu_counters initialization. Then pcpu_alloc is called to read
> > the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> > it will call pcpu_alloc_area  to allocate memory from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the
> > rest free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> >    - 53.54% 0x654278696e552f34
> >         main
> >         __execve
> >         entry_SYSCALL_64_after_hwframe
> >         do_syscall_64
> >         __x64_sys_execve
> >         do_execveat_common.isra.47
> >         alloc_bprm
> >         mm_init
> >         __percpu_counter_init
> >         pcpu_alloc
> >       - __mutex_lock.isra.17
> > 
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the 
> > last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> > to let ‘base_addr’ locate in a new cacheline.
> > 
> > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > based on v6.4-rc4, the 160 parallel score improves by 24%.
> 
> Can we have a comment somewhere around this structure to avoid someone
> reverting this change by accident?
> 

I agree with Liam. It was only recently that percpu counters were added
to the mm_struct, so this wasn't originally on the hot path. It's
probably worth reshuffling pcpu_chunk because, as you point out,
base_addr is read-only after init. In general there aren't that many of
these structs on any particular host, so it's probably good to just
annotate with ____cacheline_aligned_in_smp and potentially reshuffle a
few other variables.
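
A minimal sketch of what that could look like (illustrative only, not
the actual patch; ____cacheline_aligned_in_smp aligns the member to a
cache line boundary on SMP builds, padding the struct as needed):

	struct pcpu_chunk {
		/* fields written on alloc/free ... */
		int			free_bytes;
		struct pcpu_block_md	chunk_md;

		/* read-mostly after init: give it its own cache line */
		void			*base_addr ____cacheline_aligned_in_smp;
		/* ... */
	};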

Another optimization here is batch allocation, which hasn't been done
yet (essentially allocating an array of percpu variables all at once,
while allowing their lifetimes to be independent).

PS - I know I'm not super active, but please cc me on percpu changes.

Thanks,
Dennis

> > 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> >  mm/percpu-internal.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > index f9847c131998..981eeb2ad0a9 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,10 @@ struct pcpu_chunk {
> >  	struct list_head	list;		/* linked to pcpu_slot lists */
> >  	int			free_bytes;	/* free bytes in the chunk */
> >  	struct pcpu_block_md	chunk_md;
> > +	unsigned long		*bound_map;	/* boundary map */
> >  	void			*base_addr;	/* base address of this chunk */
> >  
> >  	unsigned long		*alloc_map;	/* allocation map */
> > -	unsigned long		*bound_map;	/* boundary map */
> >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> >  
> >  	void			*data;		/* chunk data */
> > -- 
> > 2.39.3
> > 
> 


* RE: [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-06 21:25   ` Dennis Zhou
@ 2023-06-07 12:50     ` Ma, Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Ma, Yu @ 2023-06-07 12:50 UTC (permalink / raw)
  To: Dennis Zhou, Liam R. Howlett
  Cc: akpm@linux-foundation.org, Chen, Tim C, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Hansen, Dave, Williams, Dan J,
	shakeelb@google.com, Deng, Pan, Li, Tianyou, Zhu, Lipeng,
	tim.c.chen@linux.intel.com

> Hello,
> 
> On Tue, Jun 06, 2023 at 03:21:27PM -0400, Liam R. Howlett wrote:
> > * Yu Ma <yu.ma@intel.com> [230606 08:27]:
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize
> > > mm_struct of the process. mm_init will call __percpu_counter_init
> > > for percpu_counters initialization. Then pcpu_alloc is called to
> > > read the base_addr of pcpu_chunk for memory allocation. Inside
> > > pcpu_alloc, it will call pcpu_alloc_area  to allocate memory from a
> specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> > >    - 53.54% 0x654278696e552f34
> > >         main
> > >         __execve
> > >         entry_SYSCALL_64_after_hwframe
> > >         do_syscall_64
> > >         __x64_sys_execve
> > >         do_execveat_common.isra.47
> > >         alloc_bprm
> > >         mm_init
> > >         __percpu_counter_init
> > >         pcpu_alloc
> > >       - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> >
> > Can we have a comment somewhere around this structure to avoid
> someone
> > reverting this change by accident?
> >
> 
> I agree with Liam. It was only recently percpu was added to the mm_struct so
> this wasn't originally on the hot path. It's probably worth reshuffling around
> pcpu_chunk because as you point out base_addr is read_only after init.
> There in general aren't that many of these structs on any particular host, so
> its probably good to just annotate with ____cacheline_aligned_in_smp and
> potentially reshuffle around a few other variables.
> 
Thanks Liam and Dennis for the quick feedback. I'll send out an updated patch with a comment around the structure.

> Another optimization here is a batch allocation which hasn't been done yet
> (allocate essentially an array of percpu variables all at once, but allow for
> their lifetimes to be independent).
> 
> PS - I know I'm not super active, but please cc me on percpu changes.
> 
LOL, sure :)

> Thanks,
> Dennis
> 
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > >  mm/percpu-internal.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > > f9847c131998..981eeb2ad0a9 100644
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,10 @@ struct pcpu_chunk {
> > >  	struct list_head	list;		/* linked to pcpu_slot lists */
> > >  	int			free_bytes;	/* free bytes in the chunk */
> > >  	struct pcpu_block_md	chunk_md;
> > > +	unsigned long		*bound_map;	/* boundary map */
> > >  	void			*base_addr;	/* base address of this chunk
> */
> > >
> > >  	unsigned long		*alloc_map;	/* allocation map */
> > > -	unsigned long		*bound_map;	/* boundary map */
> > >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> > >
> > >  	void			*data;		/* chunk data */
> > > --
> > > 2.39.3
> > >
> >


* [PATCH v2] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-06 12:54 [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing Yu Ma
  2023-06-06 19:21 ` Liam R. Howlett
@ 2023-06-07 14:50 ` Yu Ma
  2023-06-07 15:02   ` Ma, Yu
  1 sibling, 1 reply; 12+ messages in thread
From: Yu Ma @ 2023-06-07 14:50 UTC (permalink / raw)
  To: yu.ma
  Cc: Liam.Howlett, akpm, dan.j.williams, dave.hansen, linux-kernel,
	linux-mm, lipeng.zhu, pan.deng, shakeelb, tianyou.li, tim.c.chen,
	tim.c.chen

When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts
are spawned frequently to run short jobs. Such a workload issues the
execl system call frequently, and execl calls mm_init to initialize
the mm_struct of the process. mm_init calls __percpu_counter_init to
initialize the percpu counters, which in turn calls pcpu_alloc.
pcpu_alloc reads the base_addr of a pcpu_chunk for memory allocation
and then calls pcpu_alloc_area to allocate memory from the specified
chunk. pcpu_alloc_area updates "free_bytes" and "chunk_md" to record
the remaining free bytes and other metadata of the chunk.
Correspondingly, pcpu_free_area also updates these two members when
memory is freed.
The call trace from perf is shown below:
+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
   - 53.54% 0x654278696e552f34
        main
        __execve
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_execve
        do_execveat_common.isra.47
        alloc_bprm
        mm_init
        __percpu_counter_init
        pcpu_alloc
      - __mutex_lock.isra.17

In the current pcpu_chunk layout, 'base_addr' is in the same cache
line as 'free_bytes' and 'chunk_md', occupying the last 8 bytes of
that cache line. This patch moves 'bound_map' up, ahead of
'base_addr', so that 'base_addr' starts on a new cache line.

With this change, on an Intel Sapphire Rapids 112C/224T platform with
a v6.4-rc4 base kernel, the score with 160 parallel copies improves
by 24%.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 mm/percpu-internal.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..ecc7be1ec876 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,16 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
+	unsigned long		*bound_map;	/* boundary map */
+	
+	/*
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
 	void			*base_addr;	/* base address of this chunk */
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
-- 
2.39.3



* RE: [PATCH v2] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-07 14:50 ` [PATCH v2] " Yu Ma
@ 2023-06-07 15:02   ` Ma, Yu
  2023-06-09 18:20     ` Dennis Zhou
  0 siblings, 1 reply; 12+ messages in thread
From: Ma, Yu @ 2023-06-07 15:02 UTC (permalink / raw)
  To: Liam.Howlett@Oracle.com, Dennis Zhou, akpm@linux-foundation.org
  Cc: Williams, Dan J, Hansen, Dave, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Zhu, Lipeng, Deng, Pan, shakeelb@google.com,
	Li, Tianyou, Chen, Tim C, tim.c.chen@linux.intel.com

Thanks Liam and Dennis for the review. Here is the updated patch with the comment added:

> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
> 
> UnixBench/Execl represents a class of workload where bash scripts are
> spawned frequently to do some short jobs. It will do system call on execl
> frequently, and execl will call mm_init to initialize mm_struct of the process.
> mm_init will call __percpu_counter_init for percpu_counters initialization.
> Then pcpu_alloc is called to read the base_addr of pcpu_chunk for memory
> allocation. Inside pcpu_alloc, it will call pcpu_alloc_area  to allocate memory
> from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the rest
> free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
>    - 53.54% 0x654278696e552f34
>         main
>         __execve
>         entry_SYSCALL_64_after_hwframe
>         do_syscall_64
>         __x64_sys_execve
>         do_execveat_common.isra.47
>         alloc_bprm
>         mm_init
>         __percpu_counter_init
>         pcpu_alloc
>       - __mutex_lock.isra.17
> 
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line with
> ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8 bytes. This
> patch moves ‘bound_map’ up to ‘base_addr’, to let ‘base_addr’ locate in a
> new cacheline.
> 
> With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> v6.4-rc4, the 160 parallel score improves by 24%.
> 
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  mm/percpu-internal.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> f9847c131998..ecc7be1ec876 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,16 @@ struct pcpu_chunk {
>  	struct list_head	list;		/* linked to pcpu_slot lists */
>  	int			free_bytes;	/* free bytes in the chunk */
>  	struct pcpu_block_md	chunk_md;
> +	unsigned long		*bound_map;	/* boundary map */
> +
> +	/*
> +	 * To reduce false sharing, current layout is optimized to make sure
> +	 * base_addr locate in the different cacheline with free_bytes and
> +	 * chunk_md.
> +	 */
>  	void			*base_addr;	/* base address of this chunk
> */
> 
>  	unsigned long		*alloc_map;	/* allocation map */
> -	unsigned long		*bound_map;	/* boundary map */
>  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> 
>  	void			*data;		/* chunk data */
> --
> 2.39.3



* Re: [PATCH v2] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-07 15:02   ` Ma, Yu
@ 2023-06-09 18:20     ` Dennis Zhou
  2023-06-10  0:12       ` Ma, Yu
  2023-06-10  3:07       ` [PATCH v3] " Yu Ma
  0 siblings, 2 replies; 12+ messages in thread
From: Dennis Zhou @ 2023-06-09 18:20 UTC (permalink / raw)
  To: Ma, Yu
  Cc: Liam.Howlett@Oracle.com, Dennis Zhou, akpm@linux-foundation.org,
	Williams, Dan J, Hansen, Dave, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Zhu, Lipeng, Deng, Pan, shakeelb@google.com,
	Li, Tianyou, Chen, Tim C, tim.c.chen@linux.intel.com

Hi Yu,

On Wed, Jun 07, 2023 at 03:02:32PM +0000, Ma, Yu wrote:
> Thanks Liam and Dennis for review, this is updated patch with comment around:
> 
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> > 
> > UnixBench/Execl represents a class of workload where bash scripts are
> > spawned frequently to do some short jobs. It will do system call on execl
> > frequently, and execl will call mm_init to initialize mm_struct of the process.
> > mm_init will call __percpu_counter_init for percpu_counters initialization.
> > Then pcpu_alloc is called to read the base_addr of pcpu_chunk for memory
> > allocation. Inside pcpu_alloc, it will call pcpu_alloc_area  to allocate memory
> > from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the rest
> > free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> >    - 53.54% 0x654278696e552f34
> >         main
> >         __execve
> >         entry_SYSCALL_64_after_hwframe
> >         do_syscall_64
> >         __x64_sys_execve
> >         do_execveat_common.isra.47
> >         alloc_bprm
> >         mm_init
> >         __percpu_counter_init
> >         pcpu_alloc
> >       - __mutex_lock.isra.17
> > 
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line with
> > ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8 bytes. This
> > patch moves ‘bound_map’ up to ‘base_addr’, to let ‘base_addr’ locate in a
> > new cacheline.
> > 
> > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > v6.4-rc4, the 160 parallel score improves by 24%.
> > 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> >  mm/percpu-internal.h | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > f9847c131998..ecc7be1ec876 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,16 @@ struct pcpu_chunk {
> >  	struct list_head	list;		/* linked to pcpu_slot lists */
> >  	int			free_bytes;	/* free bytes in the chunk */
> >  	struct pcpu_block_md	chunk_md;
> > +	unsigned long		*bound_map;	/* boundary map */
> > +
> > +	/*
> > +	 * To reduce false sharing, current layout is optimized to make sure
> > +	 * base_addr locate in the different cacheline with free_bytes and
> > +	 * chunk_md.
> > +	 */
> >  	void			*base_addr;	/* base address of this chunk
> > */
> > 
> >  	unsigned long		*alloc_map;	/* allocation map */
> > -	unsigned long		*bound_map;	/* boundary map */
> >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> > 
> >  	void			*data;		/* chunk data */
> > --
> > 2.39.3
> 

Thanks for adding the comment, but would you mind adding
____cacheline_aligned_in_smp? Unless that's something we're trying to
avoid, I think this is a good use case for it both on the pcpu_chunk and
specifically on base_addr as that's what we're accessing without a lock.

Thanks,
Dennis


* RE: [PATCH v2] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-09 18:20     ` Dennis Zhou
@ 2023-06-10  0:12       ` Ma, Yu
  2023-06-10  3:07       ` [PATCH v3] " Yu Ma
  1 sibling, 0 replies; 12+ messages in thread
From: Ma, Yu @ 2023-06-10  0:12 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Liam.Howlett@Oracle.com, akpm@linux-foundation.org,
	Williams, Dan J, Hansen, Dave, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Zhu, Lipeng, Deng, Pan, shakeelb@google.com,
	Li, Tianyou, Chen, Tim C, tim.c.chen@linux.intel.com

> Hi Yu,
> 
> On Wed, Jun 07, 2023 at 03:02:32PM +0000, Ma, Yu wrote:
> > Thanks Liam and Dennis for review, this is updated patch with comment
> around:
> >
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize mm_struct of
> the process.
> > > mm_init will call __percpu_counter_init for percpu_counters initialization.
> > > Then pcpu_alloc is called to read the base_addr of pcpu_chunk for
> > > memory allocation. Inside pcpu_alloc, it will call pcpu_alloc_area
> > > to allocate memory from a specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> > >    - 53.54% 0x654278696e552f34
> > >         main
> > >         __execve
> > >         entry_SYSCALL_64_after_hwframe
> > >         do_syscall_64
> > >         __x64_sys_execve
> > >         do_execveat_common.isra.47
> > >         alloc_bprm
> > >         mm_init
> > >         __percpu_counter_init
> > >         pcpu_alloc
> > >       - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > >  mm/percpu-internal.h | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > > f9847c131998..ecc7be1ec876 100644
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,16 @@ struct pcpu_chunk {
> > >  	struct list_head	list;		/* linked to pcpu_slot lists */
> > >  	int			free_bytes;	/* free bytes in the chunk */
> > >  	struct pcpu_block_md	chunk_md;
> > > +	unsigned long		*bound_map;	/* boundary map */
> > > +
> > > +	/*
> > > +	 * To reduce false sharing, current layout is optimized to make sure
> > > +	 * base_addr locate in the different cacheline with free_bytes and
> > > +	 * chunk_md.
> > > +	 */
> > >  	void			*base_addr;	/* base address of this chunk
> > > */
> > >
> > >  	unsigned long		*alloc_map;	/* allocation map */
> > > -	unsigned long		*bound_map;	/* boundary map */
> > >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> > >
> > >  	void			*data;		/* chunk data */
> > > --
> > > 2.39.3
> >
> 
> Thanks for adding the comment, but would you mind adding
> ____cacheline_aligned_in_smp? Unless that's something we're trying to
> avoid, I think this is a good use case for it both on the pcpu_chunk and
> specifically on base_addr as that's what we're accessing without a lock.
> 

Thanks Dennis, I'll send out an updated patch with
____cacheline_aligned_in_smp on base_addr :)

> Thanks,
> Dennis

Regards
Yu


* [PATCH v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-09 18:20     ` Dennis Zhou
  2023-06-10  0:12       ` Ma, Yu
@ 2023-06-10  3:07       ` Yu Ma
  2023-06-12 21:43         ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Yu Ma @ 2023-06-10  3:07 UTC (permalink / raw)
  To: dennis, Liam.Howlett
  Cc: akpm, dan.j.williams, dave.hansen, linux-kernel, linux-mm,
	lipeng.zhu, pan.deng, shakeelb, tianyou.li, tim.c.chen,
	tim.c.chen, yu.ma

When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts
are spawned frequently to run short jobs. Such a workload issues the
execl system call frequently, and execl calls mm_init to initialize
the mm_struct of the process. mm_init calls __percpu_counter_init to
initialize the percpu counters, which in turn calls pcpu_alloc.
pcpu_alloc reads the base_addr of a pcpu_chunk for memory allocation
and then calls pcpu_alloc_area to allocate memory from the specified
chunk. pcpu_alloc_area updates "free_bytes" and "chunk_md" to record
the remaining free bytes and other metadata of the chunk.
Correspondingly, pcpu_free_area also updates these two members when
memory is freed.
The call trace from perf is shown below:
+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
   - 53.54% 0x654278696e552f34
        main
        __execve
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_execve
        do_execveat_common.isra.47
        alloc_bprm
        mm_init
        __percpu_counter_init
        pcpu_alloc
      - __mutex_lock.isra.17

In the current pcpu_chunk layout, 'base_addr' is in the same cache
line as 'free_bytes' and 'chunk_md', occupying the last 8 bytes of
that cache line. This patch moves 'bound_map' up, ahead of
'base_addr', so that 'base_addr' starts on a new cache line.

With this change, on an Intel Sapphire Rapids 112C/224T platform with
a v6.4-rc4 base kernel, the score with 160 parallel copies improves
by 24%.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 mm/percpu-internal.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..7f108b25bb93 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,17 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
-	void			*base_addr;	/* base address of this chunk */
+	unsigned long		*bound_map;	/* boundary map */
+	
+	/* 
+	 * base_addr is the base address of this chunk.
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
+	void			*base_addr ____cacheline_aligned_in_smp;
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
-- 
2.39.3



* Re: [PATCH v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-10  3:07       ` [PATCH v3] " Yu Ma
@ 2023-06-12 21:43         ` Andrew Morton
  2023-06-12 21:55           ` Dennis Zhou
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2023-06-12 21:43 UTC (permalink / raw)
  To: Yu Ma
  Cc: dennis, Liam.Howlett, dan.j.williams, dave.hansen, linux-kernel,
	linux-mm, lipeng.zhu, pan.deng, shakeelb, tianyou.li, tim.c.chen,
	tim.c.chen

On Fri,  9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@intel.com> wrote:

> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
> 
> UnixBench/Execl represents a class of workload where bash scripts
> are spawned frequently to do some short jobs. It will do system call on
> execl frequently, and execl will call mm_init to initialize mm_struct
> of the process. mm_init will call __percpu_counter_init for
> percpu_counters initialization. Then pcpu_alloc is called to read
> the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> it will call pcpu_alloc_area  to allocate memory from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the
> rest free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
>    - 53.54% 0x654278696e552f34
>         main
>         __execve
>         entry_SYSCALL_64_after_hwframe
>         do_syscall_64
>         __x64_sys_execve
>         do_execveat_common.isra.47
>         alloc_bprm
>         mm_init
>         __percpu_counter_init
>         pcpu_alloc
>       - __mutex_lock.isra.17
> 
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> to let ‘base_addr’ locate in a new cacheline.
> 
> With this change, on Intel Sapphire Rapids 112C/224T platform,
> based on v6.4-rc4, the 160 parallel score improves by 24%.

Well that's nice.

>
> ...
>
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,17 @@ struct pcpu_chunk {
>  	struct list_head	list;		/* linked to pcpu_slot lists */
>  	int			free_bytes;	/* free bytes in the chunk */
>  	struct pcpu_block_md	chunk_md;
> -	void			*base_addr;	/* base address of this chunk */
> +	unsigned long		*bound_map;	/* boundary map */
> +	
> +	/* 
> +	 * base_addr is the base address of this chunk.
> +	 * To reduce false sharing, current layout is optimized to make sure
> +	 * base_addr locate in the different cacheline with free_bytes and
> +	 * chunk_md.
> +	 */
> +	void			*base_addr ____cacheline_aligned_in_smp;
>  
>  	unsigned long		*alloc_map;	/* allocation map */
> -	unsigned long		*bound_map;	/* boundary map */
>  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
>  
>  	void			*data;		/* chunk data */

This will of course consume more memory.  Do we have a feel for the
worst-case impact of this?



* Re: [PATCH v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-12 21:43         ` Andrew Morton
@ 2023-06-12 21:55           ` Dennis Zhou
  2023-06-13 17:41             ` Ma, Yu
  0 siblings, 1 reply; 12+ messages in thread
From: Dennis Zhou @ 2023-06-12 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yu Ma, dennis, Liam.Howlett, dan.j.williams, dave.hansen,
	linux-kernel, linux-mm, lipeng.zhu, pan.deng, shakeelb,
	tianyou.li, tim.c.chen, tim.c.chen

Hi Andrew,

On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> On Fri,  9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@intel.com> wrote:
> 
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> > 
> > UnixBench/Execl represents a class of workload where bash scripts
> > are spawned frequently to do some short jobs. It will do system call on
> > execl frequently, and execl will call mm_init to initialize mm_struct
> > of the process. mm_init will call __percpu_counter_init for
> > percpu_counters initialization. Then pcpu_alloc is called to read
> > the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> > it will call pcpu_alloc_area  to allocate memory from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the
> > rest free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> >    - 53.54% 0x654278696e552f34
> >         main
> >         __execve
> >         entry_SYSCALL_64_after_hwframe
> >         do_syscall_64
> >         __x64_sys_execve
> >         do_execveat_common.isra.47
> >         alloc_bprm
> >         mm_init
> >         __percpu_counter_init
> >         pcpu_alloc
> >       - __mutex_lock.isra.17
> > 
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> > last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> > to let ‘base_addr’ locate in a new cacheline.
> > 
> > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > based on v6.4-rc4, the 160 parallel score improves by 24%.
> 
> Well that's nice.
> 
> >
> > ...
> >
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,17 @@ struct pcpu_chunk {
> >  	struct list_head	list;		/* linked to pcpu_slot lists */
> >  	int			free_bytes;	/* free bytes in the chunk */
> >  	struct pcpu_block_md	chunk_md;
> > -	void			*base_addr;	/* base address of this chunk */
> > +	unsigned long		*bound_map;	/* boundary map */
> > +	
> > +	/* 
> > +	 * base_addr is the base address of this chunk.
> > +	 * To reduce false sharing, current layout is optimized to make sure
> > +	 * base_addr locate in the different cacheline with free_bytes and
> > +	 * chunk_md.
> > +	 */
> > +	void			*base_addr ____cacheline_aligned_in_smp;
> >  
> >  	unsigned long		*alloc_map;	/* allocation map */
> > -	unsigned long		*bound_map;	/* boundary map */
> >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> >  
> >  	void			*data;		/* chunk data */
> 
> This will of course consume more memory.  Do we have a feel for the
> worst-case impact of this?
> 

The pcpu_chunk struct is a backing data structure, one per chunk, so
the additional memory should not be dramatic. A chunk covers somewhere
in the ballpark of 64KB to 512KB of memory, depending on config and
boot-time parameters, so I believe the additional memory used here is
nominal.

Working the #s on my desktop:
Percpu:            58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB
of overhead?
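
A rough cross-check using the struct sizes reported later in the
thread (struct pcpu_chunk grows from 136 to 192 bytes with the v3
patch):

	(192 - 136) bytes/chunk * ~40 chunks ≈ 2.2 KB extra

which is in line with the estimate above.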

I believe we can do a little better to avoid eating that full padding,
so likely less than that.

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis


* RE: [PATCH v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing
  2023-06-12 21:55           ` Dennis Zhou
@ 2023-06-13 17:41             ` Ma, Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Ma, Yu @ 2023-06-13 17:41 UTC (permalink / raw)
  To: Dennis Zhou, Andrew Morton
  Cc: Liam.Howlett@oracle.com, Williams, Dan J, Hansen, Dave,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, Zhu, Lipeng,
	Deng, Pan, shakeelb@google.com, Li, Tianyou, Chen, Tim C,
	tim.c.chen@linux.intel.com


> Hi Andrew,
> 
> On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> > On Fri,  9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@intel.com> wrote:
> >
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize
> > > mm_struct of the process. mm_init will call __percpu_counter_init
> > > for percpu_counters initialization. Then pcpu_alloc is called to
> > > read the base_addr of pcpu_chunk for memory allocation. Inside
> > > pcpu_alloc, it will call pcpu_alloc_area  to allocate memory from a
> specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> > >    - 53.54% 0x654278696e552f34
> > >         main
> > >         __execve
> > >         entry_SYSCALL_64_after_hwframe
> > >         do_syscall_64
> > >         __x64_sys_execve
> > >         do_execveat_common.isra.47
> > >         alloc_bprm
> > >         mm_init
> > >         __percpu_counter_init
> > >         pcpu_alloc
> > >       - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> >
> > Well that's nice.
> >
> > >
> > > ...
> > >
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,17 @@ struct pcpu_chunk {
> > >  	struct list_head	list;		/* linked to pcpu_slot lists */
> > >  	int			free_bytes;	/* free bytes in the chunk */
> > >  	struct pcpu_block_md	chunk_md;
> > > -	void			*base_addr;	/* base address of this chunk
> */
> > > +	unsigned long		*bound_map;	/* boundary map */
> > > +
> > > +	/*
> > > +	 * base_addr is the base address of this chunk.
> > > +	 * To reduce false sharing, current layout is optimized to make sure
> > > +	 * base_addr locate in the different cacheline with free_bytes and
> > > +	 * chunk_md.
> > > +	 */
> > > +	void			*base_addr ____cacheline_aligned_in_smp;
> > >
> > >  	unsigned long		*alloc_map;	/* allocation map */
> > > -	unsigned long		*bound_map;	/* boundary map */
> > >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> > >
> > >  	void			*data;		/* chunk data */
> >
> > This will of course consume more memory.  Do we have a feel for the
> > worst-case impact of this?
> >
> 
> The pcpu_chunk struct is a backing data structure per chunk, so the
> additional memory should not be dramatic. A chunk covers ballpark between
> 64kb and 512kb memory depending on some config and boot time stuff, so I
> believe the additional memory used here is nominal at best.
> 
> Working the #s on my desktop:
> Percpu:            58624 kB
> 28 cores -> ~2.1MB of percpu memory.
> At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
> Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB of
> overhead?
> 
> I believe we can do a little better to avoid eating that full padding, so likely
> less than that.
> 
> Acked-by: Dennis Zhou <dennis@kernel.org>
>

Thanks Andrew and Dennis for agreeing on the patch.
For reference, the layout of this structure (printed by pahole) before and after the patch is shown below.
The default size is 136 bytes across 3 cachelines; with patch v3 it is 192 bytes, including 56 bytes of padding.
"____cacheline_aligned_in_smp" was initially left out because of the same concern about memory, since the same
performance gain can be obtained by reshuffling base_addr alone. Thanks to Dennis's expertise on the overall
usage, it is now added to make the intent clearer and to make future changes easier.
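
(For reference, layout dumps like the ones below can be produced with
pahole, e.g. "pahole -C pcpu_chunk vmlinux"; the exact invocation is an
assumption, not stated in the original mail.)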

--default v6.4-rc4--
struct pcpu_chunk {
        struct list_head           list;                 /*     0    16 */
        int                        free_bytes;           /*    16     4 */
        struct pcpu_block_md       chunk_md;             /*    20    32 */

        /* XXX 4 bytes hole, try to pack */

        void *                     base_addr;            /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int *        alloc_map;            /*    64     8 */
        long unsigned int *        bound_map;            /*    72     8 */
        struct pcpu_block_md *     md_blocks;            /*    80     8 */
        void *                     data;                 /*    88     8 */
        bool                       immutable;            /*    96     1 */
        bool                       isolated;             /*    97     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        start_offset;         /*   100     4 */
        int                        end_offset;           /*   104     4 */

        /* XXX 4 bytes hole, try to pack */

        struct obj_cgroup * *      obj_cgroups;          /*   112     8 */
        int                        nr_pages;             /*   120     4 */
        int                        nr_populated;         /*   124     4 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        nr_empty_pop_pages;   /*   128     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          populated[];          /*   136     0 */

        /* size: 136, cachelines: 3, members: 17 */
        /* sum members: 122, holes: 4, sum holes: 14 */
        /* last cacheline: 8 bytes */
};

--with patch v3--
struct pcpu_chunk {
        struct list_head           list;                 /*     0    16 */
        int                        free_bytes;           /*    16     4 */
        struct pcpu_block_md       chunk_md;             /*    20    32 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int *        bound_map;            /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void *                     base_addr;            /*    64     8 */
        long unsigned int *        alloc_map;            /*    72     8 */
        struct pcpu_block_md *     md_blocks;            /*    80     8 */
        void *                     data;                 /*    88     8 */
        bool                       immutable;            /*    96     1 */
        bool                       isolated;             /*    97     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        start_offset;         /*   100     4 */
        int                        end_offset;           /*   104     4 */

        /* XXX 4 bytes hole, try to pack */

        struct obj_cgroup * *      obj_cgroups;          /*   112     8 */
        int                        nr_pages;             /*   120     4 */
        int                        nr_populated;         /*   124     4 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        nr_empty_pop_pages;   /*   128     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          populated[];          /*   136     0 */

        /* size: 192, cachelines: 3, members: 17 */
        /* sum members: 122, holes: 4, sum holes: 14 */
        /* padding: 56 */
};


Regards
Yu

> Thanks,
> Dennis


Thread overview: 12+ messages
2023-06-06 12:54 [PATCH] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing Yu Ma
2023-06-06 19:21 ` Liam R. Howlett
2023-06-06 21:25   ` Dennis Zhou
2023-06-07 12:50     ` Ma, Yu
2023-06-07 14:50 ` [PATCH v2] " Yu Ma
2023-06-07 15:02   ` Ma, Yu
2023-06-09 18:20     ` Dennis Zhou
2023-06-10  0:12       ` Ma, Yu
2023-06-10  3:07       ` [PATCH v3] " Yu Ma
2023-06-12 21:43         ` Andrew Morton
2023-06-12 21:55           ` Dennis Zhou
2023-06-13 17:41             ` Ma, Yu
