Linux-mm Archive mirror
* [PATCH v2] mm: page_counter: relayout structure to reduce false sharing
@ 2021-01-19  7:20 Feng Tang
  2021-01-19 16:39 ` Shakeel Butt
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Feng Tang @ 2021-01-19  7:20 UTC
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, linux-mm, linux-kernel
  Cc: Feng Tang

While investigating a memory cgroup related performance regression [1],
perf c2c profiling data showed heavy false sharing between accesses
to 'usage' and 'parent' in struct page_counter.

On 64-bit systems, 'usage' and 'parent' are close to each other and
can easily share one cacheline (for cacheline sizes of 64 B or more).
'usage' is mostly written, while 'parent' is mostly read, because of
the cgroup's hierarchical counting.
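
To illustrate the access pattern described above, here is a minimal
userspace sketch of hierarchical charging. It is not the kernel code:
struct demo_counter and the demo_* helpers are hypothetical names, and
C11 atomics stand in for the kernel's atomic_long_t. Each level reads
'parent' once and writes 'usage' once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct page_counter. */
struct demo_counter {
	atomic_long usage;		/* written on every charge */
	struct demo_counter *parent;	/* only read while walking up */
};

static void demo_init(struct demo_counter *c, struct demo_counter *parent)
{
	atomic_init(&c->usage, 0);
	c->parent = parent;
}

/* Charge nr_pages to a counter and to every ancestor above it. */
static void demo_charge(struct demo_counter *counter, long nr_pages)
{
	struct demo_counter *c;

	assert(counter != NULL);
	for (c = counter; c; c = c->parent)
		atomic_fetch_add(&c->usage, nr_pages);
}

/* Build a root <- mid <- leaf chain, charge the leaf, return root usage. */
static long demo_root_usage_after_charge(long nr_pages)
{
	struct demo_counter root, mid, leaf;

	demo_init(&root, NULL);
	demo_init(&mid, &root);
	demo_init(&leaf, &mid);

	demo_charge(&leaf, nr_pages);
	return atomic_load(&root.usage);
}
```

A charge on the leaf shows up in the root, so with many CPUs charging
different children, the shared ancestors see frequent 'usage' writes
next to 'parent' reads.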

So move 'parent' to the end of the structure so that 'usage' and
'parent' land in different cache lines.
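
The effect of the move can be checked with offsetof(). The sketch
below is an assumption-laden userspace model, not the kernel header:
struct demo_old/demo_new are hypothetical names, plain long stands in
for atomic_long_t, the fields not visible in the diff are assumed from
the v5.11 layout, the cacheline size is assumed to be 64 bytes, and the
structure is assumed to start on a cacheline boundary.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHELINE 64	/* assumed cacheline size in bytes */

/* Layout before the patch: 'parent' follows the limits. */
struct demo_old {
	long usage;
	unsigned long min;
	unsigned long low;
	unsigned long high;
	unsigned long max;
	struct demo_old *parent;
	unsigned long emin;
	long min_usage;
	long children_min_usage;
	unsigned long elow;
	long low_usage;
	long children_low_usage;
	unsigned long watermark;
	unsigned long failcnt;
};

/* Layout after the patch: 'parent' moved to the end. */
struct demo_new {
	long usage;
	unsigned long min;
	unsigned long low;
	unsigned long high;
	unsigned long max;
	unsigned long emin;
	long min_usage;
	long children_min_usage;
	unsigned long elow;
	long low_usage;
	long children_low_usage;
	unsigned long watermark;
	unsigned long failcnt;
	struct demo_new *parent;
};

static_assert(offsetof(struct demo_new, usage) == 0, "usage comes first");

/* Cacheline index of a byte offset within the structure. */
static size_t demo_line(size_t offset)
{
	return offset / DEMO_CACHELINE;
}

static size_t demo_usage_line(void)
{
	return demo_line(offsetof(struct demo_new, usage));
}

static size_t demo_old_parent_line(void)
{
	return demo_line(offsetof(struct demo_old, parent));
}

static size_t demo_new_parent_line(void)
{
	return demo_line(offsetof(struct demo_new, parent));
}
```

With 8-byte long and pointers, 'parent' sits at offset 40 in the old
layout (line 0, shared with 'usage') and at offset 104 in the new one
(line 1). Setting DEMO_CACHELINE to 128 puts both offsets in line 0 of
this sketch.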

Following are some performance data with the patch, against
v5.11-rc1. [ In the data, A is a platform with 2 sockets 48C/96T,
B is a platform with 4 sockets 72C/144T, a %stddev is shown only
when it is bigger than 2%, and P100/P50 means the number of test
tasks equals 100%/50% of nr_cpu ]

will-it-scale/malloc1
---------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P100	     15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
A-P50	     21511            +8.9%      23432        will-it-scale.per_process_ops
B-P100	      9155            +2.2%       9357        will-it-scale.per_process_ops
B-P50	     10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops

will-it-scale/pagefault2
------------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P100	     79028            +3.0%      81411        will-it-scale.per_process_ops
A-P50	    183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
B-P100	     85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
B-P50	    198195            +9.8%     217526        will-it-scale.per_process_ops

fio (4k/1M is block size)
-------------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps
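
The middle column in the tables above appears to be the relative change
of the patched value against the baseline. A small sketch of that
arithmetic, with demo_pct_change() as a hypothetical helper name:

```c
#include <assert.h>

/* Relative change of 'patched' against 'base', in percent. */
static double demo_pct_change(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

For the A-P50 malloc1 row, demo_pct_change(21511, 23432) is roughly
8.93, which matches the reported +8.9%.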

[1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
Changelogs:
  
  v2:
  * Adjust the format of performance data to be more readable,
    as suggested by Michal Hocko

 include/linux/page_counter.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 85bd413..6795913 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@ struct page_counter {
 	unsigned long low;
 	unsigned long high;
 	unsigned long max;
-	struct page_counter *parent;
 
 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -27,6 +26,14 @@ struct page_counter {
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
+
+	/*
+	 * 'parent' is placed here to be far from 'usage' to reduce
+	 * cache false sharing, as 'usage' is written mostly while
+	 * parent is frequently read for cgroup's hierarchical
+	 * counting nature.
+	 */
+	struct page_counter *parent;
 };
 
 #if BITS_PER_LONG == 32
-- 
2.7.4




* Re: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing
  2021-01-19  7:20 [PATCH v2] mm: page_counter: relayout structure to reduce false sharing Feng Tang
@ 2021-01-19 16:39 ` Shakeel Butt
  2021-01-19 17:00 ` Johannes Weiner
  2021-01-20  7:56 ` Michal Hocko
  2 siblings, 0 replies; 4+ messages in thread
From: Shakeel Butt @ 2021-01-19 16:39 UTC
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Roman Gushchin,
	Linux MM, LKML

On Mon, Jan 18, 2021 at 11:20 PM Feng Tang <feng.tang@intel.com> wrote:
>
> While investigating a memory cgroup related performance regression [1],
> perf c2c profiling data showed heavy false sharing between accesses
> to 'usage' and 'parent' in struct page_counter.
>
> On 64-bit systems, 'usage' and 'parent' are close to each other and
> can easily share one cacheline (for cacheline sizes of 64 B or more).
> 'usage' is mostly written, while 'parent' is mostly read, because of
> the cgroup's hierarchical counting.
>
> So move 'parent' to the end of the structure so that 'usage' and
> 'parent' land in different cache lines.
>
> Following are some performance data with the patch, against
> v5.11-rc1. [ In the data, A is a platform with 2 sockets 48C/96T,
> B is a platform with 4 sockets 72C/144T, a %stddev is shown only
> when it is bigger than 2%, and P100/P50 means the number of test
> tasks equals 100%/50% of nr_cpu ]
>
> will-it-scale/malloc1
> ---------------------
>            v5.11-rc1                    v5.11-rc1+patch
>
> A-P100       15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
> A-P50        21511            +8.9%      23432        will-it-scale.per_process_ops
> B-P100        9155            +2.2%       9357        will-it-scale.per_process_ops
> B-P50        10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops
>
> will-it-scale/pagefault2
> ------------------------
>            v5.11-rc1                    v5.11-rc1+patch
>
> A-P100       79028            +3.0%      81411        will-it-scale.per_process_ops
> A-P50       183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
> B-P100       85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
> B-P50       198195            +9.8%     217526        will-it-scale.per_process_ops
>
> fio (4k/1M is block size)
> -------------------------
>            v5.11-rc1                    v5.11-rc1+patch
>
> A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
> A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
> A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
> A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps
>
> [1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing
  2021-01-19  7:20 [PATCH v2] mm: page_counter: relayout structure to reduce false sharing Feng Tang
  2021-01-19 16:39 ` Shakeel Butt
@ 2021-01-19 17:00 ` Johannes Weiner
  2021-01-20  7:56 ` Michal Hocko
  2 siblings, 0 replies; 4+ messages in thread
From: Johannes Weiner @ 2021-01-19 17:00 UTC
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	linux-mm, linux-kernel

On Tue, Jan 19, 2021 at 03:20:14PM +0800, Feng Tang wrote:
> While investigating a memory cgroup related performance regression [1],
> perf c2c profiling data showed heavy false sharing between accesses
> to 'usage' and 'parent' in struct page_counter.
> 
> On 64-bit systems, 'usage' and 'parent' are close to each other and
> can easily share one cacheline (for cacheline sizes of 64 B or more).
> 'usage' is mostly written, while 'parent' is mostly read, because of
> the cgroup's hierarchical counting.
> 
> So move 'parent' to the end of the structure so that 'usage' and
> 'parent' land in different cache lines.
> 
> Following are some performance data with the patch, against
> v5.11-rc1. [ In the data, A is a platform with 2 sockets 48C/96T,
> B is a platform with 4 sockets 72C/144T, a %stddev is shown only
> when it is bigger than 2%, and P100/P50 means the number of test
> tasks equals 100%/50% of nr_cpu ]
> 
> will-it-scale/malloc1
> ---------------------
> 	   v5.11-rc1			v5.11-rc1+patch
> 
> A-P100	     15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
> A-P50	     21511            +8.9%      23432        will-it-scale.per_process_ops
> B-P100	      9155            +2.2%       9357        will-it-scale.per_process_ops
> B-P50	     10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops
> 
> will-it-scale/pagefault2
> ------------------------
> 	   v5.11-rc1			v5.11-rc1+patch
> 
> A-P100	     79028            +3.0%      81411        will-it-scale.per_process_ops
> A-P50	    183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
> B-P100	     85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
> B-P50	    198195            +9.8%     217526        will-it-scale.per_process_ops
> 
> fio (4k/1M is block size)
> -------------------------
> 	   v5.11-rc1			v5.11-rc1+patch
> 
> A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
> A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
> A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
> A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps
> 
> [1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!



* Re: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing
  2021-01-19  7:20 [PATCH v2] mm: page_counter: relayout structure to reduce false sharing Feng Tang
  2021-01-19 16:39 ` Shakeel Butt
  2021-01-19 17:00 ` Johannes Weiner
@ 2021-01-20  7:56 ` Michal Hocko
  2 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2021-01-20  7:56 UTC
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	linux-mm, linux-kernel

On Tue 19-01-21 15:20:14, Feng Tang wrote:
> While investigating a memory cgroup related performance regression [1],
> perf c2c profiling data showed heavy false sharing between accesses
> to 'usage' and 'parent' in struct page_counter.
> 
> On 64-bit systems, 'usage' and 'parent' are close to each other and
> can easily share one cacheline (for cacheline sizes of 64 B or more).
> 'usage' is mostly written, while 'parent' is mostly read, because of
> the cgroup's hierarchical counting.
> 
> So move 'parent' to the end of the structure so that 'usage' and
> 'parent' land in different cache lines.
> 
> Following are some performance data with the patch, against
> v5.11-rc1. [ In the data, A is a platform with 2 sockets 48C/96T,
> B is a platform with 4 sockets 72C/144T, a %stddev is shown only
> when it is bigger than 2%, and P100/P50 means the number of test
> tasks equals 100%/50% of nr_cpu ]
> 
> will-it-scale/malloc1
> ---------------------
> 	   v5.11-rc1			v5.11-rc1+patch
> 
> A-P100	     15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
> A-P50	     21511            +8.9%      23432        will-it-scale.per_process_ops
> B-P100	      9155            +2.2%       9357        will-it-scale.per_process_ops
> B-P50	     10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops
> 
> will-it-scale/pagefault2
> ------------------------
> 	   v5.11-rc1			v5.11-rc1+patch
> 
> A-P100	     79028            +3.0%      81411        will-it-scale.per_process_ops
> A-P50	    183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
> B-P100	     85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
> B-P50	    198195            +9.8%     217526        will-it-scale.per_process_ops
> 
> fio (4k/1M is block size)
> -------------------------
> 	   v5.11-rc1			v5.11-rc1+patch
> 
> A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
> A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
> A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
> A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps

Thanks for making results easier to read and understand.

> [1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
> Changelogs:
>   
>   v2:
>   * Adjust the format of performance data to be more readable,
>     as suggested by Michal Hocko
> 
>  include/linux/page_counter.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 85bd413..6795913 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -12,7 +12,6 @@ struct page_counter {
>  	unsigned long low;
>  	unsigned long high;
>  	unsigned long max;
> -	struct page_counter *parent;
>  
>  	/* effective memory.min and memory.min usage tracking */
>  	unsigned long emin;
> @@ -27,6 +26,14 @@ struct page_counter {
>  	/* legacy */
>  	unsigned long watermark;
>  	unsigned long failcnt;
> +
> +	/*
> +	 * 'parent' is placed here to be far from 'usage' to reduce
> +	 * cache false sharing, as 'usage' is written mostly while
> +	 * parent is frequently read for cgroup's hierarchical
> +	 * counting nature.
> +	 */
> +	struct page_counter *parent;
>  };
>  
>  #if BITS_PER_LONG == 32
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs

