* [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems
@ 2024-03-27 20:03 Tony Luck
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

This series applies on top of v6.9-rc1 plus these two patches:

Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@intel.com/

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
series, Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way, which allows the monitoring features to be used. The
caveat is that Linux may migrate tasks more frequently between SNC
nodes than between "regular" NUMA nodes, so reading counters from all
SNC nodes may be needed to get a complete picture of activity for a
task.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

This is a new approach triggered by the discussions that started with
"How can users tell that SNC is enabled?" but then drifted into
whether users of the legacy interface would really get what they
expected when reading from monitor files in the mon_L3_* directories.

During that discussion I'd mentioned providing monitor values for both
the L3 level, and also for each SNC node. That would provide full ABI
compatibility while also giving the finer grained reporting from each
SNC node.

The implementation sets up a new rdt_resource to track monitor
resources, with a domain for each SNC node. This resource is only used
when SNC mode is detected.

On SNC systems there is a parent-child relationship between the
old L3 resource and the new SUBL3 resource. Reading from legacy
files like mon_data/mon_L3_00/llc_occupancy reads and sums the RMID
counters from all "child" domains in the SUBL3 resource. E.g. on
an SNC3 system:

$ grep . mon_L3_01/llc_occupancy mon_L3_01/*/llc_occupancy
mon_L3_01/llc_occupancy:413097984
mon_L3_01/mon_SUBL3_03/llc_occupancy:141484032
mon_L3_01/mon_SUBL3_04/llc_occupancy:135659520
mon_L3_01/mon_SUBL3_05/llc_occupancy:135954432

So the mon_L3_01/llc_occupancy file shows the total L3 occupancy,
which is the sum of the cache occupancy on each of the SNC nodes
that share that L3 cache instance.
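The parent/child summation can be modeled in plain C as a rough
userspace sketch (the function name and array layout here are
illustrative, not code from the patches):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of the legacy-file behavior: reading mon_L3_XX/llc_occupancy
 * on an SNC system returns the sum of the RMID counters read from
 * every "child" SUBL3 domain sharing that L3 cache instance.
 */
static unsigned long long sum_subl3_occupancy(const unsigned long long *node_bytes,
					      size_t num_snc_nodes)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < num_snc_nodes; i++)
		total += node_bytes[i];

	return total;
}
```

With the three SUBL3 values from the SNC3 example above, this yields
the 413097984 shown for mon_L3_01/llc_occupancy.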

Patch 0001 has been salvaged from the previous postings.
All the rest are new.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (10):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Add new rdt_resource for sub-node monitoring
  x86/resctrl: Add new "enabled" state for monitor resources
  x86/resctrl: Add pointer to enabled monitor resource
  x86/resctrl: Add parent/child information to rdt_resource and
    rdt_domain
  x86/resctrl: Update mkdir_mondata_subdir() for Sub-NUMA domains
  x86/resctrl: Update rmdir_mondata_subdir_allrdtgrp() for Sub-NUMA
    domains
  x86/resctrl: Mark L3 monitor files with summation flag
  x86/resctrl: Update __mon_event_count() for Sub-NUMA domains
  x86/resctrl: Determine Sub-NUMA configuration

 include/linux/resctrl.h                   |  20 ++-
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  23 ++-
 arch/x86/kernel/cpu/resctrl/core.c        |  76 +++++++---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   3 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 136 +++++++++++++++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 170 +++++++++++++++++-----
 8 files changed, 364 insertions(+), 71 deletions(-)

-- 
2.44.0



* [PATCH 01/10] x86/resctrl: Prepare for new domain scope
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
the scope can be something other than a cache level. This means that
rdt_find_domain() will never return an error, so remove the error checks
from its call sites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  9 ++++-
 arch/x86/kernel/cpu/resctrl/core.c        | 46 ++++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..ed693bfe474d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,13 +150,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		RCU list of all domains for this resource
@@ -174,7 +179,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7751eea19fd2..4c5e985e1388 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -68,7 +68,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -82,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -96,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -108,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -392,9 +392,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -484,6 +481,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -499,7 +509,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -507,12 +517,14 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	lockdep_assert_held(&domain_list_lock);
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
+
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -552,15 +564,21 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b7291f60399c..2bf021d42500 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -577,7 +577,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 884b88e25141..0013b1b39c17 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 02f213f1c51c..b8588ce88eef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1454,10 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.44.0



* [PATCH 02/10] x86/resctrl: Add new rdt_resource for sub-node monitoring
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

When a system enables Sub-NUMA Cluster (SNC) mode, each CPU socket is
divided into two or more NUMA nodes. CPU cores and memory are divided
between these nodes. The L3 cache monitoring functions are implemented
independently on each SNC node.

Add a new rdt_resource as a foundation for the structures to track
monitor resources in each of the L3 sub-nodes.

Add new scope option RESCTRL_NODE to build domains scoped at the
NUMA node level.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                |  1 +
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
 3 files changed, 14 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ed693bfe474d..f39a07b27a98 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -153,6 +153,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8f40fb35db78..24fad5ecc158 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -472,6 +472,7 @@ enum resctrl_res_level {
 	RDT_RESOURCE_L2,
 	RDT_RESOURCE_MBA,
 	RDT_RESOURCE_SMBA,
+	RDT_RESOURCE_SUBL3,
 
 	/* Must be the last */
 	RDT_NUM_RESOURCES,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4c5e985e1388..395bcb3199f8 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -115,6 +115,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.fflags			= RFTYPE_RES_MB,
 		},
 	},
+	[RDT_RESOURCE_SUBL3] =
+	{
+		.r_resctrl = {
+			.rid			= RDT_RESOURCE_SUBL3,
+			.name			= "SUBL3",
+			.scope			= RESCTRL_NODE,
+			.domains		= domain_init(RDT_RESOURCE_SUBL3),
+			.fflags			= RFTYPE_RES_CACHE,
+		},
+	},
 };
 
 /*
@@ -487,6 +497,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.44.0



* [PATCH 03/10] x86/resctrl: Add new "enabled" state for monitor resources
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

When Sub-NUMA cluster mode is enabled, resctrl needs to build domain
lists for both the regular L3 resource (scoped at the L3 cache level)
and the SUBL3 resource (scoped at the NUMA node level).

But only one of these resources will be used for all monitoring
functions.

Add a new "enabled" flag to indicate which resource should allocate
space for MBM counters, run the MBM overflow and LLC occupancy limbo
timers, etc.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                | 2 ++
 arch/x86/kernel/cpu/resctrl/internal.h | 4 ++++
 arch/x86/kernel/cpu/resctrl/core.c     | 4 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c  | 1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++----
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f39a07b27a98..dea79f6a8122 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -161,6 +161,7 @@ enum resctrl_scope {
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
+ * @mon_enabled:	Monitor feature enabled for this resource
  * @num_rmid:		Number of RMIDs available
  * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
@@ -179,6 +180,7 @@ struct rdt_resource {
 	int			rid;
 	bool			alloc_capable;
 	bool			mon_capable;
+	bool			mon_enabled;
 	int			num_rmid;
 	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24fad5ecc158..5fcff861e185 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -514,6 +514,10 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
 	for_each_rdt_resource(r)					      \
 		if (r->mon_capable)
 
+#define for_each_mon_enabled_rdt_resource(r)				      \
+	for_each_rdt_resource(r)					      \
+		if (r->mon_enabled)
+
 /* CPUID.(EAX=10H, ECX=ResID=1).EAX */
 union cpuid_0x10_1_eax {
 	struct {
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 395bcb3199f8..bfa179f20802 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -559,7 +559,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	if (r->mon_enabled && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
@@ -1002,7 +1002,7 @@ static void __exit resctrl_exit(void)
 
 	rdtgroup_exit();
 
-	if (r->mon_capable)
+	if (r->mon_enabled)
 		rdt_put_mon_l3_config();
 }
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c34a35ec0f03..84a2056190c8 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1060,6 +1060,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	l3_mon_evt_init(r);
 
 	r->mon_capable = true;
+	r->mon_enabled = true;
 
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b8588ce88eef..ffcafe567b25 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3150,7 +3150,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
 	 * Create the subdirectories for each domain. Note that all events
 	 * in a domain like L3 are grouped into a resource whose domain is L3
 	 */
-	for_each_mon_capable_rdt_resource(r) {
+	for_each_mon_enabled_rdt_resource(r) {
 		ret = mkdir_mondata_subdir_alldom(kn, r, prgrp);
 		if (ret)
 			goto out_destroy;
@@ -3937,7 +3937,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
 
-	if (!r->mon_capable)
+	if (!r->mon_enabled)
 		goto out_unlock;
 
 	/*
@@ -4011,7 +4011,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 		goto out_unlock;
 	}
 
-	if (!r->mon_capable)
+	if (!r->mon_enabled)
 		goto out_unlock;
 
 	err = domain_setup_mon_state(r, d);
@@ -4074,7 +4074,7 @@ void resctrl_offline_cpu(unsigned int cpu)
 		}
 	}
 
-	if (!l3->mon_capable)
+	if (!l3->mon_enabled)
 		goto out_unlock;
 
 	d = get_domain_from_cpu(cpu, l3);
-- 
2.44.0



* [PATCH 04/10] x86/resctrl: Add pointer to enabled monitor resource
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

Sub-NUMA cluster enabled systems do all monitoring using the SUBL3
resource. Others use the regular L3 resource.

Replace the hard-coded rdt_resources_all[RDT_RESOURCE_L3].r_resctrl
references with a new pointer "rdt_l3_mon_resource".

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  1 +
 arch/x86/kernel/cpu/resctrl/core.c     |  2 +-
 arch/x86/kernel/cpu/resctrl/monitor.c  | 19 ++++++++++++-------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  7 +++----
 4 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5fcff861e185..21d81f51838f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -464,6 +464,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 extern struct mutex rdtgroup_mutex;
 
 extern struct rdt_hw_resource rdt_resources_all[];
+extern struct rdt_resource *rdt_l3_mon_resource;
 extern struct rdtgroup rdtgroup_default;
 extern struct dentry *debugfs_resctrl;
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bfa179f20802..2fa04375da8c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -996,7 +996,7 @@ late_initcall(resctrl_late_init);
 
 static void __exit resctrl_exit(void)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = rdt_l3_mon_resource;
 
 	cpuhp_remove_state(rdt_online);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 84a2056190c8..95455cb187eb 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -78,6 +78,9 @@ static struct rmid_entry	*rmid_ptrs;
  */
 bool rdt_mon_capable;
 
+/* Resource (L3 or SUBL3) that is base for monitoring */
+struct rdt_resource *rdt_l3_mon_resource;
+
 /*
  * Global to indicate which monitoring events are enabled.
  */
@@ -321,7 +324,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
  */
 void __check_limbo(struct rdt_domain *d, bool force_free)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = rdt_l3_mon_resource;
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	struct rmid_entry *entry;
 	u32 idx, cur_idx = 1;
@@ -467,7 +470,7 @@ int alloc_rmid(u32 closid)
 
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_resource *r = rdt_l3_mon_resource;
 	struct rdt_domain *d;
 	u32 idx;
 
@@ -839,7 +842,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		goto out_unlock;
 
-	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	r = rdt_l3_mon_resource;
 	d = container_of(work, struct rdt_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -1036,7 +1039,9 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	 */
 	resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
 
-	ret = dom_data_init(r);
+	rdt_l3_mon_resource = r;
+
+	ret = dom_data_init(rdt_l3_mon_resource);
 	if (ret)
 		return ret;
 
@@ -1057,10 +1062,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 		}
 	}
 
-	l3_mon_evt_init(r);
+	l3_mon_evt_init(rdt_l3_mon_resource);
 
-	r->mon_capable = true;
-	r->mon_enabled = true;
+	rdt_l3_mon_resource->mon_capable = true;
+	rdt_l3_mon_resource->mon_enabled = true;
 
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ffcafe567b25..a2ebd7e051bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2703,7 +2703,7 @@ static int rdt_get_tree(struct fs_context *fc)
 		resctrl_mounted = true;
 
 	if (is_mbm_enabled()) {
-		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+		r = rdt_l3_mon_resource;
 		list_for_each_entry(dom, &r->domains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
@@ -4062,7 +4062,6 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
 
 void resctrl_offline_cpu(unsigned int cpu)
 {
-	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdtgroup *rdtgrp;
 	struct rdt_domain *d;
 
@@ -4074,10 +4073,10 @@ void resctrl_offline_cpu(unsigned int cpu)
 		}
 	}
 
-	if (!l3->mon_enabled)
+	if (!rdt_l3_mon_resource->mon_enabled)
 		goto out_unlock;
 
-	d = get_domain_from_cpu(cpu, l3);
+	d = get_domain_from_cpu(cpu, rdt_l3_mon_resource);
 	if (d) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
-- 
2.44.0



* [PATCH 05/10] x86/resctrl: Add parent/child information to rdt_resource and rdt_domain
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

The L3 resource is a "parent" to the SUBL3 resource in that each of the
domains in L3 is subdivided into two or more SUBL3 domains.

Add parent/child pointers to the rdt_resource structure and a
parent domain id number to the rdt_domain structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                |  6 ++++++
 arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index dea79f6a8122..b7ade7627e08 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -70,6 +70,7 @@ struct resctrl_staged_config {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
+ * @parent:		parent domain number in parent resource
  * @plr:		pseudo-locked region (if any) associated with domain
  * @staged_config:	parsed configuration to be applied
  * @mbps_val:		When mba_sc is enabled, this holds the array of user
@@ -87,6 +88,7 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
+	int				parent;
 	struct pseudo_lock_region	*plr;
 	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
 	u32				*mbps_val;
@@ -173,6 +175,8 @@ enum resctrl_scope {
  * @format_str:		Per resource format string to show domain value
  * @parse_ctrlval:	Per resource function pointer to parse control values
  * @evt_list:		List of monitoring events
+ * @parent:		Parent of this resource
+ * @child:		Child of this resource
  * @fflags:		flags to choose base and info files
  * @cdp_capable:	Is the CDP feature available on this resource
  */
@@ -194,6 +198,8 @@ struct rdt_resource {
 						 struct resctrl_schema *s,
 						 struct rdt_domain *d);
 	struct list_head	evt_list;
+	struct rdt_resource	*parent;
+	struct rdt_resource	*child;
 	unsigned long		fflags;
 	bool			cdp_capable;
 };
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 2fa04375da8c..1f5d7ee0096e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -552,6 +552,18 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	d->id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
+	if (r->parent) {
+		struct rdt_domain *pd;
+
+		list_for_each_entry(pd, &r->parent->domains, list)
+			if (cpumask_test_cpu(cpu, &pd->cpu_mask))
+				goto found;
+		WARN_ON_ONCE(1);
+		return;
+found:
+		d->parent = pd->id;
+	}
+
 	rdt_domain_reconfigure_cdp(r);
 
 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a2ebd7e051bb..f2af58a791a4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1107,9 +1107,11 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 				 struct seq_file *seq, void *v)
 {
 	struct rdt_resource *r = of->kn->parent->priv;
+	struct list_head *evt_list;
 	struct mon_evt *mevt;
 
-	list_for_each_entry(mevt, &r->evt_list, list) {
+	evt_list = r->child ? &r->child->evt_list : &r->evt_list;
+	list_for_each_entry(mevt, evt_list, list) {
 		seq_printf(seq, "%s\n", mevt->name);
 		if (mevt->configurable)
 			seq_printf(seq, "%s_config\n", mevt->name);
-- 
2.44.0



* [PATCH 06/10] x86/resctrl: Update mkdir_mondata_subdir() for Sub-NUMA domains
From: Tony Luck @ 2024-03-27 20:03 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

When Sub-NUMA cluster mode is enabled, all monitoring happens at the
SUBL3 resource level. Changes are required to populate the "mon_data"
directories in the resctrl file system.

When making the first SUBL3 directory, the code must first create the
parent mon_L3_XX directory and fill it with files for each of the
monitoring functions. Subsequent SUBL3 creation requests simply use
the directory that was created by the first request.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h | 12 +++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 74 ++++++++++++++++++--------
 2 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 21d81f51838f..7f05502454c5 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -456,6 +456,18 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 	return container_of(r, struct rdt_hw_resource, r_resctrl);
 }
 
+static inline struct rdt_domain *get_parent_domain(struct rdt_resource *r,
+						   struct rdt_domain *d)
+{
+	struct  rdt_domain *pd;
+
+	list_for_each_entry(pd, &r->parent->domains, list)
+		if (pd->id == d->parent)
+			return pd;
+
+	return NULL;
+}
+
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 	      struct rdt_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f2af58a791a4..22effd8dc207 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3026,31 +3026,16 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 	}
 }
 
-static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
-				struct rdt_resource *r, struct rdtgroup *prgrp)
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain *d,
+			     struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
-	struct kernfs_node *kn;
 	struct mon_evt *mevt;
 	struct rmid_read rr;
-	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
-	/* create the directory */
-	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
-	if (IS_ERR(kn))
-		return PTR_ERR(kn);
-
-	ret = rdtgroup_kn_set_ugid(kn);
-	if (ret)
-		goto out_destroy;
-
-	if (WARN_ON(list_empty(&r->evt_list))) {
-		ret = -EPERM;
-		goto out_destroy;
-	}
+	if (WARN_ON(list_empty(&r->evt_list)))
+		return -EPERM;
 
 	priv.u.rid = r->rid;
 	priv.u.domid = d->id;
@@ -3058,11 +3043,58 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
-			goto out_destroy;
+			return ret;
 
 		if (is_mbm_event(mevt->evtid))
 			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
 	}
+
+	return 0;
+}
+
+static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
+				struct rdt_domain *d,
+				struct rdt_resource *r, struct rdtgroup *prgrp)
+{
+	struct kernfs_node *kn = parent_kn, *ckn;
+	struct rdt_domain *pd;
+	char name[32];
+	int ret;
+
+	if (r->parent) {
+		pd = get_parent_domain(r, d);
+		if (!pd)
+			return 0;
+		sprintf(name, "mon_%s_%02d", r->parent->name, pd->id);
+		kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
+		if (kn) {
+			parent_kn = kn;
+			goto subdir;
+		}
+		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+		if (IS_ERR(kn))
+			return PTR_ERR(kn);
+		ret = rdtgroup_kn_set_ugid(kn);
+		if (ret)
+			goto out_destroy;
+		ret = mon_add_all_files(kn, d, r, prgrp);
+		if (ret)
+			goto out_destroy;
+	}
+subdir:
+	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	/* create the directory */
+	ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
+	if (IS_ERR(ckn))
+		goto out_destroy;
+
+	ret = rdtgroup_kn_set_ugid(ckn);
+	if (ret)
+		goto out_destroy;
+
+	ret = mon_add_all_files(ckn, d, r, prgrp);
+	if (ret)
+		goto out_destroy;
 	kernfs_activate(kn);
 	return 0;
 
@@ -3078,8 +3110,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 					   struct rdt_domain *d)
 {
-	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
+	struct kernfs_node *parent_kn;
 	struct list_head *head;
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-- 
2.44.0


* [PATCH 07/10] x86/resctrl: Update rmdir_mondata_subdir_allrdtgrp() for Sub-NUMA domains
  2024-03-27 20:03 [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (5 preceding siblings ...)
  2024-03-27 20:03 ` [PATCH 06/10] x86/resctrl: Update mkdir_mondata_subdir() for Sub-NUMA domains Tony Luck
@ 2024-03-27 20:03 ` Tony Luck
  2024-03-27 20:03 ` [PATCH 08/10] x86/resctrl: Mark L3 monitor files with summation flag Tony Luck
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Tony Luck @ 2024-03-27 20:03 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

The old code could simply remove the mon_L3_XX directory and all of
its contents when a domain was taken offline. With Sub-NUMA cluster
enabled this becomes a multi-stage process.

While the parent L3 domain is still online, the offline process for a
SUBL3 domain removes just one directory from mon_L3_XX. When the last
such domain goes offline, the whole mon_L3_XX directory must be
removed.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 46 ++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 22effd8dc207..399dc3175292 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3012,17 +3012,49 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
  * and monitor groups with given domain id.
  */
 static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   unsigned int dom_id)
+					   struct rdt_domain *d)
 {
 	struct rdtgroup *prgrp, *crgrp;
-	char name[32];
+	char pname[32], name[32];
+	bool remove_all = false;
+	struct kernfs_node *kn;
+	struct rdt_domain *pd;
+
+	sprintf(name, "mon_%s_%02d", r->name, d->id);
+
+	if (!r->parent) {
+		list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+			kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+
+			list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
+				kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+		}
+		return;
+	}
+
+	sprintf(pname, "mon_%s_%02d", r->parent->name, d->parent);
+	pd = get_parent_domain(r, d);
+	if (!pd)
+		remove_all = true;
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		sprintf(name, "mon_%s_%02d", r->name, dom_id);
-		kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+		if (remove_all) {
+			kernfs_remove_by_name(prgrp->mon.mon_data_kn, pname);
+		} else {
+			kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, pname, NULL);
+			if (kn)
+				kernfs_remove_by_name(kn, name);
+		}
 
-		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
-			kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+			if (remove_all) {
+				kernfs_remove_by_name(crgrp->mon.mon_data_kn, pname);
+			} else {
+				kn = kernfs_find_and_get_ns(crgrp->mon.mon_data_kn, pname, NULL);
+				if (kn)
+					kernfs_remove_by_name(kn, name);
+			}
+		}
 	}
 }
 
@@ -3979,7 +4011,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.44.0


* [PATCH 08/10] x86/resctrl: Mark L3 monitor files with summation flag.
  2024-03-27 20:03 [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (6 preceding siblings ...)
  2024-03-27 20:03 ` [PATCH 07/10] x86/resctrl: Update rmdir_mondata_subdir_allrdtgrp() " Tony Luck
@ 2024-03-27 20:03 ` Tony Luck
  2024-03-27 20:03 ` [PATCH 09/10] x86/resctrl: Update __mon_event_count() for Sub-NUMA domains Tony Luck
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Tony Luck @ 2024-03-27 20:03 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

When Sub-NUMA cluster mode is enabled, the parent monitor files in the
mon_L3_XX directories need to report the sum across all child SUBL3
domains.

Add a flag to the parent files and pass it down to the routines that
read values from the hardware RMID counters.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h    | 5 ++++-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 8 +++++---
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 7f05502454c5..4263fbdf7ee7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -130,6 +130,7 @@ struct mon_evt {
  * @priv:              Used to store monitoring event data in @u
  *                     as kernfs private data
  * @rid:               Resource id associated with the event file
+ * @sumdomains:        Report total of all SNC domains sharing an L3 cache
  * @evtid:             Event id associated with the event file
  * @domid:             The domain to which the event file belongs
  * @u:                 Name of the bit fields struct
@@ -137,7 +138,8 @@ struct mon_evt {
 union mon_data_bits {
 	void *priv;
 	struct {
-		unsigned int rid		: 10;
+		unsigned int rid		: 9;
+		unsigned int sumdomains		: 1;
 		enum resctrl_event_id evtid	: 8;
 		unsigned int domid		: 14;
 	} u;
@@ -149,6 +151,7 @@ struct rmid_read {
 	struct rdt_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
+	bool			sumdomains;
 	int			err;
 	u64			val;
 	void			*arch_mon_ctx;
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 2bf021d42500..fedcf50df6e5 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -574,6 +574,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	resid = md.u.rid;
 	domid = md.u.domid;
 	evtid = md.u.evtid;
+	rr.sumdomains = md.u.sumdomains;
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 399dc3175292..3e7f2e36b71e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3059,7 +3059,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain *d,
-			     struct rdt_resource *r, struct rdtgroup *prgrp)
+			     struct rdt_resource *r, struct rdtgroup *prgrp,
+			     bool sumdomains)
 {
 	union mon_data_bits priv;
 	struct mon_evt *mevt;
@@ -3071,6 +3072,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain *d,
 
 	priv.u.rid = r->rid;
 	priv.u.domid = d->id;
+	priv.u.sumdomains = sumdomains;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3109,7 +3111,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 		ret = rdtgroup_kn_set_ugid(kn);
 		if (ret)
 			goto out_destroy;
-		ret = mon_add_all_files(kn, d, r, prgrp);
+		ret = mon_add_all_files(kn, d, r, prgrp, true);
 		if (ret)
 			goto out_destroy;
 	}
@@ -3124,7 +3126,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	if (ret)
 		goto out_destroy;
 
-	ret = mon_add_all_files(ckn, d, r, prgrp);
+	ret = mon_add_all_files(ckn, d, r, prgrp, false);
 	if (ret)
 		goto out_destroy;
 	kernfs_activate(kn);
-- 
2.44.0


* [PATCH 09/10] x86/resctrl: Update __mon_event_count() for Sub-NUMA domains
  2024-03-27 20:03 [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (7 preceding siblings ...)
  2024-03-27 20:03 ` [PATCH 08/10] x86/resctrl: Mark L3 monitor files with summation flag Tony Luck
@ 2024-03-27 20:03 ` Tony Luck
  2024-03-27 20:03 ` [PATCH 10/10] x86/resctrl: Determine Sub-NUMA configuration Tony Luck
  2024-05-02 17:00 ` [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
  10 siblings, 0 replies; 13+ messages in thread
From: Tony Luck @ 2024-03-27 20:03 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

Check the flag to see if the request is for a file that must
calculate the sum of RMID counters from all sibling domains.

Note that when reading RMID counters to sum across SUBL3 domains
it is OK to read from any CPU in the parent L3 domain.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h               |  2 +
 arch/x86/kernel/cpu/resctrl/monitor.c | 56 ++++++++++++++++++++++++---
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index b7ade7627e08..4cf256053f53 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -177,6 +177,7 @@ enum resctrl_scope {
  * @evt_list:		List of monitoring events
  * @parent:		Parent of this resource
  * @child:		Child of this resource
+ * @num_siblings:	Number of sibling domains that share a parent
  * @fflags:		flags to choose base and info files
  * @cdp_capable:	Is the CDP feature available on this resource
  */
@@ -200,6 +201,7 @@ struct rdt_resource {
 	struct list_head	evt_list;
 	struct rdt_resource	*parent;
 	struct rdt_resource	*child;
+	int			num_siblings;
 	unsigned long		fflags;
 	bool			cdp_capable;
 };
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 95455cb187eb..1ba40d5f5d77 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -279,12 +279,24 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
-	int ret;
+	int cpu, ret;
 
 	resctrl_arch_rmid_read_context_check();
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
-		return -EINVAL;
+	/* This sanity check is now heavyweight. Keep it? */
+	cpu = smp_processor_id();
+	if (r->parent) {
+		struct rdt_domain *pd;
+
+		pd = rdt_find_domain(r->parent, d->parent, NULL);
+		if (WARN_ON_ONCE(!pd))
+			return -EINVAL;
+		if (!cpumask_test_cpu(cpu, &pd->cpu_mask))
+			return -EINVAL;
+	} else {
+		if (!cpumask_test_cpu(cpu, &d->cpu_mask))
+			return -EINVAL;
+	}
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
 	if (ret)
@@ -538,7 +550,7 @@ static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
 	}
 }
 
-static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr, u64 *rrval)
 {
 	struct mbm_state *m;
 	u64 tval = 0;
@@ -556,11 +568,44 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 	if (rr->err)
 		return rr->err;
 
-	rr->val += tval;
+	*rrval += tval;
 
 	return 0;
 }
 
+static u32 get_node_rmid(struct rdt_resource *r, struct rdt_domain *d, u32 rmid)
+{
+	int cpu = cpumask_any(&d->cpu_mask);
+
+	return rmid + (cpu_to_node(cpu) % r->num_siblings) * r->num_rmid;
+}
+
+static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+{
+	struct rdt_domain *d;
+	struct rmid_read tmp;
+	u32 node_rmid;
+	int ret = 0;
+
+	if (!rr->sumdomains) {
+		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
+		return ___mon_event_count(closid, node_rmid, rr, &rr->val);
+	}
+
+	tmp = *rr;
+	list_for_each_entry(d, &rr->r->domains, list) {
+		if (d->parent == rr->d->parent) {
+			tmp.d = d;
+			node_rmid = get_node_rmid(rr->r, d, rmid);
+			ret = ___mon_event_count(closid, node_rmid, &tmp, &rr->val);
+			if (ret)
+				break;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * mbm_bw_count() - Update bw count from values previously read by
  *		    __mon_event_count().
@@ -1039,6 +1084,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	 */
 	resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
 
+	r->num_siblings = 1;
 	rdt_l3_mon_resource = r;
 
 	ret = dom_data_init(rdt_l3_mon_resource);
-- 
2.44.0


* [PATCH 10/10] x86/resctrl: Determine Sub-NUMA configuration
  2024-03-27 20:03 [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (8 preceding siblings ...)
  2024-03-27 20:03 ` [PATCH 09/10] x86/resctrl: Update __mon_event_count() for Sub-NUMA domains Tony Luck
@ 2024-03-27 20:03 ` Tony Luck
  2024-05-02 17:00 ` [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
  10 siblings, 0 replies; 13+ messages in thread
From: Tony Luck @ 2024-03-27 20:03 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches, Tony Luck

There is no explicit enumeration of Sub-NUMA cluster mode. Instead,
derive it from the ratio of the number of CPUs that share an L3 cache
instance with CPU 0 to the number of CPUs that share a NUMA node with
CPU 0.

When Sub-NUMA cluster mode is enabled, adjust the number of RMIDs and
the scaling factor, and set up the parent/child pointers in the L3 and
SUBL3 rdt_resource structures.

As each Sub-NUMA domain is brought online, update MSR_RMID_SNC_CONFIG
to remap the RMID counters.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/monitor.c  | 68 ++++++++++++++++++++++++--
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 26 ++++++++++
 3 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 05956bd8bacf..b54c26016c93 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1158,6 +1158,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 1ba40d5f5d77..757d475158a3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -17,6 +17,7 @@
 
 #include <linux/cpu.h>
 #include <linux/module.h>
+#include <linux/cacheinfo.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
 
@@ -1051,16 +1052,59 @@ static void l3_mon_evt_init(struct rdt_resource *r)
 		list_add_tail(&mbm_local_event.list, &r->evt_list);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
+	{}
+};
+
+static __init int snc_get_config(void)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(0);
+	cpumask_t *l3_cpumask = NULL;
+	const cpumask_t	*node0_cpumask;
+	int i;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	for (i = 0; i < ci->num_leaves; i++) {
+		if (ci->info_list[i].level == 3) {
+			if (ci->info_list[i].attributes & CACHE_ID) {
+				l3_cpumask = &ci->info_list[i].shared_cpu_map;
+				break;
+			}
+		}
+	}
+	if (!l3_cpumask) {
+		pr_info("can't get CPU0 L3 mask\n");
+		return 1;
+	}
+
+	node0_cpumask = cpumask_of_node(cpu_to_node(0));
+
+	return bitmap_weight(cpumask_bits(l3_cpumask), nr_cpu_ids) /
+	       bitmap_weight(cpumask_bits(node0_cpumask), nr_cpu_ids);
+}
+
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 {
 	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	unsigned int threshold;
+	int snc_ways;
 	int ret;
 
+	snc_ways = snc_get_config();
+	if (snc_ways > 1)
+		pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", snc_ways);
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
@@ -1084,8 +1128,23 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	 */
 	resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
 
-	r->num_siblings = 1;
-	rdt_l3_mon_resource = r;
+	if (snc_ways > 1) {
+		struct rdt_hw_resource *shw_res;
+
+		rdt_l3_mon_resource = &rdt_resources_all[RDT_RESOURCE_SUBL3].r_resctrl;
+		rdt_l3_mon_resource->num_rmid = r->num_rmid;
+		rdt_l3_mon_resource->num_siblings = snc_ways;
+
+		shw_res = resctrl_to_arch_res(rdt_l3_mon_resource);
+		shw_res->mon_scale = hw_res->mon_scale;
+		shw_res->mbm_width = hw_res->mbm_width;
+
+		r->child = rdt_l3_mon_resource;
+		rdt_l3_mon_resource->parent = r;
+	} else {
+		r->num_siblings = 1;
+		rdt_l3_mon_resource = r;
+	}
 
 	ret = dom_data_init(rdt_l3_mon_resource);
 	if (ret)
@@ -1110,6 +1169,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 
 	l3_mon_evt_init(rdt_l3_mon_resource);
 
+	r->mon_capable = true;
 	rdt_l3_mon_resource->mon_capable = true;
 	rdt_l3_mon_resource->mon_enabled = true;
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3e7f2e36b71e..b1f79fafa333 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -4067,6 +4067,29 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	int err = 0;
@@ -4082,6 +4105,9 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (!r->mon_enabled)
 		goto out_unlock;
 
+	if (r->num_siblings > 1)
+		snc_remap_rmids();
+
 	err = domain_setup_mon_state(r, d);
 	if (err)
 		goto out_unlock;
-- 
2.44.0


* Re: [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems
  2024-03-27 20:03 [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (9 preceding siblings ...)
  2024-03-27 20:03 ` [PATCH 10/10] x86/resctrl: Determine Sub-NUMA configuration Tony Luck
@ 2024-05-02 17:00 ` Reinette Chatre
  2024-05-03 15:38   ` Luck, Tony
  10 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2024-05-02 17:00 UTC (permalink / raw
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini
  Cc: x86, linux-kernel, patches

Hi Tony,

On 3/27/2024 1:03 PM, Tony Luck wrote:
> This series on top of v6.9-rc1 plus these two patches:
> 
> Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@intel.com/
> 
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features.  Prior to this
> patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPU support an MSR that can partition the RMID counters in
> the same way. This allows monitoring features to be used. With the caveat
> that users must be aware that Linux may migrate tasks more frequently
> between SNC nodes than between "regular" NUMA nodes, so reading counters
> from all SNC nodes may be needed to get a complete picture of activity
> for tasks.
> 
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
> 
> This is a new approach triggered by the discussions that started with
> "How can users tell that SNC is enabled?" but then drifted into
> whether users of the legacy interface would really get what they
> expected when reading from monitor files in the mon_L3_* directories.
> 
> During that discussion I'd mentioned providing monitor values for both
> the L3 level, and also for each SNC node. That would provide full ABI
> compatibility while also giving the finer grained reporting from each
> SNC node.
> 
> Implementation sets up a new rdt_resource to track monitor resources
> with domains for each SNC node. This resource is only used when SNC
> mode is detected.
> 
> On SNC systems there is a parent-child relationship between the
> old L3 resource and the new SUBL3 resource. Reading from legacy
> files like mon_data/mon_L3_00/llc_occupancy reads and sums the RMID
> counters from all "child" domains in the SUBL3 resource. E.g. on
> an SNC3 system:
> 
> $ grep . mon_L3_01/llc_occupancy mon_L3_01/*/llc_occupancy
> mon_L3_01/llc_occupancy:413097984
> mon_L3_01/mon_SUBL3_03/llc_occupancy:141484032
> mon_L3_01/mon_SUBL3_04/llc_occupancy:135659520
> mon_L3_01/mon_SUBL3_05/llc_occupancy:135954432
> 
> So the L3 occupancy shows the total L3 occupancy which is
> the sum of the cache occupancy on each of the SNC nodes
> that share that L3 cache instance.
> 

This is back to the "duplicate a resource" solution that does the
unnecessary duplication of an entire resource and splits data structure
use between two resources for each resource to manage one portion.
My objections to this foundation on which this solution is
built remains.

This solution continues to build on this foundation in a way that
I find inappropriate. This solution wedges a new architectural concept
("parent" and "child" resources) into resctrl but does not integrate it.
All resources continue to be managed the same (not hierarchical) while
this artificial relationship is only selectively used when convenient
for SNC.

This solution also brings back a concept that was removed and yet
claims that it is "new" (bab6ee736873 ("x86/resctrl: Merge mon_capable
and mon_enabled")). Having gone through effort to clean this up deserves
a thorough justification why it is necessary to bring it back. I do
not see a justification though since it is used as a workaround
for resctrl to be able to deal with the unnecessary duplication of the
monitoring resource. Same thing for the rdt_l3_mon_resource global
pointer introduced here that is another workaround to deal with the
unnecessary duplication. These "tricks" to get this solution to work
makes this all very difficult to understand and maintain.

Using a "parent" resource to decide if one domain is a "child" of
another domain looks arbitrary to me. Just because the CPU mask of one
domain of one resource is a subset of the CPU mask of another domain of
another resource does not make the domain a child domain of the other.

It looks to me as though the "info" directory will now display two
monitoring resources to user space? I find that to be confusing and
the data exposed in that way is not clear to me ... for example,
both will now allow setting the global max_threshold_occupancy? How
should the number of RMIDs be interpreted?

As I understand this rewrite aims to support the new file
hierarchy and to be able to have higher level files contain the sum of
the files below. I do not think a rewrite is necessary to
support this. I do continue to believe that the previous SNC enabling
is appropriate and as I understand support for this new display to user
space can be added to it. For example, could this not be accomplished
by adding one new member to struct rdt_resource that specifies the scope at
which to display the monitor data to user space? mkdir_mondata_subdir() can
use this field to decide how to create the files ... if the "display"
scope is the same as the monitoring scope then files are created as
it is now, if the "display" scope is different (required to be a superset)
then the files are created within a directory that matches the "display"
scope. I do not see why it would be required to have to work from a
stored CPU mask to determine relationships. I believe the id of the
"display" scope to create a sub-directory in can be determined dynamically
at this time. Similarly, the metadata of the files are used to indicate
when a sum is required.

Reinette


* RE: [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems
  2024-05-02 17:00 ` [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
@ 2024-05-03 15:38   ` Luck, Tony
  0 siblings, 0 replies; 13+ messages in thread
From: Luck, Tony @ 2024-05-03 15:38 UTC (permalink / raw
  To: Chatre, Reinette, Yu, Fenghua, Wieczor-Retman, Maciej,
	Peter Newman, James Morse, Babu Moger, Drew Fustini
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> As I understand it, this rewrite aims to support the new file
> hierarchy and to be able to have higher-level files contain the sum of
> the files below. I do not think a rewrite is necessary to
> support this. I continue to believe that the previous SNC enabling
> is appropriate and, as I understand it, support for this new display to user
> space can be added to it. For example, could this not be accomplished
> by adding one new member to struct rdt_resource that specifies the scope at
> which to display the monitor data to user space? mkdir_mondata_subdir() can
> use this field to decide how to create the files ... if the "display"
> scope is the same as the monitoring scope then files are created as
> they are now; if the "display" scope is different (required to be a superset)
> then the files are created within a directory that matches the "display"
> scope. I do not see why it would be necessary to work from a
> stored CPU mask to determine relationships. I believe the id of the
> "display" scope to create a sub-directory in can be determined dynamically
> at this time. Similarly, the metadata of the files is used to indicate
> when a sum is required.

Reinette,

I was concerned about dynamically figuring out the relationships. In
particular, I thought it would be much easier to track addition/removal
of domains at both the SNC node and L3 cache level if I used the existing
CPU hotplug tracking functions. Hence the separate rdt_resources for each.

But after reading your suggestions above, I experimented with building
on top of the v17 series. It doesn't look anywhere near as complex as I
had imagined. I'm going to pursue that today.

Thanks

-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-05-03 15:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-27 20:03 [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2024-03-27 20:03 ` [PATCH 01/10] x86/resctrl: Prepare for new domain scope Tony Luck
2024-03-27 20:03 ` [PATCH 02/10] x86/resctrl: Add new rdt_resource for sub-node monitoring Tony Luck
2024-03-27 20:03 ` [PATCH 03/10] x86/resctrl: Add new "enabled" state for monitor resources Tony Luck
2024-03-27 20:03 ` [PATCH 04/10] x86/resctrl: Add pointer to enabled monitor resource Tony Luck
2024-03-27 20:03 ` [PATCH 05/10] x86/resctrl: Add parent/child information to rdt_resource and rdt_domain Tony Luck
2024-03-27 20:03 ` [PATCH 06/10] x86/resctrl: Update mkdir_mondata_subdir() for Sub-NUMA domains Tony Luck
2024-03-27 20:03 ` [PATCH 07/10] x86/resctrl: Update rmdir_mondata_subdir_allrdtgrp() " Tony Luck
2024-03-27 20:03 ` [PATCH 08/10] x86/resctrl: Mark L3 monitor files with summation flag Tony Luck
2024-03-27 20:03 ` [PATCH 09/10] x86/resctrl: Update __mon_event_count() for Sub-NUMA domains Tony Luck
2024-03-27 20:03 ` [PATCH 10/10] x86/resctrl: Determine Sub-NUMA configuration Tony Luck
2024-05-02 17:00 ` [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems Reinette Chatre
2024-05-03 15:38   ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).