[PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems


* [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems
@ 2024-05-03 20:33 Tony Luck
  2024-05-03 20:33 ` [PATCH v17 1/9] x86/resctrl: Prepare for new domain scope Tony Luck
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Note: Jump straight to patch 7 for the new stuff. Just minor tweaks in
other patches.

This series is based on top of the TIP x86/cache branch:
 931be446c6cb ("x86/resctrl: Add tracepoint for llc_occupancy tracking")

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
series, Intel advised that SNC and RDT were incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows monitoring features to be used. Legacy
monitoring files provide the sum of counters from each SNC node for
backwards compatibility. Additional files per SNC node provide details
per node.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.
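
As a rough illustration of the "legacy files provide the sum" behaviour
described above, the following userspace sketch adds up per-SNC-node
counts for one L3 cache. The structure and function names are
hypothetical and do not come from this series; the real implementation
is in patch 7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-SNC-node reading; not a structure from this series. */
struct snc_node_count {
	int      node_id;	/* SNC node (monitor domain) id */
	uint64_t count;		/* e.g. bytes of LLC occupancy seen by this node */
};

/*
 * Legacy-style value: sum the per-node counts of all SNC nodes that
 * share one L3 cache.
 */
static uint64_t l3_legacy_sum(const struct snc_node_count *nodes, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	assert(nodes != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += nodes[i].count;
	return sum;
}
```

The per-node values stay available individually; only the legacy view
aggregates them.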

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since v16: https://lore.kernel.org/all/20240312214247.91772-1-tony.luck@intel.com/

Patch 1: Reinette pointed out that rdt_find_domain() no longer returns ERR_PTR()
but one of the callers was still checking return with IS_ERR().

Patch 2: Tip tree added a tracing patch. That needed s/d->id/d->hdr.id/

Patch 3: Reinette: Keep the "RCU" in the kerneldoc description of the
domain list fields after the split into separate ctrl/mon lists.

Patch 4: No change

Patch 5: No change

Patch 6: Drop the change that divided output of the resctrl "size" file
by the number of SNC domains per L3 cache. Now that this series
preserves the contents of the legacy llc_occupancy files this isn't
useful.

Patch 7: NEW in this series. Add per-SNC domain monitor files while
making the original files sum across SNC nodes.

Patch 8: (formerly 7) No change

Patch 9: (formerly 8) Add documentation for new per-SNC directories and files

Tony Luck (9):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC)
    monitoring
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  17 +
 include/linux/resctrl.h                   |  89 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  72 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 430 ++++++++++++++++++----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  57 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  98 +++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 263 ++++++++-----
 9 files changed, 759 insertions(+), 294 deletions(-)

base-commit: 931be446c6cbc15691dd499957e961f4e1d56afb
-- 
2.44.0


* [PATCH v17 1/9] x86/resctrl: Prepare for new domain scope
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-03 20:33 ` [PATCH v17 2/9] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the callsites.
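
The calling convention described above (reject an invalid id first, then
look up in an id-sorted list that returns either a domain or NULL) can
be sketched in plain userspace C. This is only an illustration: the
demo_* names and the singly linked list are assumptions made for the
example, while the kernel code uses struct rdt_domain on a list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a domain; hypothetical, for illustration only. */
struct demo_domain {
	int id;
	struct demo_domain *next;
};

/*
 * Search an id-sorted list. Returns the matching domain or NULL, never
 * an error pointer. If @pos is non-NULL it receives the link where a
 * new domain with this id would be inserted.
 */
static struct demo_domain *demo_find_domain(struct demo_domain **head, int id,
					    struct demo_domain ***pos)
{
	struct demo_domain **link = head;

	assert(id >= 0);	/* callers reject invalid ids beforehand */
	while (*link && (*link)->id < id)
		link = &(*link)->next;
	if (pos)
		*pos = link;
	return (*link && (*link)->id == id) ? *link : NULL;
}

/* Add @d with @id unless it already exists; returns -1 for an invalid id. */
static int demo_add_domain(struct demo_domain **head, struct demo_domain *d,
			   int id)
{
	struct demo_domain **pos;

	if (id < 0)
		return -1;	/* report the bad id before any lookup */
	if (demo_find_domain(head, id, &pos))
		return 0;	/* already present */
	d->id = id;
	d->next = *pos;
	*pos = d;
	return 1;
}
```

Because the id check happens in the caller, the lookup has a single
failure mode (NULL), so callers only need a plain "if (!d)" test.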

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  9 ++++-
 arch/x86/kernel/cpu/resctrl/core.c        | 46 ++++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..ed693bfe474d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,13 +150,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		RCU list of all domains for this resource
@@ -174,7 +179,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7751eea19fd2..4c5e985e1388 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -68,7 +68,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -82,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -96,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -108,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -392,9 +392,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -484,6 +481,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -499,7 +509,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -507,12 +517,14 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	lockdep_assert_held(&domain_list_lock);
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
+
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -552,15 +564,21 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b7291f60399c..2bf021d42500 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -577,7 +577,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 492c8e28c4ce..e3ee5c9d9f08 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 	plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
 
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == plr->s->res->cache_level) {
+		if (ci->info_list[i].level == scope) {
 			plr->line_size = ci->info_list[i].coherency_line_size;
 			return 0;
 		}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 02f213f1c51c..b8588ce88eef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1454,10 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
-- 
2.44.0



* [PATCH v17 2/9] x86/resctrl: Prepare to split rdt_domain structure
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2024-05-03 20:33 ` [PATCH v17 1/9] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-03 20:33 ` [PATCH v17 3/9] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
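
To show why a shared header helps, here is a minimal userspace sketch in
which two different domain types embed the same header, so one lookup
routine can serve both. All demo_* names are hypothetical, the field set
is reduced, and a simple next pointer stands in for the kernel's
list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced version of a common domain header. */
struct demo_hdr {
	struct demo_hdr *next;
	int id;
};

/* Two illustrative domain types that embed the common header. */
struct demo_ctrl_domain {
	struct demo_hdr hdr;
	unsigned int ctrl_val;
};

struct demo_mon_domain {
	struct demo_hdr hdr;
	unsigned long long mbm_total;
};

/* Recover the enclosing structure from a pointer to its header. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Type-independent search: only touches fields in the header. */
static struct demo_hdr *demo_find_hdr(struct demo_hdr *head, int id)
{
	assert(id >= 0);
	for (; head; head = head->next)
		if (head->id == id)
			return head;
	return NULL;
}
```

A caller that knows which list it walked converts the returned header
back with demo_container_of(), for example to a struct demo_mon_domain.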

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 24 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 14 +++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 81 insertions(+), 73 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ed693bfe474d..f63fcf17a3bc 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -59,10 +59,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -77,9 +87,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4c5e985e1388..7c15959c2768 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -355,9 +355,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -393,12 +393,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -526,7 +526,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	d = rdt_find_domain(r, id, &add_pos);
 
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -537,8 +537,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -552,11 +552,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail_rcu(&d->list, add_pos);
+	list_add_tail_rcu(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del_rcu(&d->list);
+		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
 		domain_free(hw_dom);
 	}
@@ -583,10 +583,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del_rcu(&d->list);
+		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
 
 		/*
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 2bf021d42500..6246f48b0449 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -148,7 +148,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -231,8 +231,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -280,7 +280,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -330,7 +330,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 			}
 		}
 		if (msr_param.res)
-			smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+			smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
 	}
 
 	return 0;
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	lockdep_assert_cpus_held();
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -460,7 +460,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -489,7 +489,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -537,7 +537,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		return;
 	}
 
-	cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);
+	cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
 
 	/*
 	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -546,7 +546,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	 * counters on some platforms if its called in IRQ context.
 	 */
 	if (tick_nohz_full_cpu(cpu))
-		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+		smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 	else
 		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2345e6836593..ab8a198d88b3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -281,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 
 	resctrl_arch_rmid_read_context_check();
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -364,7 +364,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 			 * CLOSID and RMID because there may be dependencies between them
 			 * on some architectures.
 			 */
-			trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
+			trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val);
 		}
 
 		if (force_free || !rmid_dirty) {
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/*
 		 * For the first limbo RMID in the domain,
 		 * setup up the limbo worker.
@@ -801,7 +801,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	__check_limbo(d, false);
 
 	if (has_busy_rmid(d)) {
-		d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+		d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
 							   RESCTRL_PICK_ANY_CPU);
 		schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo,
 					 delay);
@@ -825,7 +825,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+	cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
 	dom->cqm_work_cpu = cpu;
 
 	if (cpu < nr_cpu_ids)
@@ -868,7 +868,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	 * Re-check for housekeeping CPUs. This allows the overflow handler to
 	 * move off a nohz_full CPU quickly.
 	 */
-	d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
 						   RESCTRL_PICK_ANY_CPU);
 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
 
@@ -897,7 +897,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
 	 */
 	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		return;
-	cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+	cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
 	dom->mbm_work_cpu = cpu;
 
 	if (cpu < nr_cpu_ids)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e3ee5c9d9f08..e07ee41c237d 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -859,10 +859,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -870,7 +870,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1202,7 +1202,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1532,7 +1532,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b8588ce88eef..e6e2753738c9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -317,7 +317,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -1021,12 +1021,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1458,7 +1458,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1506,7 +1506,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1518,7 +1518,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1536,7 +1536,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1596,7 +1596,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1608,7 +1608,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1616,7 +1616,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1682,7 +1682,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1732,8 +1732,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
 		}
@@ -2280,14 +2280,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2316,7 +2316,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2365,7 +2365,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2704,7 +2704,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
 	}
@@ -2831,13 +2831,13 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
 		msr_param.dom = d;
-		smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+		smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
 	}
 
 	return 0;
@@ -3035,7 +3035,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3051,7 +3051,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3261,7 +3261,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3284,7 +3284,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3299,7 +3299,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3945,7 +3945,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.44.0



* [PATCH v17 3/9] x86/resctrl: Prepare for different scope for control/monitor operations
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2024-05-03 20:33 ` [PATCH v17 1/9] x86/resctrl: Prepare for new domain scope Tony Luck
  2024-05-03 20:33 ` [PATCH v17 2/9] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-03 20:33 ` [PATCH v17 4/9] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes for the two (specifically,
Intel needs to split the RDT_RESOURCE_L3 resource to use L3 scope for
cache control and NODE scope for cache occupancy and memory bandwidth
monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
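
As a rough illustration of the idea (not the kernel code itself), the
following userspace sketch keeps two independent, id-sorted lists per
resource and tags each entry with its type, so that a lookup on one list
can be checked before the header is converted back to the full domain.
All names here (dom_hdr, domain, resource, add_domain, find_domain,
hdr_to_domain) are hypothetical stand-ins; the patch itself uses
struct list_head, RCU list helpers and container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; illustrative only. */
enum dom_type { CTRL_DOMAIN, MON_DOMAIN };

struct dom_hdr {
	struct dom_hdr *next;	/* singly linked, sorted by ascending id */
	int id;
	enum dom_type type;
};

struct domain {
	struct dom_hdr hdr;	/* common header embedded in every domain */
	int payload;		/* placeholder for control/monitor state */
};

struct resource {
	struct dom_hdr *ctrl_domains;	/* e.g. one entry per L3 cache */
	struct dom_hdr *mon_domains;	/* e.g. one entry per SNC node */
};

/* Insert a domain into one list, keeping the list sorted by id. */
static void add_domain(struct dom_hdr **head, struct domain *d)
{
	struct dom_hdr **pp = head;

	while (*pp && (*pp)->id < d->hdr.id)
		pp = &(*pp)->next;
	/* Ids are expected to be unique within a single list. */
	assert(!*pp || (*pp)->id != d->hdr.id);
	d->hdr.next = *pp;
	*pp = &d->hdr;
}

/* Search one list for id; stop early once a larger id is seen. */
static struct dom_hdr *find_domain(struct dom_hdr *head, int id)
{
	for (struct dom_hdr *h = head; h; h = h->next) {
		if (h->id == id)
			return h;
		if (h->id > id)
			break;
	}
	return NULL;
}

/* Convert a header back to its domain only if the type is as expected. */
static struct domain *hdr_to_domain(struct dom_hdr *h, enum dom_type want)
{
	if (!h || h->type != want)
		return NULL;
	return (struct domain *)((char *)h - offsetof(struct domain, hdr));
}
```

With this layout a resource can hold a single control domain with id 0
alongside monitor domains with ids 0 and 1, and a failure to set up one
kind of domain leaves the other list untouched.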

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   7 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 224 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  60 +++---
 7 files changed, 240 insertions(+), 96 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f63fcf17a3bc..96ddf9ff3183 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -58,15 +58,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -169,10 +176,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		RCU list of all domains for this resource
+ * @ctrl_domains:	RCU list of all control domains for this resource
+ * @mon_domains:	RCU list of all monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -187,10 +196,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -236,8 +247,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 void resctrl_online_cpu(unsigned int cpu);
 void resctrl_offline_cpu(unsigned int cpu);
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index f1d926832ec8..377679b79919 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -558,8 +558,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -578,7 +578,8 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7c15959c2768..66a5a270d66f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,7 +60,8 @@ static void mba_wrmsr_intel(struct msr_param *m);
 static void cat_wrmsr(struct msr_param *m);
 static void mba_wrmsr_amd(struct msr_param *m);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -68,8 +69,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -82,8 +85,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -96,8 +99,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -108,8 +111,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -349,13 +352,28 @@ static void cat_wrmsr(struct msr_param *m)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		/* Find the domain that contains this CPU */
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
+			return d;
+	}
+
+	return NULL;
+}
+
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+{
+	struct rdt_domain *d;
+
+	lockdep_assert_cpus_held();
+
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -379,26 +397,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain; otherwise return NULL. When the domain id
+ * is not found, the first domain with id bigger than the input id can
+ * be returned to the caller via @pos.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -494,38 +512,29 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, &add_pos);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+		d = container_of(hdr, struct rdt_domain, hdr);
 
-	if (d) {
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -538,23 +547,70 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail_rcu(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del_rcu(&d->hdr.list);
+		synchronize_rcu();
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	lockdep_assert_held(&domain_list_lock);
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail_rcu(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
@@ -562,30 +618,45 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
 
@@ -601,6 +672,53 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	lockdep_assert_held(&domain_list_lock);
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del_rcu(&d->hdr.list);
+		synchronize_rcu();
+		domain_free(hw_dom);
+
+		return;
+	}
+}
+
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 6246f48b0449..8cc36723f077 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -231,7 +231,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	lockdep_assert_cpus_held();
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -556,6 +556,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -576,11 +577,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ab8a198d88b3..82a44de8136f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		/*
 		 * For the first limbo RMID in the domain,
 		 * setup up the limbo worker.
@@ -687,7 +687,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
 	pmbm_data = &dom_mbm->mbm_local[idx];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e07ee41c237d..20617a1d8261 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cpu_cacheinfo *ci;
 	int ret;
 	int i;
@@ -859,7 +859,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e6e2753738c9..7c1475f393ff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -1021,7 +1021,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1454,13 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	unsigned int size = 0;
 	int num_b, i;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->scope) {
+		if (ci->info_list[i].level == r->ctrl_scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
@@ -1518,7 +1518,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1608,7 +1608,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1732,7 +1732,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
@@ -2280,7 +2280,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2365,7 +2365,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2704,7 +2704,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
 	}
@@ -2828,10 +2828,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3284,7 +3284,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3299,7 +3299,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3930,15 +3930,19 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	mutex_lock(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
 
-	if (!r->mon_capable)
-		goto out_unlock;
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	mutex_lock(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3964,7 +3968,6 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 
 	domain_destroy_mon_state(d);
 
-out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
 }
 
@@ -3999,7 +4002,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	int err = 0;
 
@@ -4008,11 +4011,18 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		err = mba_sc_domain_allocate(r, d);
-		goto out_unlock;
 	}
 
-	if (!r->mon_capable)
-		goto out_unlock;
+	mutex_unlock(&rdtgroup_mutex);
+
+	return err;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	mutex_lock(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
@@ -4077,7 +4087,7 @@ void resctrl_offline_cpu(unsigned int cpu)
 	if (!l3->mon_capable)
 		goto out_unlock;
 
-	d = get_domain_from_cpu(cpu, l3);
+	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
-- 
2.44.0



* [PATCH v17 4/9] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (2 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 3/9] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-03 20:33 ` [PATCH v17 5/9] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. This wastes memory, because some of the fields are used only
by control functions while most are used only by monitor functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 48 ++++++++-------
 arch/x86/kernel/cpu/resctrl/internal.h    | 62 ++++++++++++--------
 arch/x86/kernel/cpu/resctrl/core.c        | 71 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 ++++++-------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 64 ++++++++++----------
 7 files changed, 174 insertions(+), 145 deletions(-)
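
Illustration only, not part of the patch: a minimal userspace sketch of
the split described in the commit message, assuming simplified stand-in
types. The names domain_hdr, ctrl_domain, mon_domain, hw_ctrl_domain,
hw_mon_domain, to_hw_ctrl_dom(), to_hw_mon_dom(), hdr_to_ctrl_domain()
and ctrl_domain_alloc() are hypothetical; they only mirror the shape of
rdt_domain_hdr, rdt_ctrl_domain, rdt_mon_domain and the
resctrl_to_arch_*_dom() helpers, and container_of() is redefined locally
because the kernel macro is not available in userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Simplified common header (the real one also carries a list and a cpumask). */
struct domain_hdr {
	int			id;
	enum domain_type	type;
};

/* Control side: only the state that allocation features need. */
struct ctrl_domain {
	struct domain_hdr	hdr;
	uint32_t		*mbps_val;
};

/* Monitor side: only the state that monitoring features need. */
struct mon_domain {
	struct domain_hdr	hdr;
	unsigned long		*rmid_busy_llc;
	int			mbm_work_cpu;
	int			cqm_work_cpu;
};

/* Arch-private wrappers, one per domain type. */
struct hw_ctrl_domain {
	struct ctrl_domain	d_resctrl;
	uint32_t		*ctrl_val;
};

struct hw_mon_domain {
	struct mon_domain	d_resctrl;
	uint64_t		*arch_mbm_total;
	uint64_t		*arch_mbm_local;
};

static inline struct hw_ctrl_domain *to_hw_ctrl_dom(struct ctrl_domain *d)
{
	return container_of(d, struct hw_ctrl_domain, d_resctrl);
}

static inline struct hw_mon_domain *to_hw_mon_dom(struct mon_domain *d)
{
	return container_of(d, struct hw_mon_domain, d_resctrl);
}

/* Recover the typed domain from a header, checking the type tag first. */
static struct ctrl_domain *hdr_to_ctrl_domain(struct domain_hdr *hdr)
{
	assert(hdr->type == CTRL_DOMAIN);
	return container_of(hdr, struct ctrl_domain, hdr);
}

/* Allocate a control domain with num_closid zeroed control values. */
static struct hw_ctrl_domain *ctrl_domain_alloc(int id, size_t num_closid)
{
	struct hw_ctrl_domain *hw = calloc(1, sizeof(*hw));

	if (!hw)
		return NULL;
	hw->ctrl_val = calloc(num_closid, sizeof(*hw->ctrl_val));
	if (!hw->ctrl_val) {
		free(hw);
		return NULL;
	}
	hw->d_resctrl.hdr.id = id;
	hw->d_resctrl.hdr.type = CTRL_DOMAIN;
	return hw;
}

/* Free only what a control domain owns; no monitor arrays to release. */
static void ctrl_domain_free(struct hw_ctrl_domain *hw)
{
	free(hw->ctrl_val);
	free(hw);
}
```

In this sketch each free routine releases only the arrays its own type
owns, and passing a monitor domain to a control helper becomes a compile
time type error rather than a silent misuse of unused fields.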

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 96ddf9ff3183..aa2c22a8e37b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -78,7 +78,23 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -87,13 +103,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -102,9 +113,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -208,7 +216,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -242,15 +250,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 void resctrl_online_cpu(unsigned int cpu);
 void resctrl_offline_cpu(unsigned int cpu);
 
@@ -279,7 +287,7 @@ void resctrl_offline_cpu(unsigned int cpu);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *arch_mon_ctx);
 
@@ -312,7 +320,7 @@ static inline void resctrl_arch_rmid_read_context_check(void)
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 closid, u32 rmid,
 			     enum resctrl_event_id eventid);
 
@@ -325,7 +333,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 377679b79919..135190e0711c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -147,7 +147,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -232,7 +232,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -355,25 +355,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -385,7 +401,7 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
  */
 struct msr_param {
 	struct rdt_resource	*res;
-	struct rdt_domain	*dom;
+	struct rdt_ctrl_domain	*dom;
 	u32			low;
 	u32			high;
 };
@@ -458,9 +474,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -564,22 +580,22 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
@@ -590,19 +606,19 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms,
 				int exclude_cpu);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
 			     int exclude_cpu);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 66a5a270d66f..cd58c9d4710f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -309,8 +309,8 @@ static void rdt_get_cdp_l2_config(void)
 
 static void mba_wrmsr_amd(struct msr_param *m)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
@@ -333,8 +333,8 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 
 static void mba_wrmsr_intel(struct msr_param *m)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	/*  Write the delay values for mba. */
@@ -344,17 +344,17 @@ static void mba_wrmsr_intel(struct msr_param *m)
 
 static void cat_wrmsr(struct msr_param *m)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	lockdep_assert_cpus_held();
 
@@ -367,9 +367,9 @@ struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 	return NULL;
 }
 
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	lockdep_assert_cpus_held();
 
@@ -440,18 +440,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -476,7 +481,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -515,10 +520,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -533,7 +538,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (hdr) {
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -553,7 +558,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -563,7 +568,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
@@ -571,9 +576,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -588,7 +593,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (hdr) {
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -604,7 +609,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -614,7 +619,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -629,9 +634,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
@@ -651,8 +656,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -661,12 +666,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		synchronize_rcu();
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -675,9 +680,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
@@ -697,15 +702,15 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8cc36723f077..3b9383612c35 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -60,7 +60,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -139,7 +139,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -208,8 +208,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -272,11 +272,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -297,17 +297,17 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
-	struct rdt_domain *d;
 	u32 idx;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -430,10 +430,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -442,7 +442,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -514,7 +514,7 @@ static int smp_mon_event_count(void *arg)
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	int cpu;
@@ -557,11 +557,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -582,7 +582,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 82a44de8136f..89d7e6fcbaa1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -209,7 +209,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -228,11 +228,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 unused, u32 rmid,
 			     enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -248,9 +248,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -269,12 +269,12 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *ignored)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -320,7 +320,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -378,7 +378,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
 }
 
-bool has_busy_rmid(struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_mon_domain *d)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 
@@ -479,7 +479,7 @@ int alloc_rmid(u32 closid)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	u32 idx;
 
 	lockdep_assert_held(&rdtgroup_mutex);
@@ -531,7 +531,7 @@ void free_rmid(u32 closid, u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
 				       u32 rmid, enum resctrl_event_id evtid)
 {
 	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -667,12 +667,12 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	u32 cur_bw, user_bw, idx;
 	struct list_head *head;
 	struct rdtgroup *entry;
@@ -733,7 +733,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
 		       u32 closid, u32 rmid)
 {
 	struct rmid_read rr;
@@ -791,12 +791,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
 void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -819,7 +819,7 @@ void cqm_handle_limbo(struct work_struct *work)
  * @exclude_cpu:   Which CPU the handler should not run on,
  *		   RESCTRL_PICK_ANY_CPU to pick any CPU.
  */
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
 			     int exclude_cpu)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -836,9 +836,9 @@ void mbm_handle_overflow(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
@@ -851,7 +851,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
@@ -885,7 +885,7 @@ void mbm_handle_overflow(struct work_struct *work)
  * @exclude_cpu:   Which CPU the handler should not run on,
  *		   RESCTRL_PICK_ANY_CPU to pick any CPU.
  */
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
 				int exclude_cpu)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 20617a1d8261..a4513d1d9f55 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7c1475f393ff..cc31ede1a1e7 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -92,8 +92,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -1012,7 +1012,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1243,7 +1243,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1298,7 +1298,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1329,10 +1329,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -1448,7 +1448,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	struct cpu_cacheinfo *ci;
 	unsigned int size = 0;
@@ -1480,9 +1480,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1594,7 +1594,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1602,7 +1602,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	cpus_read_lock();
@@ -1661,7 +1661,7 @@ static void mon_event_config_write(void *info)
 }
 
 static void mbm_config_write_domain(struct rdt_resource *r,
-				    struct rdt_domain *d, u32 evtid, u32 val)
+				    struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 
@@ -1702,7 +1702,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
@@ -2261,9 +2261,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2313,7 +2313,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2331,7 +2331,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2357,7 +2357,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2629,7 +2629,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2814,9 +2814,9 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2832,7 +2832,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -3025,7 +3025,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3074,7 +3074,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3096,7 +3096,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -3201,7 +3201,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3281,7 +3281,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3297,7 +3297,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3923,14 +3923,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	mutex_lock(&rdtgroup_mutex);
 
@@ -3940,7 +3940,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	mutex_lock(&rdtgroup_mutex);
 
@@ -3971,7 +3971,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	size_t tsize;
@@ -4002,7 +4002,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	int err = 0;
 
@@ -4018,7 +4018,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return err;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
@@ -4073,8 +4073,8 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
 void resctrl_offline_cpu(unsigned int cpu)
 {
 	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_mon_domain *d;
 	struct rdtgroup *rdtgrp;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
-- 
2.44.0


* [PATCH v17 5/9] x86/resctrl: Add node-scope to the options for feature scope
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (3 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 4/9] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-03 20:33 ` [PATCH v17 6/9] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

All currently supported resctrl features have domains scoped to match
either the L2 or the L3 cache.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
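
As a rough userspace illustration only (not part of this patch), the
scope-to-domain-id lookup can be sketched as below. The demo_* tables
and demo_domain_id_from_scope() are hypothetical stand-ins for the
kernel's cacheinfo and cpu_to_node() lookups; the topology (8 CPUs,
one L3 cache split into two nodes) is assumed for the example.

```c
/* Mirrors the enum touched by this patch; 2 and 3 double as cache levels. */
enum resctrl_scope {
	RESCTRL_L2_CACHE = 2,
	RESCTRL_L3_CACHE = 3,
	RESCTRL_NODE,
};

/* Hypothetical topology: 8 CPUs, four L2 caches, one L3, two NUMA nodes. */
#define DEMO_NR_CPUS 8
static const int demo_l2_id[DEMO_NR_CPUS]   = { 0, 0, 1, 1, 2, 2, 3, 3 };
static const int demo_l3_id[DEMO_NR_CPUS]   = { 0, 0, 0, 0, 0, 0, 0, 0 };
static const int demo_node_id[DEMO_NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Return the domain id of @cpu at @scope, or -1 for a bad cpu or scope. */
static int demo_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	if (cpu < 0 || cpu >= DEMO_NR_CPUS)
		return -1;

	switch (scope) {
	case RESCTRL_L2_CACHE:
		return demo_l2_id[cpu];
	case RESCTRL_L3_CACHE:
		return demo_l3_id[cpu];
	case RESCTRL_NODE:
		/* New case: the domain id is the NUMA node of the CPU. */
		return demo_node_id[cpu];
	default:
		break;
	}

	return -1;
}
```

With this assumed topology, CPU 5 lands in L3 domain 0 but in node
domain 1, which is the distinction SNC monitoring depends on.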

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index aa2c22a8e37b..5c7775343c3e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -176,6 +176,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cd58c9d4710f..c34ce367c456 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -510,6 +510,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.44.0


* [PATCH v17 6/9] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (4 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 5/9] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-03 20:33 ` [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring Tony Luck
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) The RMID renumbering applies when hardware uses the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, convert the logical RMID to the physical RMID for the SNC
   node being read and load that value into the IA32_QM_EVTSEL MSR.
4) Disable the "-o mba_MBps" mount option in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.
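
The arithmetic in items 1 to 3 can be sketched in userspace C as
below. This is an illustration under assumptions, not the kernel
code: struct demo_snc_cfg and the demo_* helpers are hypothetical
names, and the node index within an L3 cache is taken to be
"node % snc_nodes_per_l3_cache", which assumes the SNC nodes of one
L3 cache are numbered consecutively.

```c
#include <stdint.h>

/* Hypothetical snapshot of the values consulted at init time. */
struct demo_snc_cfg {
	unsigned int snc_nodes_per_l3_cache;	/* 1 when SNC is off or absent */
	uint32_t hw_num_rmid;			/* physical RMIDs per L3 cache */
	uint32_t hw_occ_scale;			/* occupancy scale from CPUID */
};

/* Item 1: logical RMIDs available per L3 cache. */
static uint32_t demo_num_rmid(const struct demo_snc_cfg *c)
{
	return c->hw_num_rmid / c->snc_nodes_per_l3_cache;
}

/* Item 2: "mon_scale" is divided by the same factor. */
static uint32_t demo_mon_scale(const struct demo_snc_cfg *c)
{
	return c->hw_occ_scale / c->snc_nodes_per_l3_cache;
}

/* Item 3: physical RMID to select when reading a counter for @node. */
static uint32_t demo_phys_rmid(const struct demo_snc_cfg *c, int node,
			       uint32_t logical_rmid)
{
	uint32_t idx = (uint32_t)node % c->snc_nodes_per_l3_cache;

	return logical_rmid + idx * demo_num_rmid(c);
}
```

For example, with 256 physical RMIDs and two SNC nodes per L3 cache,
each node gets 128 logical RMIDs and logical RMID 5 on the second
node maps to physical RMID 133. With snc_nodes_per_l3_cache == 1 the
offset is always zero, so non-SNC systems are unaffected.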

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  3 ++-
 4 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 135190e0711c..49440f194253 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -484,6 +484,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
 extern struct rdtgroup rdtgroup_default;
 extern struct dentry *debugfs_resctrl;

+extern unsigned int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c34ce367c456..cb181796f73b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -331,6 +331,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 	return r->default_ctrl;
 }

+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
 static void mba_wrmsr_intel(struct msr_param *m)
 {
 	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 89d7e6fcbaa1..d0bbeb410750 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -187,8 +187,18 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)

 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;

+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -197,7 +207,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);

 	if (msr_val & RMID_VAL_ERROR)
@@ -1022,8 +1032,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;

 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index cc31ede1a1e7..0923492a8bd0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2346,7 +2346,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }

 /*
-- 
2.44.0

* [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (5 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 6/9] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-10 21:24   ` Reinette Chatre
  2024-05-03 20:33 ` [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Add a field to the rdt_resource structure to record whether hardware
tracks monitoring resources at a different scope (NODE) from the
legacy L3 scope.

Add a field to the rdt_mon_domain structure to track the L3 cache id
which can be used to find all the domains that need resource counts
summed to provide accurate values in the legacy monitoring files.

When SNC is enabled create extra directories and files in each mon_data
directory to report per-SNC node counts.
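
The summing behaviour can be sketched in userspace C as below. This
is a simplified illustration, not the patch's implementation: struct
demo_mon_domain and demo_read_event() are hypothetical, the "count"
field stands in for a hardware counter read, and an array replaces
the kernel's domain list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a monitor domain. */
struct demo_mon_domain {
	int id;			/* per-SNC-node domain id */
	int display_id;		/* id of the L3 cache the node belongs to */
	uint64_t count;		/* stand-in for an event counter read */
};

/*
 * Read an event for domain @d. When @sum is non-zero, add up the
 * counts of every domain sharing @d's display_id (the legacy L3 view);
 * otherwise report only @d's own count (the per-SNC-node view).
 */
static uint64_t demo_read_event(const struct demo_mon_domain *doms, size_t n,
				const struct demo_mon_domain *d, int sum)
{
	uint64_t val = 0;

	if (!sum)
		return d->count;

	for (size_t i = 0; i < n; i++)
		if (doms[i].display_id == d->display_id)
			val += doms[i].count;

	return val;
}
```

With two nodes under one L3 cache holding counts of 100 and 50, the
per-node files would report 100 and 50 while the legacy file for that
cache would report 150.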

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |   4 +
 arch/x86/kernel/cpu/resctrl/internal.h    |   5 +-
 arch/x86/kernel/cpu/resctrl/core.c        |   2 +
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   1 +
 arch/x86/kernel/cpu/resctrl/monitor.c     |  52 +++++++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 115 +++++++++++++++++-----
 6 files changed, 137 insertions(+), 42 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 5c7775343c3e..2f8ac925bc18 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
 /**
  * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
+ * @display_id:		shared id used to identify domains to be summed for display
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
  */
 struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
+	int				display_id;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
@@ -187,6 +189,7 @@ enum resctrl_scope {
  * @num_rmid:		Number of RMIDs available
  * @ctrl_scope:		Scope of this resource for control functions
  * @mon_scope:		Scope of this resource for monitor functions
+ * @mon_display_scope:	Scope for user reporting monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @ctrl_domains:	RCU list of all control domains for this resource
@@ -207,6 +210,7 @@ struct rdt_resource {
 	int			num_rmid;
 	enum resctrl_scope	ctrl_scope;
 	enum resctrl_scope	mon_scope;
+	enum resctrl_scope	mon_display_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	ctrl_domains;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 49440f194253..d41b388bb499 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -132,6 +132,7 @@ struct mon_evt {
  *                     as kernfs private data
  * @rid:               Resource id associated with the event file
  * @evtid:             Event id associated with the event file
+ * @sum:               Sum across domains with same display_id
  * @domid:             The domain to which the event file belongs
  * @u:                 Name of the bit fields struct
  */
@@ -139,7 +140,8 @@ union mon_data_bits {
 	void *priv;
 	struct {
 		unsigned int rid		: 10;
-		enum resctrl_event_id evtid	: 8;
+		enum resctrl_event_id evtid	: 7;
+		unsigned int sum		: 1;
 		unsigned int domid		: 14;
 	} u;
 };
@@ -150,6 +152,7 @@ struct rmid_read {
 	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
+	bool			sumdomains;
 	int			err;
 	u64			val;
 	void			*arch_mon_ctx;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cb181796f73b..a949e69308cd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -71,6 +71,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.name			= "L3",
 			.ctrl_scope		= RESCTRL_L3_CACHE,
 			.mon_scope		= RESCTRL_L3_CACHE,
+			.mon_display_scope	= RESCTRL_L3_CACHE,
 			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
 			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
@@ -613,6 +614,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->display_id = get_domain_id_from_scope(cpu, r->mon_display_scope);
 	d->hdr.type = RESCTRL_MON_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3b9383612c35..a4ead8ffbaf3 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -575,6 +575,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	resid = md.u.rid;
 	domid = md.u.domid;
 	evtid = md.u.evtid;
+	rr.sumdomains = md.u.sum;
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d0bbeb410750..2e795b261b6f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -16,6 +16,7 @@
  */
 
 #include <linux/cpu.h>
+#include <linux/cacheinfo.h>
 #include <linux/module.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
@@ -187,18 +188,8 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	int cpu = smp_processor_id();
-	int rmid_offset = 0;
 	u64 msr_val;
 
-	/*
-	 * When SNC mode is on, need to compute the offset to read the
-	 * physical RMID counter for the node to which this CPU belongs.
-	 */
-	if (snc_nodes_per_l3_cache > 1)
-		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
-
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -207,7 +198,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -291,7 +282,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	resctrl_arch_rmid_read_context_check();
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
+	if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -556,7 +547,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
 	}
 }
 
-static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr, u64 *rrval)
 {
 	struct mbm_state *m;
 	u64 tval = 0;
@@ -574,11 +565,44 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 	if (rr->err)
 		return rr->err;
 
-	rr->val += tval;
+	*rrval += tval;
 
 	return 0;
 }
 
+static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
+{
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
+
+	return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+}
+
+static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+{
+	struct rdt_mon_domain *d;
+	struct rmid_read tmp;
+	u32 node_rmid;
+	int ret = 0;
+
+	if (!rr->sumdomains) {
+		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
+		return ___mon_event_count(closid, node_rmid, rr, &rr->val);
+	}
+
+	tmp = *rr;
+	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
+		if (d->display_id == rr->d->display_id) {
+			tmp.d = d;
+			node_rmid = get_node_rmid(rr->r, d, rmid);
+			ret = ___mon_event_count(closid, node_rmid, &tmp, &rr->val);
+			if (ret)
+				break;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * mbm_bw_count() - Update bw count from values previously read by
  *		    __mon_event_count().
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 0923492a8bd0..a56ae08ca255 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3011,57 +3011,118 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
  * and monitor groups with given domain id.
  */
 static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   unsigned int dom_id)
+					   struct rdt_mon_domain *d)
 {
 	struct rdtgroup *prgrp, *crgrp;
+	struct rdt_mon_domain *dom;
+	bool remove_all = true;
+	struct kernfs_node *kn;
+	char subname[32];
 	char name[32];
 
+	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
+	if (r->mon_scope != r->mon_display_scope) {
+		int count = 0;
+
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
+			if (d->display_id == dom->display_id)
+				count++;
+		if (count > 1) {
+			remove_all = false;
+			sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+		}
+	}
+
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		sprintf(name, "mon_%s_%02d", r->name, dom_id);
-		kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+		if (remove_all) {
+			kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+		} else {
+			kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);
+			if (kn)
+				kernfs_remove_by_name(kn, subname);
+		}
 
-		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
-			kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+			if (remove_all) {
+				kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+			} else {
+				kn = kernfs_find_and_get_ns(crgrp->mon.mon_data_kn, name, NULL);
+				if (kn)
+					kernfs_remove_by_name(kn, subname);
+			}
+		}
 	}
 }
 
-static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_mon_domain *d,
-				struct rdt_resource *r, struct rdtgroup *prgrp)
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+			     struct rdt_resource *r, struct rdtgroup *prgrp,
+			     bool do_sum)
 {
 	union mon_data_bits priv;
-	struct kernfs_node *kn;
 	struct mon_evt *mevt;
 	struct rmid_read rr;
-	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
-	/* create the directory */
-	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
-	if (IS_ERR(kn))
-		return PTR_ERR(kn);
-
-	ret = rdtgroup_kn_set_ugid(kn);
-	if (ret)
-		goto out_destroy;
-
-	if (WARN_ON(list_empty(&r->evt_list))) {
-		ret = -EPERM;
-		goto out_destroy;
-	}
+	if (WARN_ON(list_empty(&r->evt_list)))
+		return -EPERM;
 
 	priv.u.rid = r->rid;
 	priv.u.domid = d->hdr.id;
+	priv.u.sum = do_sum;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
-			goto out_destroy;
+			return ret;
 
 		if (is_mbm_event(mevt->evtid))
 			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
 	}
+
+	return 0;
+}
+
+static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
+				struct rdt_mon_domain *d,
+				struct rdt_resource *r, struct rdtgroup *prgrp)
+{
+	struct kernfs_node *kn, *ckn;
+	char name[32];
+	bool do_sum;
+	int ret;
+
+	do_sum = r->mon_scope != r->mon_display_scope;
+	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
+	kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
+	if (!kn) {
+		/* create the directory */
+		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+		if (IS_ERR(kn))
+			return PTR_ERR(kn);
+
+		ret = rdtgroup_kn_set_ugid(kn);
+		if (ret)
+			goto out_destroy;
+		ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
+		if (ret)
+			goto out_destroy;
+	}
+
+	if (do_sum) {
+		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+		ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
+		if (IS_ERR(ckn))
+			goto out_destroy;
+
+		ret = rdtgroup_kn_set_ugid(ckn);
+		if (ret)
+			goto out_destroy;
+
+		ret = mon_add_all_files(ckn, d, r, prgrp, false);
+		if (ret)
+			goto out_destroy;
+	}
+
 	kernfs_activate(kn);
 	return 0;
 
@@ -3077,8 +3138,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 					   struct rdt_mon_domain *d)
 {
-	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
+	struct kernfs_node *parent_kn;
 	struct list_head *head;
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -3950,7 +4011,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
 	 * per domain monitor data directories.
 	 */
 	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
+		rmdir_mondata_subdir_allrdtgrp(r, d);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.44.0



* [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (6 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-10 21:24   ` Reinette Chatre
  2024-05-03 20:33 ` [PATCH v17 9/9] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2024-05-14 15:02 ` [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Maciej Wieczor-Retman
  9 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.
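
As an illustration only, the inference can be modeled in standalone C.
This is a sketch under stated assumptions, not the kernel code:
infer_snc_nodes_per_l3() and its parameters are hypothetical names, and
the caller is assumed to have already counted the NUMA nodes that
contain CPUs and the distinct L3 cache instances. The checks mirror the
sanity checks in snc_get_config() in the diff below.

```c
#include <assert.h>

/*
 * Hypothetical userspace model, not kernel code.
 * cpu_nodes:     NUMA nodes that contain at least one CPU
 * num_l3_caches: distinct L3 cache instances seen on those nodes
 * Returns the inferred number of SNC nodes per L3 cache, or 1
 * (treat SNC as off) when the counts are inconsistent.
 */
static int infer_snc_nodes_per_l3(int cpu_nodes, int num_l3_caches)
{
	int ratio;

	if (cpu_nodes <= 0 || num_l3_caches <= 0)
		return 1;

	/* CPU nodes must divide evenly among the L3 caches. */
	if (cpu_nodes % num_l3_caches)
		return 1;

	ratio = cpu_nodes / num_l3_caches;
	assert(ratio * num_l3_caches == cpu_nodes);

	/* Only 1, 2, 3 and 4 nodes per L3 cache are treated as valid. */
	if (ratio > 4)
		return 1;

	return ratio;
}
```

For example, a two-socket system that reports four CPU nodes and two L3
caches yields 2, while three CPU nodes and two L3 caches is rejected
and treated as SNC off.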

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
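
Because of this renumbering, code that reads a counter has to add a
per-node offset to the logical RMID. A minimal sketch of that
arithmetic, modeled on get_node_rmid() from patch 7, follows.
snc_physical_rmid() and its parameter names are hypothetical, and the
sketch assumes num_rmid is the number of RMIDs available to each SNC
node after the split.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical standalone model of the offset arithmetic; not kernel
 * code. "node" is the NUMA node id of a CPU in the monitor domain and
 * "num_rmid" is assumed to be the per-SNC-node RMID count.
 */
static uint32_t snc_physical_rmid(uint32_t rmid, int node,
				  int snc_nodes_per_l3, uint32_t num_rmid)
{
	assert(snc_nodes_per_l3 >= 1 && node >= 0 && rmid < num_rmid);

	/* Position of the node within its L3 cache selects the RMID bank. */
	return rmid + (uint32_t)(node % snc_nodes_per_l3) * num_rmid;
}
```

With two SNC nodes per L3 cache and 128 RMIDs per node, logical RMID 5
maps to counter 5 on even-numbered nodes and counter 133 on
odd-numbered nodes.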

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e72c2b872957..ce54a1ffe1e5 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1165,6 +1165,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a949e69308cd..6a1727ea1dfe 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -21,7 +21,9 @@
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
 
+#include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/resctrl.h>
 #include "internal.h"
@@ -746,11 +748,42 @@ static void clear_closid_rmid(int cpu)
 	      RESCTRL_RESERVED_CLOSID);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * get_node_rmid() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+	u64 val;
+
+	/* Only need to enable once per package. */
+	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 static int resctrl_arch_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
 	mutex_lock(&domain_list_lock);
+
+	if (snc_nodes_per_l3_cache > 1)
+		snc_remap_rmids(cpu);
+
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	mutex_unlock(&domain_list_lock);
@@ -990,11 +1023,97 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	unsigned long *node_caches;
+	int mem_only_nodes = 0;
+	int cpu, node, ret;
+	int num_l3_caches;
+	int cache_id;
+
+	if (!x86_match_cpu(snc_cpu_ids))
+		return 1;
+
+	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		if (cpu < nr_cpu_ids) {
+			cache_id = get_cpu_cacheinfo_id(cpu, 3);
+			if (cache_id != -1)
+				set_bit(cache_id, node_caches);
+		} else {
+			mem_only_nodes++;
+		}
+	}
+	cpus_read_unlock();
+
+	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+	kfree(node_caches);
+
+	if (!num_l3_caches)
+		goto insane;
+
+	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+		goto insane;
+
+	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
+	switch (ret) {
+	case 1:
+		break;
+	case 2:
+	case 3:
+	case 4:
+		pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+		break;
+	default:
+		goto insane;
+	}
+
+	return ret;
+insane:
+	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+		(nr_node_ids - mem_only_nodes), num_l3_caches);
+	return 1;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
-- 
2.44.0



* [PATCH v17 9/9] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (7 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2024-05-03 20:33 ` Tony Luck
  2024-05-14 15:02 ` [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Maciej Wieczor-Retman
  9 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-03 20:33 UTC (permalink / raw
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

*** This patch needs updating for new files for monitoring ***

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
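
For orientation, the directory names that patch 7 creates can be
sketched in standalone C. This is an illustration, not the kernel code:
snc_mon_dir_names() is a hypothetical helper, and it assumes the "XX"
suffix is the L3 cache id (display_id) and the "YY" suffix is the SNC
node id (hdr.id), as in the sprintf() calls in mkdir_mondata_subdir().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code: build the per-L3 directory
 * name and the per-SNC-node subdirectory name for a resource.
 * Returns 0 on success, -1 if a name is empty or would be truncated.
 */
static int snc_mon_dir_names(const char *res_name, int l3_id, int node_id,
			     char *parent, size_t parent_len,
			     char *sub, size_t sub_len)
{
	int n;

	assert(res_name && parent && sub);
	if (strlen(res_name) == 0)
		return -1;

	/* Top-level directory: one per L3 cache instance. */
	n = snprintf(parent, parent_len, "mon_%s_%02d", res_name, l3_id);
	if (n < 0 || (size_t)n >= parent_len)
		return -1;

	/* Subdirectory: one per SNC node sharing that L3 cache. */
	n = snprintf(sub, sub_len, "mon_sub_%s_%02d", res_name, node_id);
	if (n < 0 || (size_t)n >= sub_len)
		return -1;

	return 0;
}
```

For resource "L3", L3 cache id 0 and node 1 this produces "mon_L3_00"
and "mon_sub_L3_01".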

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 627e23869bca..401f6bfb4a3c 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
 	all tasks in the group. In CTRL_MON groups these files provide
 	the sum for all tasks in the CTRL_MON group and all tasks in
 	MON groups. Please see example section for more details on usage.
+	On systems with Sub-NUMA Cluster (SNC) mode enabled there are extra
+	directories for each node (located within the "mon_L3_XX" directory
+	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+	where "YY" is the node number.
 
 "mon_hw_id":
 	Available only with debug option. The identifier used by hardware
@@ -484,6 +488,19 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" files in
+the "mon_sub_L3_YY" directories to get node-local data.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.44.0



* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-03 20:33 ` [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring Tony Luck
@ 2024-05-10 21:24   ` Reinette Chatre
  2024-05-13 17:05     ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-10 21:24 UTC (permalink / raw
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 5/3/2024 1:33 PM, Tony Luck wrote:

(Could you please start the changelog with some context?)

> Add a field to the rdt_resource structure to track whether monitoring
> resources are tracked by hardware at a different scope (NODE) from
> the legacy L3 scope.

This seems to describe @mon_scope that was introduced in patch #3?

> 
> Add a field to the rdt_mon_domain structure to track the L3 cache id
> which can be used to find all the domains that need resource counts
> summed to provide accurate values in the legacy monitoring files.

Why is this field necessary? Can this not be obtained dynamically?


> 
> When SNC is enabled create extra directories and files in each mon_data
> directory to report per-SNC node counts.

The above cryptic sentence is the closest the changelog gets to explaining
what this patch aims to do. Could you please enhance the changelog to
describe what this patch aims to do and more importantly how it goes about
doing so? This patch contains a significant number of undocumented quirks 
and between the cryptic changelog and undocumented quirks in the patch I find
it very hard to understand what it is trying to do and why.

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                   |   4 +
>  arch/x86/kernel/cpu/resctrl/internal.h    |   5 +-
>  arch/x86/kernel/cpu/resctrl/core.c        |   2 +
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   1 +
>  arch/x86/kernel/cpu/resctrl/monitor.c     |  52 +++++++---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 115 +++++++++++++++++-----
>  6 files changed, 137 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 5c7775343c3e..2f8ac925bc18 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
>  /**
>   * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
>   * @hdr:		common header for different domain types
> + * @display_id:		shared id used to identify domains to be summed for display
>   * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
>   * @mbm_total:		saved state for MBM total bandwidth
>   * @mbm_local:		saved state for MBM local bandwidth
> @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
>   */
>  struct rdt_mon_domain {
>  	struct rdt_domain_hdr		hdr;
> +	int				display_id;

(it is not clear to me why this is needed)

>  	unsigned long			*rmid_busy_llc;
>  	struct mbm_state		*mbm_total;
>  	struct mbm_state		*mbm_local;
> @@ -187,6 +189,7 @@ enum resctrl_scope {
>   * @num_rmid:		Number of RMIDs available
>   * @ctrl_scope:		Scope of this resource for control functions
>   * @mon_scope:		Scope of this resource for monitor functions
> + * @mon_display_scope:	Scope for user reporting monitor functions
>   * @cache:		Cache allocation related data
>   * @membw:		If the component has bandwidth controls, their properties.
>   * @ctrl_domains:	RCU list of all control domains for this resource
> @@ -207,6 +210,7 @@ struct rdt_resource {
>  	int			num_rmid;
>  	enum resctrl_scope	ctrl_scope;
>  	enum resctrl_scope	mon_scope;
> +	enum resctrl_scope	mon_display_scope;
>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
>  	struct list_head	ctrl_domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 49440f194253..d41b388bb499 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -132,6 +132,7 @@ struct mon_evt {
>   *                     as kernfs private data
>   * @rid:               Resource id associated with the event file
>   * @evtid:             Event id associated with the event file
> + * @sum:               Sum across domains with same display_id
>   * @domid:             The domain to which the event file belongs
>   * @u:                 Name of the bit fields struct
>   */
> @@ -139,7 +140,8 @@ union mon_data_bits {
>  	void *priv;
>  	struct {
>  		unsigned int rid		: 10;
> -		enum resctrl_event_id evtid	: 8;
> +		enum resctrl_event_id evtid	: 7;
> +		unsigned int sum		: 1;
>  		unsigned int domid		: 14;
>  	} u;

(No explanation about why evtid had to shrink and why it is ok
to do so.)

>  };
> @@ -150,6 +152,7 @@ struct rmid_read {
>  	struct rdt_mon_domain	*d;
>  	enum resctrl_event_id	evtid;
>  	bool			first;
> +	bool			sumdomains;
>  	int			err;
>  	u64			val;
>  	void			*arch_mon_ctx;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index cb181796f73b..a949e69308cd 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -71,6 +71,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  			.name			= "L3",
>  			.ctrl_scope		= RESCTRL_L3_CACHE,
>  			.mon_scope		= RESCTRL_L3_CACHE,
> +			.mon_display_scope	= RESCTRL_L3_CACHE,
>  			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
>  			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
>  			.parse_ctrlval		= parse_cbm,
> @@ -613,6 +614,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>  
>  	d = &hw_dom->d_resctrl;
>  	d->hdr.id = id;
> +	d->display_id = get_domain_id_from_scope(cpu, r->mon_display_scope);
>  	d->hdr.type = RESCTRL_MON_DOMAIN;
>  	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3b9383612c35..a4ead8ffbaf3 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -575,6 +575,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  	resid = md.u.rid;
>  	domid = md.u.domid;
>  	evtid = md.u.evtid;
> +	rr.sumdomains = md.u.sum;
>  
>  	r = &rdt_resources_all[resid].r_resctrl;
>  	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index d0bbeb410750..2e795b261b6f 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -16,6 +16,7 @@
>   */
>  
>  #include <linux/cpu.h>
> +#include <linux/cacheinfo.h>

Can this be alphabetical?

>  #include <linux/module.h>
>  #include <linux/sizes.h>
>  #include <linux/slab.h>
> @@ -187,18 +188,8 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>  
>  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  {
> -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> -	int cpu = smp_processor_id();
> -	int rmid_offset = 0;
>  	u64 msr_val;
>  
> -	/*
> -	 * When SNC mode is on, need to compute the offset to read the
> -	 * physical RMID counter for the node to which this CPU belongs.
> -	 */
> -	if (snc_nodes_per_l3_cache > 1)
> -		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> -

This removes code that was just added in previous patch. Can the end goal
be reached without this churn? I expect doing so will make this patch easier to
follow.

>  	/*
>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>  	 * with a valid event code for supported resource type and the bits
> @@ -207,7 +198,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>  
>  	if (msr_val & RMID_VAL_ERROR)
> @@ -291,7 +282,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>  
>  	resctrl_arch_rmid_read_context_check();
>  
> -	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> +	if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
>  		return -EINVAL;

Does this mean that when SNC is enabled then reading data for an event within a particular
monitor domain ("node scope") can read its data from any CPU within the L3 domain
("mon_display_scope") even if that CPU is not associated with the node for which it
is reading the data?

If so this really turns many resctrl assumptions and architecture on its head since the
resctrl expectation is that only CPUs within a domain's cpumask can be used to interact
with the domain. This in turn makes this seemingly general feature actually SNC specific.

  
>  	ret = __rmid_read(rmid, eventid, &msr_val);
> @@ -556,7 +547,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
>  	}
>  }
>  
> -static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> +static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr, u64 *rrval)
>  {
>  	struct mbm_state *m;
>  	u64 tval = 0;
> @@ -574,11 +565,44 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>  	if (rr->err)
>  		return rr->err;
>  
> -	rr->val += tval;
> +	*rrval += tval;
>  

Why is rrval needed?

>  	return 0;
>  }
>  
> +static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
> +{
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
> +
> +	return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +}
> +
> +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> +{
> +	struct rdt_mon_domain *d;
> +	struct rmid_read tmp;
> +	u32 node_rmid;
> +	int ret = 0;
> +
> +	if (!rr->sumdomains) {
> +		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
> +		return ___mon_event_count(closid, node_rmid, rr, &rr->val);
> +	}
> +
> +	tmp = *rr;
> +	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> +		if (d->display_id == rr->d->display_id) {
> +			tmp.d = d;
> +			node_rmid = get_node_rmid(rr->r, d, rmid);
> +			ret = ___mon_event_count(closid, node_rmid, &tmp, &rr->val);

If I understand correctly this function is run per IPI on a CPU associated
with one of the monitor domains (depends on which one came online first),
and then it will read the monitor data of the other domains from the same
CPU? This is unexpected since the expectation is that monitor data
needs to be read from a CPU associated with the domain it is
reading data for.

Also, providing tmp as well as rr->val seems unnecessary?

> +			if (ret)
> +				break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
>  /*
>   * mbm_bw_count() - Update bw count from values previously read by
>   *		    __mon_event_count().
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 0923492a8bd0..a56ae08ca255 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3011,57 +3011,118 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>   * and monitor groups with given domain id.
>   */
>  static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> -					   unsigned int dom_id)
> +					   struct rdt_mon_domain *d)
>  {
>  	struct rdtgroup *prgrp, *crgrp;
> +	struct rdt_mon_domain *dom;
> +	bool remove_all = true;
> +	struct kernfs_node *kn;
> +	char subname[32];
>  	char name[32];
>  
> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> +	if (r->mon_scope != r->mon_display_scope) {
> +		int count = 0;
> +
> +		list_for_each_entry(dom, &r->mon_domains, hdr.list)
> +			if (d->display_id == dom->display_id)
> +				count++;
> +		if (count > 1) {
> +			remove_all = false;
> +			sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +		}
> +	}


This seems awkward. I wonder if it may not be simpler to just
remove the directory and on completion check if the parent has
any subdirectories left and remove the parent if there are no
subdirectories remaining. Something possible via reading the inode's
i_nlink that is accessible via kernfs_get_inode(). What do you think?

> +
>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> -		sprintf(name, "mon_%s_%02d", r->name, dom_id);
> -		kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
> +		if (remove_all) {
> +			kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
> +		} else {
> +			kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);
> +			if (kn)
> +				kernfs_remove_by_name(kn, subname);
> +		}
>  
> -		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> -			kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
> +		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
> +			if (remove_all) {
> +				kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
> +			} else {
> +				kn = kernfs_find_and_get_ns(crgrp->mon.mon_data_kn, name, NULL);
> +				if (kn)
> +					kernfs_remove_by_name(kn, subname);
> +			}
> +		}
>  	}
>  }
>  
> -static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> -				struct rdt_mon_domain *d,
> -				struct rdt_resource *r, struct rdtgroup *prgrp)
> +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> +			     struct rdt_resource *r, struct rdtgroup *prgrp,
> +			     bool do_sum)
>  {
>  	union mon_data_bits priv;
> -	struct kernfs_node *kn;
>  	struct mon_evt *mevt;
>  	struct rmid_read rr;
> -	char name[32];
>  	int ret;
>  
> -	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> -	/* create the directory */
> -	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> -	if (IS_ERR(kn))
> -		return PTR_ERR(kn);
> -
> -	ret = rdtgroup_kn_set_ugid(kn);
> -	if (ret)
> -		goto out_destroy;
> -
> -	if (WARN_ON(list_empty(&r->evt_list))) {
> -		ret = -EPERM;
> -		goto out_destroy;
> -	}
> +	if (WARN_ON(list_empty(&r->evt_list)))
> +		return -EPERM;
>  
>  	priv.u.rid = r->rid;
>  	priv.u.domid = d->hdr.id;
> +	priv.u.sum = do_sum;
>  	list_for_each_entry(mevt, &r->evt_list, list) {
>  		priv.u.evtid = mevt->evtid;
>  		ret = mon_addfile(kn, mevt->name, priv.priv);
>  		if (ret)
> -			goto out_destroy;
> +			return ret;
>  
>  		if (is_mbm_event(mevt->evtid))
>  			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);

I do not think that the "do_sum" file should be doing any initialization, this
will be repeated for the "real" mon domain, no?

>  	}
> +
> +	return 0;
> +}
> +
> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> +				struct rdt_mon_domain *d,
> +				struct rdt_resource *r, struct rdtgroup *prgrp)
> +{
> +	struct kernfs_node *kn, *ckn;
> +	char name[32];
> +	bool do_sum;
> +	int ret;
> +
> +	do_sum = r->mon_scope != r->mon_display_scope;
> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> +	kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> +	if (!kn) {
> +		/* create the directory */
> +		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> +		if (IS_ERR(kn))
> +			return PTR_ERR(kn);
> +
> +		ret = rdtgroup_kn_set_ugid(kn);
> +		if (ret)
> +			goto out_destroy;
> +		ret = mon_add_all_files(kn, d, r, prgrp, do_sum);

This does not look right. If I understand correctly the private data
of these event files will have whichever mon domain came up first as
its domain id. That seems completely arbitrary and does not reflect
accurate state for this file. Since "do_sum" is essentially a "flag"
on how this file can be treated, can its "dom_id" not rather be
the "monitor scope domain id"? Could that not help to eliminate 
that per-domain "display_id"?

> +		if (ret)
> +			goto out_destroy;
> +	}
> +
> +	if (do_sum) {
> +		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +		ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> +		if (IS_ERR(ckn))
> +			goto out_destroy;
> +
> +		ret = rdtgroup_kn_set_ugid(ckn);
> +		if (ret)
> +			goto out_destroy;
> +
> +		ret = mon_add_all_files(ckn, d, r, prgrp, false);
> +		if (ret)
> +			goto out_destroy;
> +	}
> +
>  	kernfs_activate(kn);
>  	return 0;
>  
> @@ -3077,8 +3138,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>  static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>  					   struct rdt_mon_domain *d)
>  {
> -	struct kernfs_node *parent_kn;
>  	struct rdtgroup *prgrp, *crgrp;
> +	struct kernfs_node *parent_kn;
>  	struct list_head *head;
>  
>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> @@ -3950,7 +4011,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
>  	 * per domain monitor data directories.
>  	 */
>  	if (resctrl_mounted && resctrl_arch_mon_capable())
> -		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
> +		rmdir_mondata_subdir_allrdtgrp(r, d);
>  
>  	if (is_mbm_enabled())
>  		cancel_delayed_work(&d->mbm_over);

Reinette


* Re: [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-05-03 20:33 ` [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
@ 2024-05-10 21:24   ` Reinette Chatre
  2024-05-13 17:17     ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-10 21:24 UTC (permalink / raw
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 5/3/2024 1:33 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
> 
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> 
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/msr-index.h   |   1 +
>  arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
>  2 files changed, 120 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e72c2b872957..ce54a1ffe1e5 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1165,6 +1165,7 @@
>  #define MSR_IA32_QM_CTR			0xc8e
>  #define MSR_IA32_PQR_ASSOC		0xc8f
>  #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>  #define MSR_IA32_L2_CBM_BASE		0xd10
>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index a949e69308cd..6a1727ea1dfe 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -21,7 +21,9 @@
>  #include <linux/err.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>  
> +#include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  #include <asm/resctrl.h>
>  #include "internal.h"
> @@ -746,11 +748,42 @@ static void clear_closid_rmid(int cpu)
>  	      RESCTRL_RESERVED_CLOSID);
>  }
>  
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * couning operations from tasks. Code to read the counters
> + * must adjust RMID counter numbers based on SNC node. See
> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> +	u64 val;
> +
> +	/* Only need to enable once per package. */
> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>  static int resctrl_arch_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
>  	mutex_lock(&domain_list_lock);
> +
> +	if (snc_nodes_per_l3_cache > 1)
> +		snc_remap_rmids(cpu);
> +
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	mutex_unlock(&domain_list_lock);
> @@ -990,11 +1023,97 @@ static __init bool get_rdt_resources(void)
>  	return (rdt_mon_capable || rdt_alloc_capable);
>  }
>  
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> +	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * ratio of NUMA nodes to L3 cache instances.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> +	unsigned long *node_caches;
> +	int mem_only_nodes = 0;
> +	int cpu, node, ret;
> +	int num_l3_caches;
> +	int cache_id;
> +
> +	if (!x86_match_cpu(snc_cpu_ids))
> +		return 1;
> +
> +	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +
> +	if (num_online_cpus() != num_present_cpus())
> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +
> +	for_each_node(node) {
> +		cpu = cpumask_first(cpumask_of_node(node));
> +		if (cpu < nr_cpu_ids) {
> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> +			if (cache_id != -1)
> +				set_bit(cache_id, node_caches);
> +		} else {
> +			mem_only_nodes++;
> +		}
> +	}
> +	cpus_read_unlock();
> +
> +	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
> +	kfree(node_caches);
> +
> +	if (!num_l3_caches)
> +		goto insane;
> +
> +	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
> +	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
> +		goto insane;
> +
> +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +
> +	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
> +	switch (ret) {
> +	case 1:
> +		break;
> +	case 2:
> +	case 3:
> +	case 4:
> +		pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +		break;
> +	default:
> +		goto insane;
> +	}
> +
> +	return ret;
> +insane:
> +	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
> +		(nr_node_ids - mem_only_nodes), num_l3_caches);
> +	return 1;
> +}

I find it confusing how dramatically this SNC detection code changed without
any explanations. This detection seems to match the SNC detection code from v16 but
after v16 you posted a new SNC detection implementation that did SNC detection totally
differently [1] from v16. Instead of keeping with the "new" detection this implements
what was in v16. Could you please help me understand what motivated the different
implementations and why the big differences?

Reinette


[1] https://lore.kernel.org/lkml/20240327200352.236835-11-tony.luck@intel.com/
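
For readers following the thread, the ratio-based inference in the quoted
snc_get_config() can be reduced to a few lines of arithmetic. This is a
minimal userspace sketch, not the kernel code: the function name
snc_ratio_sketch() and its two integer parameters are hypothetical, and the
node and cache counts are assumed to have been gathered already.

```c
/*
 * Hypothetical userspace model of the ratio check in the quoted patch.
 * cpu_nodes:      NUMA nodes that contain at least one CPU
 * num_l3_caches:  distinct L3 cache instances seen on those nodes
 * Returns the number of SNC nodes per L3 cache, or 1 (SNC off) when the
 * inputs fail either sanity check.
 */
static int snc_ratio_sketch(int cpu_nodes, int num_l3_caches)
{
	int ret;

	if (num_l3_caches <= 0)
		return 1;

	/* Sanity check #1: CPU nodes must be a multiple of the L3 count. */
	if (cpu_nodes % num_l3_caches)
		return 1;

	ret = cpu_nodes / num_l3_caches;

	/* Sanity check #2: only 1, 2, 3 and 4 are accepted. */
	if (ret < 1 || ret > 4)
		return 1;

	return ret;
}
```

With two L3 caches, four CPU nodes give a ratio of 2 (SNC enabled), while
three CPU nodes fail the first check and fall back to 1.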


* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-10 21:24   ` Reinette Chatre
@ 2024-05-13 17:05     ` Tony Luck
  2024-05-13 18:53       ` Reinette Chatre
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-13 17:05 UTC
  To: Reinette Chatre
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
> Hi Tony,

Hi Reinette,

Thanks for the review. Detailed comments below. But overall I'm
going to split patch 7 into a bunch of smaller changes, each with
a better commit message.

> On 5/3/2024 1:33 PM, Tony Luck wrote:
> 
> (Could you please start the changelog with some context?)
> 
> > Add a field to the rdt_resource structure to track whether monitoring
> > resources are tracked by hardware at a different scope (NODE) from
> > the legacy L3 scope.
> 
> This seems to describe @mon_scope that was introduced in patch #3?

Not really. Patch #3 made the change so that control and monitor
functions can have different scope. That's still needed as with SNC
enabled the underlying data collection is at the node level for
monitoring, while control stays at the L3 cache scope.

This new field describes the legacy scope of monitoring, so that
resctrl can provide correctly scoped monitor files for legacy
applications that aren't aware of SNC. So I'm using this to indicate
whether SNC is enabled (mon_scope != mon_display_scope) or disabled
(when they are the same).
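
A minimal userspace sketch of that comparison, for illustration only. The
field names mon_scope and mon_display_scope come from the patch; the enum,
struct res_sketch and snc_enabled_sketch() are hypothetical stand-ins and
not part of the series.

```c
#include <stdbool.h>

/* Illustrative scope values; the kernel enum has its own definition. */
enum scope_sketch {
	SKETCH_L3_CACHE,
	SKETCH_NODE,
};

/* Hypothetical cut-down stand-in for struct rdt_resource. */
struct res_sketch {
	enum scope_sketch mon_scope;         /* scope at which hardware counts */
	enum scope_sketch mon_display_scope; /* scope reported by legacy files */
};

/* SNC is considered enabled exactly when the two scopes differ. */
static bool snc_enabled_sketch(const struct res_sketch *r)
{
	return r->mon_scope != r->mon_display_scope;
}
```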

> > 
> > Add a field to the rdt_mon_domain structure to track the L3 cache id
> > which can be used to find all the domains that need resource counts
> > summed to provide accurate values in the legacy monitoring files.
> 
> Why is this field necessary? Can this not be obtained dynamically?

I could compute it each time I need it (when making/removing
directories, or finding which SNC domains share an L3 domain).

	id = get_domain_id_from_scope(cpumask_any(&d->hdr.cpu_mask), r->mon_display_scope);
	if (id < 0)
		// error path

But it seemed better to just discover this once at domain creation time.
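
As a rough illustration of why a cached id is convenient, here is a
userspace sketch that counts how many monitor domains share one L3 id, the
same test the quoted rmdir code uses to decide between removing the whole
directory and removing only a sub-directory. struct dom_sketch and
count_sharing_l3() are hypothetical names; only the id/display_id split
follows the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct rdt_mon_domain. */
struct dom_sketch {
	int id;         /* per-node domain id (hdr.id in the patch) */
	int display_id; /* L3 cache id, looked up once at domain creation */
};

/* Count the domains whose cached display_id matches the given L3 id. */
static int count_sharing_l3(const struct dom_sketch *doms, size_t n,
			    int display_id)
{
	int count = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (doms[i].display_id == display_id)
			count++;

	return count;
}
```

A count greater than one means other SNC nodes behind the same L3 are still
online, so only the per-node sub-directory would be removed.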

> 
> > 
> > When SNC is enabled create extra directories and files in each mon_data
> > directory to report per-SNC node counts.
> 
> The above cryptic sentence is the closest the changelog gets to explaining
> what this patch aims to do. Could you please enhance the changelog to
> describe what this patch aims to do and more importantly how it goes about
> doing so? This patch contains a significant number of undocumented quirks 
> and between the cryptic changelog and undocumented quirks in the patch I find
> it very hard to understand what it is trying to do and why.
> 
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  include/linux/resctrl.h                   |   4 +
> >  arch/x86/kernel/cpu/resctrl/internal.h    |   5 +-
> >  arch/x86/kernel/cpu/resctrl/core.c        |   2 +
> >  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   1 +
> >  arch/x86/kernel/cpu/resctrl/monitor.c     |  52 +++++++---
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 115 +++++++++++++++++-----
> >  6 files changed, 137 insertions(+), 42 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 5c7775343c3e..2f8ac925bc18 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
> >  /**
> >   * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> >   * @hdr:		common header for different domain types
> > + * @display_id:		shared id used to identify domains to be summed for display
> >   * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
> >   * @mbm_total:		saved state for MBM total bandwidth
> >   * @mbm_local:		saved state for MBM local bandwidth
> > @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
> >   */
> >  struct rdt_mon_domain {
> >  	struct rdt_domain_hdr		hdr;
> > +	int				display_id;
> 
> (it is not clear to me why this is needed)

Described above. I will include that when I split this into its own
patch.

> >  	unsigned long			*rmid_busy_llc;
> >  	struct mbm_state		*mbm_total;
> >  	struct mbm_state		*mbm_local;
> > @@ -187,6 +189,7 @@ enum resctrl_scope {
> >   * @num_rmid:		Number of RMIDs available
> >   * @ctrl_scope:		Scope of this resource for control functions
> >   * @mon_scope:		Scope of this resource for monitor functions
> > + * @mon_display_scope:	Scope for user reporting monitor functions
> >   * @cache:		Cache allocation related data
> >   * @membw:		If the component has bandwidth controls, their properties.
> >   * @ctrl_domains:	RCU list of all control domains for this resource
> > @@ -207,6 +210,7 @@ struct rdt_resource {
> >  	int			num_rmid;
> >  	enum resctrl_scope	ctrl_scope;
> >  	enum resctrl_scope	mon_scope;
> > +	enum resctrl_scope	mon_display_scope;
> >  	struct resctrl_cache	cache;
> >  	struct resctrl_membw	membw;
> >  	struct list_head	ctrl_domains;
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 49440f194253..d41b388bb499 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -132,6 +132,7 @@ struct mon_evt {
> >   *                     as kernfs private data
> >   * @rid:               Resource id associated with the event file
> >   * @evtid:             Event id associated with the event file
> > + * @sum:               Sum across domains with same display_id
> >   * @domid:             The domain to which the event file belongs
> >   * @u:                 Name of the bit fields struct
> >   */
> > @@ -139,7 +140,8 @@ union mon_data_bits {
> >  	void *priv;
> >  	struct {
> >  		unsigned int rid		: 10;
> > -		enum resctrl_event_id evtid	: 8;
> > +		enum resctrl_event_id evtid	: 7;
> > +		unsigned int sum		: 1;
> >  		unsigned int domid		: 14;
> >  	} u;
> 
> (No explanation about why evtid had to shrink and why it is ok
> to do so.)

Will split this into its own patch and provide description of need
and safety.
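
As a sketch of the size argument only (not the promised description): the
four fields still add up to 10 + 7 + 1 + 14 = 32 bits, so the packed value
keeps fitting in the pointer it shares a union with. The struct below is a
userspace stand-in with evtid modelled as unsigned int, and
evtid_fits_sketch() is a hypothetical helper. It assumes every event id in
use is below 128.

```c
#include <limits.h>

/* Userspace stand-in for the bit-field half of union mon_data_bits. */
struct mon_bits_sketch {
	unsigned int rid   : 10;
	unsigned int evtid : 7;  /* was 8 bits before the patch */
	unsigned int sum   : 1;  /* new flag, takes the bit evtid gave up */
	unsigned int domid : 14;
};

/* The packed fields must not outgrow the void * they are stored in. */
_Static_assert(sizeof(struct mon_bits_sketch) * CHAR_BIT <=
	       sizeof(void *) * CHAR_BIT,
	       "mon_bits_sketch must fit in a pointer");

/* Returns 1 if evtid survives a round trip through the 7-bit field. */
static int evtid_fits_sketch(unsigned int evtid)
{
	struct mon_bits_sketch b = { 0 };

	b.evtid = evtid; /* values >= 128 are truncated modulo 128 */
	return b.evtid == evtid;
}
```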

> >  };
> > @@ -150,6 +152,7 @@ struct rmid_read {
> >  	struct rdt_mon_domain	*d;
> >  	enum resctrl_event_id	evtid;
> >  	bool			first;
> > +	bool			sumdomains;
> >  	int			err;
> >  	u64			val;
> >  	void			*arch_mon_ctx;
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index cb181796f73b..a949e69308cd 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -71,6 +71,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> >  			.name			= "L3",
> >  			.ctrl_scope		= RESCTRL_L3_CACHE,
> >  			.mon_scope		= RESCTRL_L3_CACHE,
> > +			.mon_display_scope	= RESCTRL_L3_CACHE,
> >  			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
> >  			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
> >  			.parse_ctrlval		= parse_cbm,
> > @@ -613,6 +614,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> >  
> >  	d = &hw_dom->d_resctrl;
> >  	d->hdr.id = id;
> > +	d->display_id = get_domain_id_from_scope(cpu, r->mon_display_scope);
> >  	d->hdr.type = RESCTRL_MON_DOMAIN;
> >  	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > index 3b9383612c35..a4ead8ffbaf3 100644
> > --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > @@ -575,6 +575,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> >  	resid = md.u.rid;
> >  	domid = md.u.domid;
> >  	evtid = md.u.evtid;
> > +	rr.sumdomains = md.u.sum;
> >  
> >  	r = &rdt_resources_all[resid].r_resctrl;
> >  	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index d0bbeb410750..2e795b261b6f 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -16,6 +16,7 @@
> >   */
> >  
> >  #include <linux/cpu.h>
> > +#include <linux/cacheinfo.h>
> 
> Can this be alphabetical?

Sure. Will fix.

> >  #include <linux/module.h>
> >  #include <linux/sizes.h>
> >  #include <linux/slab.h>
> > @@ -187,18 +188,8 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> >  
> >  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  {
> > -	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > -	int cpu = smp_processor_id();
> > -	int rmid_offset = 0;
> >  	u64 msr_val;
> >  
> > -	/*
> > -	 * When SNC mode is on, need to compute the offset to read the
> > -	 * physical RMID counter for the node to which this CPU belongs.
> > -	 */
> > -	if (snc_nodes_per_l3_cache > 1)
> > -		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > -
> 
> This removes code that was just added in previous patch. Can the end goal
> be reached without this churn? I expect doing so will make this patch easier to
> follow.

Oops, yes. I will delete this from patch 6 to avoid churn.

> >  	/*
> >  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> >  	 * with a valid event code for supported resource type and the bits
> > @@ -207,7 +198,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> >  	 * are error bits.
> >  	 */
> > -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> > +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> >  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
> >  
> >  	if (msr_val & RMID_VAL_ERROR)
> > @@ -291,7 +282,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> >  
> >  	resctrl_arch_rmid_read_context_check();
> >  
> > -	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> > +	if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
> >  		return -EINVAL;
> 
> Does this mean that when SNC is enabled then reading data for an event within a particular
> monitor domain ("node scope") can read its data from any CPU within the L3 domain
> ("mon_display_scope") even if that CPU is not associated with the node for which it
> is reading the data?

Yes.

> If so this really turns many resctrl assumptions and architecture on its head since the
> resctrl expectation is that only CPUs within a domain's cpumask can be used to interact
> with the domain. This in turn makes this seemingly general feature actually SNC specific.

This is only an expectation for x86 features using IA32_QM_EVTSEL/IA32_QM_CTR
MSR method to read counters. ARM doesn't have the "CPU must be in
domain" restriction (as far as I can tell). Nor does the Intel IO RDT
(which uses MMIO space for control registers, these can be read/written
from any CPU).

We do know that those two MSRs can be read from any CPU that shares an
L3 cache. It would seem to be pointless overhead to force a cross
processor interrupt to read them from a different CPU just to satisfy
a "must be in same domain" non-requirement. I'll split this into its
own patch with suitable description.

> >  	ret = __rmid_read(rmid, eventid, &msr_val);
> > @@ -556,7 +547,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> >  	}
> >  }
> >  
> > -static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> > +static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr, u64 *rrval)
> >  {
> >  	struct mbm_state *m;
> >  	u64 tval = 0;
> > @@ -574,11 +565,44 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> >  	if (rr->err)
> >  		return rr->err;
> >  
> > -	rr->val += tval;
> > +	*rrval += tval;
> >  
> 
> Why is rrval needed?

I don't think it is anymore. I think I wanted it while I was developing
this set of changes. But I will drop it.

> >  	return 0;
> >  }
> >  
> > +static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
> > +{
> > +	int cpu = cpumask_any(&d->hdr.cpu_mask);
> > +
> > +	return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > +}
> > +
> > +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> > +{
> > +	struct rdt_mon_domain *d;
> > +	struct rmid_read tmp;
> > +	u32 node_rmid;
> > +	int ret = 0;
> > +
> > +	if (!rr->sumdomains) {
> > +		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
> > +		return ___mon_event_count(closid, node_rmid, rr, &rr->val);
> > +	}
> > +
> > +	tmp = *rr;
> > +	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> > +		if (d->display_id == rr->d->display_id) {
> > +			tmp.d = d;
> > +			node_rmid = get_node_rmid(rr->r, d, rmid);
> > +			ret = ___mon_event_count(closid, node_rmid, &tmp, &rr->val);
> 
> If I understand correctly this function is run per IPI on a CPU associated
> with one of the monitor domains (depends on which one came online first),
> and then it will read the monitor data of the other domains from the same
> CPU? This is unexpected since the expectation is that monitor data
> needs to be read from a CPU associated with the domain it is
> reading data for.

See earlier note. The counter can be read from any CPU sharing the same
L3. Adding unnecessary IPI is pointless overhead. But I will add
comments.
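
A userspace sketch of the summing path discussed here, under stated
assumptions. node_rmid_sketch() repeats the offset arithmetic of
get_node_rmid() from the quoted patch; everything else (struct
snc_dom_sketch, fake_read_ctr(), sum_l3_sketch()) is hypothetical, and
fake_read_ctr() merely stands in for the IA32_QM_EVTSEL/IA32_QM_CTR access.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a monitor domain: one per SNC node. */
struct snc_dom_sketch {
	int node;       /* NUMA node of the CPUs in this domain */
	int display_id; /* id of the L3 cache the node belongs to */
};

/* Same arithmetic as get_node_rmid() in the quoted patch. */
static uint32_t node_rmid_sketch(uint32_t rmid, int node,
				 int snc_nodes_per_l3, uint32_t num_rmid)
{
	return rmid + (uint32_t)(node % snc_nodes_per_l3) * num_rmid;
}

/* Fake counter source; a real reader would go through the MSRs. */
static uint64_t fake_read_ctr(uint32_t phys_rmid)
{
	return 1000 + phys_rmid;
}

/* Sum one RMID's count over every domain behind the given L3 cache. */
static uint64_t sum_l3_sketch(const struct snc_dom_sketch *doms, size_t n,
			      int display_id, uint32_t rmid,
			      int snc_nodes_per_l3, uint32_t num_rmid)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (doms[i].display_id != display_id)
			continue;
		total += fake_read_ctr(node_rmid_sketch(rmid, doms[i].node,
							snc_nodes_per_l3,
							num_rmid));
	}

	return total;
}
```

All reads in the loop happen on whichever CPU runs it, which matches the
point above that any CPU sharing the L3 can read the counters.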

> Also, providing tmp as well as rr->val seems unnecessary?

I think I was unsure about modifying the domain field in the struct
rmid_read in the middle of the call chain. But the original caller
mon_event_read() doesn't look at rr->d after the smp_call*()
function returns. I will drop "tmp".

> > +			if (ret)
> > +				break;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  /*
> >   * mbm_bw_count() - Update bw count from values previously read by
> >   *		    __mon_event_count().
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 0923492a8bd0..a56ae08ca255 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -3011,57 +3011,118 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> >   * and monitor groups with given domain id.
> >   */
> >  static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > -					   unsigned int dom_id)
> > +					   struct rdt_mon_domain *d)
> >  {
> >  	struct rdtgroup *prgrp, *crgrp;
> > +	struct rdt_mon_domain *dom;
> > +	bool remove_all = true;
> > +	struct kernfs_node *kn;
> > +	char subname[32];
> >  	char name[32];
> >  
> > +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> > +	if (r->mon_scope != r->mon_display_scope) {
> > +		int count = 0;
> > +
> > +		list_for_each_entry(dom, &r->mon_domains, hdr.list)
> > +			if (d->display_id == dom->display_id)
> > +				count++;
> > +		if (count > 1) {
> > +			remove_all = false;
> > +			sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > +		}
> > +	}
> 
> 
> This seems awkward. I wonder if it may not be simpler to just
> remove the directory and on completion check if the parent has
> any subdirectories left and remove the parent if there are no
> subdirectories remaining. Something possible via reading the inode's
> i_nlink that is accessible via kernfs_get_inode(). What do you think?

kernfs_get_inode() needs a pointer to the "struct super_block" for the
filesystem. Resctrl filesystem code doesn't seem to keep track of that
anywhere. Only mentioned in rdt_kill_sb() where core kernfs code passes
it in as the argument. When registering/mounting the resctrl filesystem
there's a "struct fs_context *fc" ... is there a function to get the
super block from that? Even if there is, I'd need to add a global to
save a copy of the fs_context.

> > +
> >  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> > -		sprintf(name, "mon_%s_%02d", r->name, dom_id);
> > -		kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
> > +		if (remove_all) {
> > +			kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
> > +		} else {
> > +			kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);
> > +			if (kn)
> > +				kernfs_remove_by_name(kn, subname);
> > +		}
> >  
> > -		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> > -			kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
> > +		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
> > +			if (remove_all) {
> > +				kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
> > +			} else {
> > +				kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);
> > +				if (kn)
> > +					kernfs_remove_by_name(kn, subname);
> > +			}
> > +		}
> >  	}
> >  }
> >  
> > -static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > -				struct rdt_mon_domain *d,
> > -				struct rdt_resource *r, struct rdtgroup *prgrp)
> > +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> > +			     struct rdt_resource *r, struct rdtgroup *prgrp,
> > +			     bool do_sum)
> >  {
> >  	union mon_data_bits priv;
> > -	struct kernfs_node *kn;
> >  	struct mon_evt *mevt;
> >  	struct rmid_read rr;
> > -	char name[32];
> >  	int ret;
> >  
> > -	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> > -	/* create the directory */
> > -	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> > -	if (IS_ERR(kn))
> > -		return PTR_ERR(kn);
> > -
> > -	ret = rdtgroup_kn_set_ugid(kn);
> > -	if (ret)
> > -		goto out_destroy;
> > -
> > -	if (WARN_ON(list_empty(&r->evt_list))) {
> > -		ret = -EPERM;
> > -		goto out_destroy;
> > -	}
> > +	if (WARN_ON(list_empty(&r->evt_list)))
> > +		return -EPERM;
> >  
> >  	priv.u.rid = r->rid;
> >  	priv.u.domid = d->hdr.id;
> > +	priv.u.sum = do_sum;
> >  	list_for_each_entry(mevt, &r->evt_list, list) {
> >  		priv.u.evtid = mevt->evtid;
> >  		ret = mon_addfile(kn, mevt->name, priv.priv);
> >  		if (ret)
> > -			goto out_destroy;
> > +			return ret;
> >  
> >  		if (is_mbm_event(mevt->evtid))
> >  			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
> 
> I do not think that the "do_sum" file should be doing any initialization, this
> will be repeated for the "real" mon domain, no?

Good point. I'll drop from the "sum" files and just run it for the
"real" ones.

> >  	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > +				struct rdt_mon_domain *d,
> > +				struct rdt_resource *r, struct rdtgroup *prgrp)
> > +{
> > +	struct kernfs_node *kn, *ckn;
> > +	char name[32];
> > +	bool do_sum;
> > +	int ret;
> > +
> > +	do_sum = r->mon_scope != r->mon_display_scope;
> > +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> > +	kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> > +	if (!kn) {
> > +		/* create the directory */
> > +		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> > +		if (IS_ERR(kn))
> > +			return PTR_ERR(kn);
> > +
> > +		ret = rdtgroup_kn_set_ugid(kn);
> > +		if (ret)
> > +			goto out_destroy;
> > +		ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
> 
> This does not look right. If I understand correctly the private data
> of these event files will have whichever mon domain came up first as
> its domain id. That seems completely arbitrary and does not reflect
> accurate state for this file. Since "do_sum" is essentially a "flag"
> on how this file can be treated, can its "dom_id" not rather be
> the "monitor scope domain id"? Could that not help to eliminate 
> that per-domain "display_id"?

You are correct that this should be the "monitor scope domain id" rather
than the first SNC domain that appears. I'll change to use that. I don't
think it helps in removing the per-domain display_id.

> > +		if (ret)
> > +			goto out_destroy;
> > +	}
> > +
> > +	if (do_sum) {
> > +		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > +		ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> > +		if (IS_ERR(ckn))
> > +			goto out_destroy;
> > +
> > +		ret = rdtgroup_kn_set_ugid(ckn);
> > +		if (ret)
> > +			goto out_destroy;
> > +
> > +		ret = mon_add_all_files(ckn, d, r, prgrp, false);
> > +		if (ret)
> > +			goto out_destroy;
> > +	}
> > +
> >  	kernfs_activate(kn);
> >  	return 0;
> >  
> > @@ -3077,8 +3138,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> >  static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> >  					   struct rdt_mon_domain *d)
> >  {
> > -	struct kernfs_node *parent_kn;
> >  	struct rdtgroup *prgrp, *crgrp;
> > +	struct kernfs_node *parent_kn;
> >  	struct list_head *head;
> >  
> >  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> > @@ -3950,7 +4011,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> >  	 * per domain monitor data directories.
> >  	 */
> >  	if (resctrl_mounted && resctrl_arch_mon_capable())
> > -		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
> > +		rmdir_mondata_subdir_allrdtgrp(r, d);
> >  
> >  	if (is_mbm_enabled())
> >  		cancel_delayed_work(&d->mbm_over);
> 
> Reinette

-Tony


* Re: [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-05-10 21:24   ` Reinette Chatre
@ 2024-05-13 17:17     ` Tony Luck
  2024-05-13 18:53       ` Reinette Chatre
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-13 17:17 UTC
  To: Reinette Chatre
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

On Fri, May 10, 2024 at 02:24:49PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/3/2024 1:33 PM, Tony Luck wrote:
> > There isn't a simple hardware bit that indicates whether a CPU is
> > running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> > the ratio of NUMA nodes to L3 cache instances.
> > 
> > When SNC mode is detected, reconfigure the RMID counters by updating
> > the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> > 
> > Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> > on the second SNC node to start from zero.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/include/asm/msr-index.h   |   1 +
> >  arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
> >  2 files changed, 120 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index e72c2b872957..ce54a1ffe1e5 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -1165,6 +1165,7 @@
> >  #define MSR_IA32_QM_CTR			0xc8e
> >  #define MSR_IA32_PQR_ASSOC		0xc8f
> >  #define MSR_IA32_L3_CBM_BASE		0xc90
> > +#define MSR_RMID_SNC_CONFIG		0xca0
> >  #define MSR_IA32_L2_CBM_BASE		0xd10
> >  #define MSR_IA32_MBA_THRTL_BASE		0xd50
> >  
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index a949e69308cd..6a1727ea1dfe 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -21,7 +21,9 @@
> >  #include <linux/err.h>
> >  #include <linux/cacheinfo.h>
> >  #include <linux/cpuhotplug.h>
> > +#include <linux/mod_devicetable.h>
> >  
> > +#include <asm/cpu_device_id.h>
> >  #include <asm/intel-family.h>
> >  #include <asm/resctrl.h>
> >  #include "internal.h"
> > @@ -746,11 +748,42 @@ static void clear_closid_rmid(int cpu)
> >  	      RESCTRL_RESERVED_CLOSID);
> >  }
> >  
> > +/*
> > + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> > + * which indicates that RMIDs are configured in legacy mode.
> > + * This mode is incompatible with Linux resctrl semantics
> > + * as RMIDs are partitioned between SNC nodes, which requires
> > + * a user to know which RMID is allocated to a task.
> > + * Clearing bit 0 reconfigures the RMID counters for use
> > + * in Sub NUMA Cluster mode. This mode is better for Linux.
> > + * The RMID space is divided between all SNC nodes with the
> > + * RMIDs renumbered to start from zero in each node when
> > + * couning operations from tasks. Code to read the counters
> > + * must adjust RMID counter numbers based on SNC node. See
> > + * __rmid_read() for code that does this.
> > + */
> > +static void snc_remap_rmids(int cpu)
> > +{
> > +	u64 val;
> > +
> > +	/* Only need to enable once per package. */
> > +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> > +		return;
> > +
> > +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> > +	val &= ~BIT_ULL(0);
> > +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> > +}
> > +
> >  static int resctrl_arch_online_cpu(unsigned int cpu)
> >  {
> >  	struct rdt_resource *r;
> >  
> >  	mutex_lock(&domain_list_lock);
> > +
> > +	if (snc_nodes_per_l3_cache > 1)
> > +		snc_remap_rmids(cpu);
> > +
> >  	for_each_capable_rdt_resource(r)
> >  		domain_add_cpu(cpu, r);
> >  	mutex_unlock(&domain_list_lock);
> > @@ -990,11 +1023,97 @@ static __init bool get_rdt_resources(void)
> >  	return (rdt_mon_capable || rdt_alloc_capable);
> >  }
> >  
> > +/* CPU models that support MSR_RMID_SNC_CONFIG */
> > +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> > +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> > +	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
> > +	{}
> > +};
> > +
> > +/*
> > + * There isn't a simple hardware bit that indicates whether a CPU is running
> > + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> > + * ratio of NUMA nodes to L3 cache instances.
> > + * It is not possible to accurately determine SNC state if the system is
> > + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> > + * to L3 caches. It will be OK if system is booted with hyperthreading
> > + * disabled (since this doesn't affect the ratio).
> > + */
> > +static __init int snc_get_config(void)
> > +{
> > +	unsigned long *node_caches;
> > +	int mem_only_nodes = 0;
> > +	int cpu, node, ret;
> > +	int num_l3_caches;
> > +	int cache_id;
> > +
> > +	if (!x86_match_cpu(snc_cpu_ids))
> > +		return 1;
> > +
> > +	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
> > +	if (!node_caches)
> > +		return 1;
> > +
> > +	cpus_read_lock();
> > +
> > +	if (num_online_cpus() != num_present_cpus())
> > +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> > +
> > +	for_each_node(node) {
> > +		cpu = cpumask_first(cpumask_of_node(node));
> > +		if (cpu < nr_cpu_ids) {
> > +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> > +			if (cache_id != -1)
> > +				set_bit(cache_id, node_caches);
> > +		} else {
> > +			mem_only_nodes++;
> > +		}
> > +	}
> > +	cpus_read_unlock();
> > +
> > +	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
> > +	kfree(node_caches);
> > +
> > +	if (!num_l3_caches)
> > +		goto insane;
> > +
> > +	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
> > +	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
> > +		goto insane;
> > +
> > +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> > +
> > +	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
> > +	switch (ret) {
> > +	case 1:
> > +		break;
> > +	case 2:
> > +	case 3:
> > +	case 4:
> > +		pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
> > +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> > +		break;
> > +	default:
> > +		goto insane;
> > +	}
> > +
> > +	return ret;
> > +insane:
> > +	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
> > +		(nr_node_ids - mem_only_nodes), num_l3_caches);
> > +	return 1;
> > +}
> 
> I find it confusing how dramatically this SNC detection code changed without
> any explanations. This detection seems to match the SNC detection code from v16 but
> after v16 you posted a new SNC detection implementation that did SNC detection totally
> differently [1] from v16. Instead of keeping with the "new" detection this implements
> what was in v16. Could you please help me understand what motivated the different
> implementations and why the big differences?

Reinette,

Do you like the detection code in that version? You didn't make any
comments about it.

I switched back to the v16 code because that had survived review before
and I just wanted to make the modifications to add both per-L3 and
per-SNC node monitoring files.

I can pull that into the next iteration if you want.

-Tony
> 
> [1] https://lore.kernel.org/lkml/20240327200352.236835-11-tony.luck@intel.com/
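
For readers following the detection discussion, the ratio heuristic in the
quoted snc_get_config() can be modelled in userspace. This is only a minimal
sketch under stated assumptions: struct node_info, MAX_L3_IDS and
snc_nodes_per_l3() are hypothetical names, not kernel symbols, and an l3_id
of -1 stands in for a memory-only node. It is not the code from either
version of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_L3_IDS 64	/* hypothetical upper bound on L3 cache ids */

/*
 * Hypothetical model of a NUMA node: l3_id is the id of the L3 cache used
 * by the node's first CPU, or -1 if the node has memory but no CPUs.
 */
struct node_info {
	int l3_id;
};

/*
 * Infer the number of SNC nodes per L3 cache from the ratio of CPU-bearing
 * NUMA nodes to distinct L3 caches. Returns 1 (SNC off) when the topology
 * fails either sanity check, mirroring the "insane" path in the patch.
 */
static int snc_nodes_per_l3(const struct node_info *nodes, size_t n)
{
	bool seen[MAX_L3_IDS] = { false };
	int cpu_nodes = 0, num_l3 = 0, ratio;
	size_t i;

	for (i = 0; i < n; i++) {
		int id = nodes[i].l3_id;

		if (id < 0)
			continue;	/* memory-only node: not counted */
		assert(id < MAX_L3_IDS);
		cpu_nodes++;
		if (!seen[id]) {
			seen[id] = true;
			num_l3++;
		}
	}

	/* sanity check #1: CPU nodes must be a multiple of the L3 count */
	if (!num_l3 || cpu_nodes % num_l3)
		return 1;

	/* sanity check #2: only 1, 2, 3 and 4 are valid results */
	ratio = cpu_nodes / num_l3;
	return (ratio >= 1 && ratio <= 4) ? ratio : 1;
}
```

With four CPU nodes spread over two L3 caches the sketch reports 2; with
three CPU nodes over two caches it falls back to 1, as the sanity checks in
the patch would.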

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-13 17:05     ` Tony Luck
@ 2024-05-13 18:53       ` Reinette Chatre
  2024-05-14  0:21         ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-13 18:53 UTC (permalink / raw
  To: Tony Luck
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

Hi Tony,

On 5/13/2024 10:05 AM, Tony Luck wrote:
> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
>> Hi Tony,
> 
> Hi Reinette,
> 
> Thanks for the review. Detailed comments below. But overall I'm
> going to split patch 7 into a bunch of smaller changes, each with
> a better commit message.
> 
>> On 5/3/2024 1:33 PM, Tony Luck wrote:
>>
>> (Could you please start the changelog with some context?)
>>
>>> Add a field to the rdt_resource structure to track whether monitoring
>>> resources are tracked by hardware at a different scope (NODE) from
>>> the legacy L3 scope.
>>
>> This seems to describe @mon_scope that was introduced in patch #3?
> 
> Not really. Patch #3 made the change so that control and monitor
> functions can have different scope. That's still needed as with SNC
> enabled the underlying data collection is at the node level for
> monitoring, while control stays at the L3 cache scope.
> 
> This new field describes the legacy scope of monitoring, so that
> resctrl can provide correctly scoped monitor files for legacy
> applications that aren't aware of SNC. So I'm using this both
> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
> or disabled (when they are the same).

This seems to enforce the idea that these new additions aim to be
generic on the surface but the only goal is to support SNC.

> 
>>>
>>> Add a field to the rdt_mon_domain structure to track the L3 cache id
>>> which can be used to find all the domains that need resource counts
>>> summed to provide accurate values in the legacy monitoring files.
>>
>> Why is this field necessary? Can this not be obtained dynamically?
> 
> I could compute it each time I need it (when making/removing
> directories, or finding which SNC domains share an L3 domain).
> 
> 	id = get_domain_id_from_scope(cpumask_any(&d->cpu_mask), r->mon_display_scope);
> 	if (id < 0)
> 		// error path
> 
> But it seemed better to just discover this once at domain creation time.

This may be more clear in the next version?

...

>>>  	/*
>>>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>>>  	 * with a valid event code for supported resource type and the bits
>>> @@ -207,7 +198,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>>>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>>>  	 * are error bits.
>>>  	 */
>>> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
>>> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>>>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>>>  
>>>  	if (msr_val & RMID_VAL_ERROR)
>>> @@ -291,7 +282,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>  
>>>  	resctrl_arch_rmid_read_context_check();
>>>  
>>> -	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>>> +	if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
>>>  		return -EINVAL;
>>
>> Does this mean that when SNC is enabled then reading data for an event within a particular
>> monitor domain ("node scope") can read its data from any CPU within the L3 domain
>> ("mon_display_scope") even if that CPU is not associated with the node for which it
>> is reading the data?
> 
> Yes.
> 
>> If so this really turns many resctrl assumptions and architecture on its head since the
>> resctrl expectation is that only CPUs within a domain's cpumask can be used to interact
>> with the domain. This in turn makes this seemingly general feature actually SNC specific.
> 
> This is only an expectation for x86 features using IA32_QM_EVTSEL/IA32_QM_CTR
> MSR method to read counters. ARM doesn't have the "CPU must be in
> domain" restriction (as far as I can tell). Nor does the Intel IO RDT
> (which uses MMIO space for control registers, these can be read/written
> from any CPU).
> 
> We do know that those two MSRs can be read from any CPU that shares an
> L3 cache. It would seem to be pointless overhead to force a cross
> processor interrupt to read them from a different CPU just to satisfy
> a "must be in same domain" non-requirement. I'll split this into its
> own patch with suitable description.

I did not suggest that this should be done with multiple IPIs. My comment
was related to this addition that claims to be generic but really just focuses
on support for SNC. Any future addition that may want to build on this would
need to be aware of these expectations, which are not obvious at this time.

...

 
>>>  	return 0;
>>>  }
>>>  
>>> +static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
>>> +{
>>> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>>> +
>>> +	return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
>>> +}
>>> +
>>> +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>>> +{
>>> +	struct rdt_mon_domain *d;
>>> +	struct rmid_read tmp;
>>> +	u32 node_rmid;
>>> +	int ret = 0;
>>> +
>>> +	if (!rr->sumdomains) {
>>> +		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
>>> +		return ___mon_event_count(closid, node_rmid, rr, &rr->val);
>>> +	}
>>> +
>>> +	tmp = *rr;
>>> +	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
>>> +		if (d->display_id == rr->d->display_id) {
>>> +			tmp.d = d;
>>> +			node_rmid = get_node_rmid(rr->r, d, rmid);
>>> +			ret = ___mon_event_count(closid, node_rmid, &tmp, &rr->val);
>>
>> If I understand correctly this function is run per IPI on a CPU associated
>> with one of the monitor domains (depends on which one came online first),
>> and then it will read the monitor data of the other domains from the same
>> CPU? This is unexpected since the expectation is that monitor data
>> needs to be read from a CPU associated with the domain it is
>> reading data for.
> 
> See earlier note. The counter can be read from any CPU sharing the same
> L3. Adding unnecessary IPI is pointless overhead. But I will add
> comments.

I did not suggest to add extra IPIs, my comment was related to how this
feature wedges itself into resctrl.

> 
>> Also, providing tmp as well as rr->val seems unnecessary?
> 
> I think I was unsure about modifying the domain field in the struct
> rmid_read in the middle of the call chain. But the original caller
> mon_event_read() doesn't look at rr->domain after the smp_call*()
> function returns. I will drop "tmp".
> 
>>> +			if (ret)
>>> +				break;
>>> +		}
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  /*
>>>   * mbm_bw_count() - Update bw count from values previously read by
>>>   *		    __mon_event_count().
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 0923492a8bd0..a56ae08ca255 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -3011,57 +3011,118 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>>>   * and monitor groups with given domain id.
>>>   */
>>>  static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>>> -					   unsigned int dom_id)
>>> +					   struct rdt_mon_domain *d)
>>>  {
>>>  	struct rdtgroup *prgrp, *crgrp;
>>> +	struct rdt_mon_domain *dom;
>>> +	bool remove_all = true;
>>> +	struct kernfs_node *kn;
>>> +	char subname[32];
>>>  	char name[32];
>>>  
>>> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
>>> +	if (r->mon_scope != r->mon_display_scope) {
>>> +		int count = 0;
>>> +
>>> +		list_for_each_entry(dom, &r->mon_domains, hdr.list)
>>> +			if (d->display_id == dom->display_id)
>>> +				count++;
>>> +		if (count > 1) {
>>> +			remove_all = false;
>>> +			sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
>>> +		}
>>> +	}
>>
>>
>> This seems awkward. I wonder if it may not be simpler to just
>> remove the directory and on completion check if the parent has
>> any subdirectories left and remove the parent if there are no
>> subdirectories remaining. Something possible via reading the inode's
>> i_nlink that is accessible via kernfs_get_inode(). What do you think?
> 
> kernfs_get_inode() needs a pointer to the "struct super_block" for the
> filesystem. Resctrl filesystem code doesn't seem to keep track of that
> anywhere. Only mentioned in rdt_kill_sb() where core kernfs code passes
> it in as the argument. When registering/mounting the resctrl filesystem
> there's a "struct fs_context *fc" ... is there a function to get the
> super block from that? Even if there is, I'd need to add a global to
> save a copy of the fc_context.

hmmm ... I expected that struct file or struct dentry may be reachable
from where sb can be obtained but I can only see that now for the
paths that provide struct kernfs_open_file.


...

>
>>>  	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>> +				struct rdt_mon_domain *d,
>>> +				struct rdt_resource *r, struct rdtgroup *prgrp)
>>> +{
>>> +	struct kernfs_node *kn, *ckn;
>>> +	char name[32];
>>> +	bool do_sum;
>>> +	int ret;
>>> +
>>> +	do_sum = r->mon_scope != r->mon_display_scope;
>>> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
>>> +	kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
>>> +	if (!kn) {
>>> +		/* create the directory */
>>> +		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>>> +		if (IS_ERR(kn))
>>> +			return PTR_ERR(kn);
>>> +
>>> +		ret = rdtgroup_kn_set_ugid(kn);
>>> +		if (ret)
>>> +			goto out_destroy;
>>> +		ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
>>
>> This does not look right. If I understand correctly the private data
>> of these event files will have whichever mon domain came up first as
>> its domain id. That seems completely arbitrary and does not reflect
>> accurate state for this file. Since "do_sum" is essentially a "flag"
>> on how this file can be treated, can its "dom_id" not rather be
>> the "monitor scope domain id"? Could that not help to eliminate 
>> that per-domain "display_id"?
> 
> You are correct that this should be the "monitor scope domain id" rather
> than the first SNC domain that appears. I'll change to use that. I don't
> think it helps in removing the per-domain display_id.

Wouldn't the file metadata then be the "display_id"?

Reinette
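
The RMID adjustment done by get_node_rmid() in the quoted patch is plain
arithmetic and may be easier to follow with numbers. A minimal userspace
sketch, assuming num_rmid is the number of RMIDs available per SNC node;
node_rmid() and its parameters are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a logical RMID to the hardware counter index for a given NUMA node.
 * Each SNC node within an L3 cache owns a contiguous block of num_rmid
 * counters; (node % nodes_per_l3) selects the block.
 */
static uint32_t node_rmid(uint32_t rmid, int node, int nodes_per_l3,
			  uint32_t num_rmid)
{
	assert(nodes_per_l3 > 0);
	assert(node >= 0);

	return rmid + (uint32_t)(node % nodes_per_l3) * num_rmid;
}
```

For example, with two SNC nodes per L3 and 128 RMIDs per node, RMID 5 on
node 3 maps to 5 + (3 % 2) * 128 = 133, while the same RMID on node 2 stays
at 5.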

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-05-13 17:17     ` Tony Luck
@ 2024-05-13 18:53       ` Reinette Chatre
  2024-05-14  0:28         ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-13 18:53 UTC (permalink / raw
  To: Tony Luck
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

Hi Tony,

On 5/13/2024 10:17 AM, Tony Luck wrote:
> On Fri, May 10, 2024 at 02:24:49PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/3/2024 1:33 PM, Tony Luck wrote:
>>> There isn't a simple hardware bit that indicates whether a CPU is
>>> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
>>> the ratio of NUMA nodes to L3 cache instances.
>>>
>>> When SNC mode is detected, reconfigure the RMID counters by updating
>>> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
>>>
>>> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
>>> on the second SNC node to start from zero.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>>  arch/x86/include/asm/msr-index.h   |   1 +
>>>  arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
>>>  2 files changed, 120 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
>>> index e72c2b872957..ce54a1ffe1e5 100644
>>> --- a/arch/x86/include/asm/msr-index.h
>>> +++ b/arch/x86/include/asm/msr-index.h
>>> @@ -1165,6 +1165,7 @@
>>>  #define MSR_IA32_QM_CTR			0xc8e
>>>  #define MSR_IA32_PQR_ASSOC		0xc8f
>>>  #define MSR_IA32_L3_CBM_BASE		0xc90
>>> +#define MSR_RMID_SNC_CONFIG		0xca0
>>>  #define MSR_IA32_L2_CBM_BASE		0xd10
>>>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
>>>  
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index a949e69308cd..6a1727ea1dfe 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -21,7 +21,9 @@
>>>  #include <linux/err.h>
>>>  #include <linux/cacheinfo.h>
>>>  #include <linux/cpuhotplug.h>
>>> +#include <linux/mod_devicetable.h>
>>>  
>>> +#include <asm/cpu_device_id.h>
>>>  #include <asm/intel-family.h>
>>>  #include <asm/resctrl.h>
>>>  #include "internal.h"
>>> @@ -746,11 +748,42 @@ static void clear_closid_rmid(int cpu)
>>>  	      RESCTRL_RESERVED_CLOSID);
>>>  }
>>>  
>>> +/*
>>> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
>>> + * which indicates that RMIDs are configured in legacy mode.
>>> + * This mode is incompatible with Linux resctrl semantics
>>> + * as RMIDs are partitioned between SNC nodes, which requires
>>> + * a user to know which RMID is allocated to a task.
>>> + * Clearing bit 0 reconfigures the RMID counters for use
>>> + * in Sub NUMA Cluster mode. This mode is better for Linux.
>>> + * The RMID space is divided between all SNC nodes with the
>>> + * RMIDs renumbered to start from zero in each node when
>>> + * counting operations from tasks. Code to read the counters
>>> + * must adjust RMID counter numbers based on SNC node. See
>>> + * __rmid_read() for code that does this.
>>> + */
>>> +static void snc_remap_rmids(int cpu)
>>> +{
>>> +	u64 val;
>>> +
>>> +	/* Only need to enable once per package. */
>>> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
>>> +		return;
>>> +
>>> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
>>> +	val &= ~BIT_ULL(0);
>>> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
>>> +}
>>> +
>>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>>  {
>>>  	struct rdt_resource *r;
>>>  
>>>  	mutex_lock(&domain_list_lock);
>>> +
>>> +	if (snc_nodes_per_l3_cache > 1)
>>> +		snc_remap_rmids(cpu);
>>> +
>>>  	for_each_capable_rdt_resource(r)
>>>  		domain_add_cpu(cpu, r);
>>>  	mutex_unlock(&domain_list_lock);
>>> @@ -990,11 +1023,97 @@ static __init bool get_rdt_resources(void)
>>>  	return (rdt_mon_capable || rdt_alloc_capable);
>>>  }
>>>  
>>> +/* CPU models that support MSR_RMID_SNC_CONFIG */
>>> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
>>> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
>>> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
>>> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
>>> +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
>>> +	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
>>> +	{}
>>> +};
>>> +
>>> +/*
>>> + * There isn't a simple hardware bit that indicates whether a CPU is running
>>> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
>>> + * ratio of NUMA nodes to L3 cache instances.
>>> + * It is not possible to accurately determine SNC state if the system is
>>> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
>>> + * to L3 caches. It will be OK if system is booted with hyperthreading
>>> + * disabled (since this doesn't affect the ratio).
>>> + */
>>> +static __init int snc_get_config(void)
>>> +{
>>> +	unsigned long *node_caches;
>>> +	int mem_only_nodes = 0;
>>> +	int cpu, node, ret;
>>> +	int num_l3_caches;
>>> +	int cache_id;
>>> +
>>> +	if (!x86_match_cpu(snc_cpu_ids))
>>> +		return 1;
>>> +
>>> +	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
>>> +	if (!node_caches)
>>> +		return 1;
>>> +
>>> +	cpus_read_lock();
>>> +
>>> +	if (num_online_cpus() != num_present_cpus())
>>> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
>>> +
>>> +	for_each_node(node) {
>>> +		cpu = cpumask_first(cpumask_of_node(node));
>>> +		if (cpu < nr_cpu_ids) {
>>> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
>>> +			if (cache_id != -1)
>>> +				set_bit(cache_id, node_caches);
>>> +		} else {
>>> +			mem_only_nodes++;
>>> +		}
>>> +	}
>>> +	cpus_read_unlock();
>>> +
>>> +	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
>>> +	kfree(node_caches);
>>> +
>>> +	if (!num_l3_caches)
>>> +		goto insane;
>>> +
>>> +	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
>>> +	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
>>> +		goto insane;
>>> +
>>> +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
>>> +
>>> +	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
>>> +	switch (ret) {
>>> +	case 1:
>>> +		break;
>>> +	case 2:
>>> +	case 3:
>>> +	case 4:
>>> +		pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
>>> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
>>> +		break;
>>> +	default:
>>> +		goto insane;
>>> +	}
>>> +
>>> +	return ret;
>>> +insane:
>>> +	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
>>> +		(nr_node_ids - mem_only_nodes), num_l3_caches);
>>> +	return 1;
>>> +}
>>
>> I find it confusing how dramatically this SNC detection code changed without
>> any explanations. This detection seems to match the SNC detection code from v16 but
>> after v16 you posted a new SNC detection implementation that did SNC detection totally
>> differently [1] from v16. Instead of keeping with the "new" detection this implements
>> what was in v16. Could you please help me understand what motivated the different
>> implementations and why the big differences?
> 
> Reinette,
> 
> Do you like the detection code in that version? You didn't make any
> comments about it.

It was a drop-in replacement for a portion that was not relevant to the
architecture discussion that I focused on ... hence my surprise that it
just came and went without any comment.

> I switched back to the v16 code because that had survived review before
> and I just wanted to make the modifications to add both per-L3 and
> per-SNC node monitoring files.
> 
> I can pull that into the next iteration if you want.

It is not clear to me why you switched back and forth between the detection
algorithms. I expect big changes to be accompanied by an explanation of what changed,
why one is better than the other, or if they are considered "similar", what
are the pros/cons. Am I missing something so obvious that causes you to think
the work does not need the explanation I asked your help with?

Reinette
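
As a side note on the MSR handling quoted above, the read-modify-write in
snc_remap_rmids() can be modelled without touching hardware. This is a
sketch under assumptions: struct fake_package, SNC_CONFIG_LEGACY and
snc_remap_rmids_model() are hypothetical stand-ins, and the "first CPU in
the package" test is passed in as a parameter rather than derived from
topology masks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 set means legacy RMID numbering, per the comment in the patch. */
#define SNC_CONFIG_LEGACY	(UINT64_C(1) << 0)

/* Hypothetical stand-in for one package's MSR_RMID_SNC_CONFIG value. */
struct fake_package {
	uint64_t snc_config;
};

/*
 * Clear bit 0 once per package: only the first CPU of the package performs
 * the update, every other CPU returns without touching the register.
 * Returns true if this call did the write.
 */
static bool snc_remap_rmids_model(struct fake_package *pkg, int cpu,
				  int first_cpu_in_pkg)
{
	assert(pkg);

	if (cpu != first_cpu_in_pkg)
		return false;

	pkg->snc_config &= ~SNC_CONFIG_LEGACY;	/* other bits are preserved */
	return true;
}
```

Starting from the power-on value 0x1, a call from any CPU other than the
first leaves the value alone, and the call from the first CPU leaves 0x0.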

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-13 18:53       ` Reinette Chatre
@ 2024-05-14  0:21         ` Tony Luck
  2024-05-14 15:08           ` Reinette Chatre
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-14  0:21 UTC (permalink / raw
  To: Reinette Chatre
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/13/2024 10:05 AM, Tony Luck wrote:
> > On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> > 
> > Hi Reinette,
> > 
> > Thanks for the review. Detailed comments below. But overall I'm
> > going to split patch 7 into a bunch of smaller changes, each with
> > a better commit message.
> > 
> >> On 5/3/2024 1:33 PM, Tony Luck wrote:
> >>
> >> (Could you please start the changelog with some context?)
> >>
> >>> Add a field to the rdt_resource structure to track whether monitoring
> >>> resources are tracked by hardware at a different scope (NODE) from
> >>> the legacy L3 scope.
> >>
> >> This seems to describe @mon_scope that was introduced in patch #3?
> > 
> > > Not really. Patch #3 made the change so that control and monitor
> > functions can have different scope. That's still needed as with SNC
> > enabled the underlying data collection is at the node level for
> > monitoring, while control stays at the L3 cache scope.
> > 
> > This new field describes the legacy scope of monitoring, so that
> > resctrl can provide correctly scoped monitor files for legacy
> > applications that aren't aware of SNC. So I'm using this both
> > to indicate when SNC is enabled (with mon_scope != mon_display_scope)
> > or disabled (when they are the same).
> 
> This seems to enforce the idea that these new additions aim to be
> generic on the surface but the only goal is to support SNC.

If you have some more ideas on how to make this more generic and
less SNC specific I'm all ears.

> > 
> >>>
> >>> Add a field to the rdt_mon_domain structure to track the L3 cache id
> >>> which can be used to find all the domains that need resource counts
> >>> summed to provide accurate values in the legacy monitoring files.
> >>
> >> Why is this field necessary? Can this not be obtained dynamically?
> > 
> > I could compute it each time I need it (when making/removing
> > directories, or finding which SNC domains share an L3 domain).
> > 
> > 	id = get_domain_id_from_scope(cpumask_any(&d->cpu_mask), r->mon_display_scope);
> > 	if (id < 0)
> > 		// error path
> > 
> > But it seemed better to just discover this once at domain creation time.
> 
> This may be more clear in the next version?

My goal is to be more clear next version.

> ...
> 
> >>>  	/*
> >>>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> >>>  	 * with a valid event code for supported resource type and the bits
> >>> @@ -207,7 +198,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> >>>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> >>>  	 * are error bits.
> >>>  	 */
> >>> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> >>> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> >>>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
> >>>  
> >>>  	if (msr_val & RMID_VAL_ERROR)
> >>> @@ -291,7 +282,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> >>>  
> >>>  	resctrl_arch_rmid_read_context_check();
> >>>  
> >>> -	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> >>> +	if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
> >>>  		return -EINVAL;
> >>
> >> Does this mean that when SNC is enabled then reading data for an event within a particular
> >> monitor domain ("node scope") can read its data from any CPU within the L3 domain
> >> ("mon_display_scope") even if that CPU is not associated with the node for which it
> >> is reading the data?
> > 
> > Yes.
> > 
> >> If so this really turns many resctrl assumptions and architecture on its head since the
> >> resctrl expectation is that only CPUs within a domain's cpumask can be used to interact
> >> with the domain. This in turn makes this seemingly general feature actually SNC specific.
> > 
> > This is only an expectation for x86 features using IA32_QM_EVTSEL/IA32_QM_CTR
> > MSR method to read counters. ARM doesn't have the "CPU must be in
> > domain" restriction (as far as I can tell). Nor does the Intel IO RDT
> > (which uses MMIO space for control registers, these can be read/written
> > from any CPU).
> > 
> > We do know that those two MSRs can be read from any CPU that shares an
> > L3 cache. It would seem to be pointless overhead to force a cross
> > processor interrupt to read them from a different CPU just to satisfy
> > > a "must be in same domain" non-requirement. I'll split this into its
> > own patch with suitable description.
> 
> I did not suggest that this should be done with multiple IPIs. My comment
> was related to this addition that claims to be generic but really just focuses
> > on support for SNC. Any future addition that may want to build on this would
> need to be aware of these expectations, which are not obvious at this time.

I can add some more comments to make this more obvious.

> ...
> 
>  
> >>>  	return 0;
> >>>  }
> >>>  
> >>> +static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
> >>> +{
> >>> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
> >>> +
> >>> +	return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> >>> +}
> >>> +
> >>> +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> >>> +{
> >>> +	struct rdt_mon_domain *d;
> >>> +	struct rmid_read tmp;
> >>> +	u32 node_rmid;
> >>> +	int ret = 0;
> >>> +
> >>> +	if (!rr->sumdomains) {
> >>> +		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
> >>> +		return ___mon_event_count(closid, node_rmid, rr, &rr->val);
> >>> +	}
> >>> +
> >>> +	tmp = *rr;
> >>> +	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> >>> +		if (d->display_id == rr->d->display_id) {
> >>> +			tmp.d = d;
> >>> +			node_rmid = get_node_rmid(rr->r, d, rmid);
> >>> +			ret = ___mon_event_count(closid, node_rmid, &tmp, &rr->val);
> >>
> >> If I understand correctly this function is run per IPI on a CPU associated
> >> with one of the monitor domains (depends on which one came online first),
> >> and then it will read the monitor data of the other domains from the same
> >> CPU? This is unexpected since the expectation is that monitor data
> >> needs to be read from a CPU associated with the domain it is
> >> reading data for.
> > 
> > See earlier note. The counter can be read from any CPU sharing the same
> > L3. Adding unnecessary IPI is pointless overhead. But I will add
> > comments.
> 
> I did not suggest to add extra IPIs, my comment was related to how this
> feature wedges itself into resctrl.

Sorry for my misunderstanding.

> > 
> >> Also, providing tmp as well as rr->val seems unnecessary?
> > 
> > I think I was unsure about modifying the domain field in the struct
> > rmid_read in the middle of the call chain. But the original caller
> > mon_event_read() doesn't look at rr->domain after the smp_call*()
> > function returns. I will drop "tmp".
> > 
> >>> +			if (ret)
> >>> +				break;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +
> >>>  /*
> >>>   * mbm_bw_count() - Update bw count from values previously read by
> >>>   *		    __mon_event_count().
> >>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> >>> index 0923492a8bd0..a56ae08ca255 100644
> >>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> >>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> >>> @@ -3011,57 +3011,118 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> >>>   * and monitor groups with given domain id.
> >>>   */
> >>>  static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> >>> -					   unsigned int dom_id)
> >>> +					   struct rdt_mon_domain *d)
> >>>  {
> >>>  	struct rdtgroup *prgrp, *crgrp;
> >>> +	struct rdt_mon_domain *dom;
> >>> +	bool remove_all = true;
> >>> +	struct kernfs_node *kn;
> >>> +	char subname[32];
> >>>  	char name[32];
> >>>  
> >>> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> >>> +	if (r->mon_scope != r->mon_display_scope) {
> >>> +		int count = 0;
> >>> +
> >>> +		list_for_each_entry(dom, &r->mon_domains, hdr.list)
> >>> +			if (d->display_id == dom->display_id)
> >>> +				count++;
> >>> +		if (count > 1) {
> >>> +			remove_all = false;
> >>> +			sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> >>> +		}
> >>> +	}
> >>
> >>
> >> This seems awkward. I wonder if it may not be simpler to just
> >> remove the directory and on completion check if the parent has
> >> any subdirectories left and remove the parent if there are no
> >> subdirectories remaining. Something possible via reading the inode's
> >> i_nlink that is accessible via kernfs_get_inode(). What do you think?
> > 
> > kernfs_get_inode() needs a pointer to the "struct super_block" for the
> > filesystem. Resctrl filesystem code doesn't seem to keep track of that
> > anywhere. Only mentioned in rdt_kill_sb() where core kernfs code passes
> > it in as the argument. When registering/mounting the resctrl filesystem
> > there's a "struct fs_context *fc" ... is there a function to get the
> > super block from that? Even if there is, I'd need to add a global to
> > save a copy of the fc_context.
> 
> hmmm ... I expected that struct file or struct dentry may be reachable
> from where sb can be obtained but I can only see that now for the
> paths that provide struct kernfs_open_file.

I'm going to keep this the same then. The "rmdir" call path doesn't have
any open files to plumb down to this function.

> 
> ...
> 
> >
> >>>  	}
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> >>> +				struct rdt_mon_domain *d,
> >>> +				struct rdt_resource *r, struct rdtgroup *prgrp)
> >>> +{
> >>> +	struct kernfs_node *kn, *ckn;
> >>> +	char name[32];
> >>> +	bool do_sum;
> >>> +	int ret;
> >>> +
> >>> +	do_sum = r->mon_scope != r->mon_display_scope;
> >>> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> >>> +	kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> >>> +	if (!kn) {
> >>> +		/* create the directory */
> >>> +		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> >>> +		if (IS_ERR(kn))
> >>> +			return PTR_ERR(kn);
> >>> +
> >>> +		ret = rdtgroup_kn_set_ugid(kn);
> >>> +		if (ret)
> >>> +			goto out_destroy;
> >>> +		ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
> >>
> >> This does not look right. If I understand correctly the private data
> >> of these event files will have whichever mon domain came up first as
> >> its domain id. That seems completely arbitrary and does not reflect
> >> accurate state for this file. Since "do_sum" is essentially a "flag"
> >> on how this file can be treated, can its "dom_id" not rather be
> >> the "monitor scope domain id"? Could that not help to eliminate 
> >> that per-domain "display_id"?
> > 
> > You are correct that this should be the "monitor scope domain id" rather
> > than the first SNC domain that appears. I'll change to use that. I don't
> > think it helps in removing the per-domain display_id.
> 
> Wouldn't the file metadata then be the "display_id"?

Yes. The metadata is the display_id for files that need to sum across
SNC nodes, but the domain id for ones where no summation is needed.
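
A rough user-space sketch of that rule, under stated assumptions: the
struct and function names below are made up for illustration, and the
counters are plain numbers rather than anything read from hardware. With
the flag clear, the id names exactly one domain. With the flag set, the id
is a display_id and every domain carrying it is added up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a monitor domain: one entry per SNC node. */
struct demo_mon_domain {
	int id;          /* monitor-scope (SNC node) domain id */
	int display_id;  /* id of the L3 cache this node belongs to */
	uint64_t count;  /* pretend event counter for this domain */
};

/*
 * sum == 0: domid names exactly one domain; report its counter.
 * sum != 0: domid is a display_id; add up every domain that carries it.
 * Returns 0 on success, -1 if no domain matched.
 */
static int demo_read_event(const struct demo_mon_domain *doms, size_t n,
			   int domid, int sum, uint64_t *val)
{
	int found = 0;
	size_t i;

	assert(doms && val);
	*val = 0;
	for (i = 0; i < n; i++) {
		int match = sum ? doms[i].display_id == domid
				: doms[i].id == domid;

		if (!match)
			continue;
		*val += doms[i].count;
		found = 1;
		if (!sum)
			break;	/* a single domain was requested */
	}
	return found ? 0 : -1;
}
```

On an SNC-2 layout with nodes 0 and 1 behind L3 cache 0, a summed read
with id 0 would add both node counters, while a non-summed read with id 1
would return only node 1.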

> Reinette

-Tony


* Re: [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable
  2024-05-13 18:53       ` Reinette Chatre
@ 2024-05-14  0:28         ` Tony Luck
  0 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2024-05-14  0:28 UTC (permalink / raw
  To: Reinette Chatre
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

On Mon, May 13, 2024 at 11:53:26AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/13/2024 10:17 AM, Tony Luck wrote:
> > On Fri, May 10, 2024 at 02:24:49PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 5/3/2024 1:33 PM, Tony Luck wrote:
> >>> There isn't a simple hardware bit that indicates whether a CPU is
> >>> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> >>> the ratio of NUMA nodes to L3 cache instances.
> >>>
> >>> When SNC mode is detected, reconfigure the RMID counters by updating
> >>> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
> >>>
> >>> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> >>> on the second SNC node to start from zero.
> >>>
> >>> Signed-off-by: Tony Luck <tony.luck@intel.com>
> >>> ---
> >>>  arch/x86/include/asm/msr-index.h   |   1 +
> >>>  arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
> >>>  2 files changed, 120 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> >>> index e72c2b872957..ce54a1ffe1e5 100644
> >>> --- a/arch/x86/include/asm/msr-index.h
> >>> +++ b/arch/x86/include/asm/msr-index.h
> >>> @@ -1165,6 +1165,7 @@
> >>>  #define MSR_IA32_QM_CTR			0xc8e
> >>>  #define MSR_IA32_PQR_ASSOC		0xc8f
> >>>  #define MSR_IA32_L3_CBM_BASE		0xc90
> >>> +#define MSR_RMID_SNC_CONFIG		0xca0
> >>>  #define MSR_IA32_L2_CBM_BASE		0xd10
> >>>  #define MSR_IA32_MBA_THRTL_BASE		0xd50
> >>>  
> >>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> >>> index a949e69308cd..6a1727ea1dfe 100644
> >>> --- a/arch/x86/kernel/cpu/resctrl/core.c
> >>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> >>> @@ -21,7 +21,9 @@
> >>>  #include <linux/err.h>
> >>>  #include <linux/cacheinfo.h>
> >>>  #include <linux/cpuhotplug.h>
> >>> +#include <linux/mod_devicetable.h>
> >>>  
> >>> +#include <asm/cpu_device_id.h>
> >>>  #include <asm/intel-family.h>
> >>>  #include <asm/resctrl.h>
> >>>  #include "internal.h"
> >>> @@ -746,11 +748,42 @@ static void clear_closid_rmid(int cpu)
> >>>  	      RESCTRL_RESERVED_CLOSID);
> >>>  }
> >>>  
> >>> +/*
> >>> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> >>> + * which indicates that RMIDs are configured in legacy mode.
> >>> + * This mode is incompatible with Linux resctrl semantics
> >>> + * as RMIDs are partitioned between SNC nodes, which requires
> >>> + * a user to know which RMID is allocated to a task.
> >>> + * Clearing bit 0 reconfigures the RMID counters for use
> >>> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> >>> + * The RMID space is divided between all SNC nodes with the
> >>> + * RMIDs renumbered to start from zero in each node when
> >>> + * counting operations from tasks. Code to read the counters
> >>> + * must adjust RMID counter numbers based on SNC node. See
> >>> + * __rmid_read() for code that does this.
> >>> + */
> >>> +static void snc_remap_rmids(int cpu)
> >>> +{
> >>> +	u64 val;
> >>> +
> >>> +	/* Only need to enable once per package. */
> >>> +	if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> >>> +		return;
> >>> +
> >>> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> >>> +	val &= ~BIT_ULL(0);
> >>> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> >>> +}
> >>> +
> >>>  static int resctrl_arch_online_cpu(unsigned int cpu)
> >>>  {
> >>>  	struct rdt_resource *r;
> >>>  
> >>>  	mutex_lock(&domain_list_lock);
> >>> +
> >>> +	if (snc_nodes_per_l3_cache > 1)
> >>> +		snc_remap_rmids(cpu);
> >>> +
> >>>  	for_each_capable_rdt_resource(r)
> >>>  		domain_add_cpu(cpu, r);
> >>>  	mutex_unlock(&domain_list_lock);
> >>> @@ -990,11 +1023,97 @@ static __init bool get_rdt_resources(void)
> >>>  	return (rdt_mon_capable || rdt_alloc_capable);
> >>>  }
> >>>  
> >>> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> >>> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> >>> +	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
> >>> +	{}
> >>> +};
> >>> +
> >>> +/*
> >>> + * There isn't a simple hardware bit that indicates whether a CPU is running
> >>> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> >>> + * ratio of NUMA nodes to L3 cache instances.
> >>> + * It is not possible to accurately determine SNC state if the system is
> >>> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> >>> + * to L3 caches. It will be OK if system is booted with hyperthreading
> >>> + * disabled (since this doesn't affect the ratio).
> >>> + */
> >>> +static __init int snc_get_config(void)
> >>> +{
> >>> +	unsigned long *node_caches;
> >>> +	int mem_only_nodes = 0;
> >>> +	int cpu, node, ret;
> >>> +	int num_l3_caches;
> >>> +	int cache_id;
> >>> +
> >>> +	if (!x86_match_cpu(snc_cpu_ids))
> >>> +		return 1;
> >>> +
> >>> +	node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
> >>> +	if (!node_caches)
> >>> +		return 1;
> >>> +
> >>> +	cpus_read_lock();
> >>> +
> >>> +	if (num_online_cpus() != num_present_cpus())
> >>> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> >>> +
> >>> +	for_each_node(node) {
> >>> +		cpu = cpumask_first(cpumask_of_node(node));
> >>> +		if (cpu < nr_cpu_ids) {
> >>> +			cache_id = get_cpu_cacheinfo_id(cpu, 3);
> >>> +			if (cache_id != -1)
> >>> +				set_bit(cache_id, node_caches);
> >>> +		} else {
> >>> +			mem_only_nodes++;
> >>> +		}
> >>> +	}
> >>> +	cpus_read_unlock();
> >>> +
> >>> +	num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
> >>> +	kfree(node_caches);
> >>> +
> >>> +	if (!num_l3_caches)
> >>> +		goto insane;
> >>> +
> >>> +	/* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
> >>> +	if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
> >>> +		goto insane;
> >>> +
> >>> +	ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> >>> +
> >>> +	/* sanity check #2: Only valid results are 1, 2, 3, 4 */
> >>> +	switch (ret) {
> >>> +	case 1:
> >>> +		break;
> >>> +	case 2:
> >>> +	case 3:
> >>> +	case 4:
> >>> +		pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
> >>> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> >>> +		break;
> >>> +	default:
> >>> +		goto insane;
> >>> +	}
> >>> +
> >>> +	return ret;
> >>> +insane:
> >>> +	pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
> >>> +		(nr_node_ids - mem_only_nodes), num_l3_caches);
> >>> +	return 1;
> >>> +}
> >>
> >> I find it confusing how dramatically this SNC detection code changed without
> >> any explanations. This detection seems to match the SNC detection code from v16 but
> >> after v16 you posted a new SNC detection implementation that did SNC detection totally
> >> differently [1] from v16. Instead of keeping with the "new" detection this implements
> >> what was in v16. Could you please help me understand what motivated the different
> >> implementations and why the big differences?
> > 
> > Reinette,
> > 
> > Do you like the detection code in that version? You didn't make any
> > comments about it.
> 
> It was a drop-in replacement for a portion that was not relevant to the
> architecture discussion that I focused on ... hence my surprise that it
> just came and went without any comment.

So it will be back again when I post v18 as it is somewhat simpler
(doesn't rely on allocating a bitmap to count L3 cache instances).

I'll update comments in that patch, in the code, and in the change
log in the cover letter.

> > I switched back to the v16 code because that had survived review before
> > and I just wanted to make the modifications to add both per-L3 and
> > per-SNC node monitoring files.
> > 
> > I can pull that into the next iteration if you want.
> 
> It is not clear to me why you switched back and forth between the detection
> algorithms. I expect big changes to be accompanied with explanation of what changed,
> why one is better than the other, or if they are considered "similar", what
> are the pros/cons. Am I missing something so obvious that causes you to think
> the work does not need the explanation I asked your help with?

The change deserved some comments when it suddenly appeared. One of the
many issues with my detour from the progression. It disappeared because
I reverted to the previously reviewed version.

> Reinette

-Tony
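
For reference, the ratio check in the quoted snc_get_config() can be
sketched in user space as below. This is only an illustration under
assumptions: the function name is hypothetical, and the node and cache
counts are passed in by the caller, whereas the patch derives them from
the NUMA topology and cacheinfo. It does not model the replacement
detection code mentioned for v18.

```c
#include <assert.h>

/*
 * Return the number of SNC nodes per L3 cache, or 1 (treated as
 * "SNC not enabled") when the counts fail the sanity checks.
 */
static int demo_snc_nodes_per_l3(int total_nodes, int mem_only_nodes,
				 int num_l3_caches)
{
	int cpu_nodes, ratio;

	/* Caller contract: memory-only nodes are a subset of all nodes. */
	assert(mem_only_nodes >= 0 && mem_only_nodes <= total_nodes);

	cpu_nodes = total_nodes - mem_only_nodes;
	if (num_l3_caches <= 0 || cpu_nodes <= 0)
		return 1;

	/* Sanity check 1: CPU nodes must be a multiple of the L3 count. */
	if (cpu_nodes % num_l3_caches)
		return 1;

	/* Sanity check 2: only ratios 1 through 4 are accepted. */
	ratio = cpu_nodes / num_l3_caches;
	return (ratio >= 1 && ratio <= 4) ? ratio : 1;
}
```

For example, four CPU nodes over two L3 caches gives 2, while three CPU
nodes over two L3 caches fails the first check and falls back to 1.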


* Re: [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems
  2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (8 preceding siblings ...)
  2024-05-03 20:33 ` [PATCH v17 9/9] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2024-05-14 15:02 ` Maciej Wieczor-Retman
  9 siblings, 0 replies; 26+ messages in thread
From: Maciej Wieczor-Retman @ 2024-05-14 15:02 UTC (permalink / raw
  To: Tony Luck
  Cc: Fenghua Yu, Reinette Chatre, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

Hi,
Thanks for adding the 7th patch. I tested the series on a Sapphire Rapids system
with the resctrl selftests, and the summation works well on SNC-2.

Now I can also drop a few patches from my series on SNC in resctrl selftests and
only keep the part about discovering kernel support.

On 2024-05-03 at 13:33:16 -0700, Tony Luck wrote:
>Note: Jump straight to patch 7 for the new stuff. Just minor tweaks in
>other patches.
>
>This series is based on top of the TIP x86/cache branch:
> 931be446c6cb ("x86/resctrl: Add tracepoint for llc_occupancy tracking")
>
>The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
>that share an L3 cache into two or more sets. This plays havoc with the
>Resource Director Technology (RDT) monitoring features.  Prior to this
>patch Intel has advised that SNC and RDT are incompatible.
>
>Some of these CPUs support an MSR that can partition the RMID counters
>in the same way. This allows monitoring features to be used. Legacy
>monitoring files provide the sum of counters from each SNC node for
>backwards compatibility. Additional files per SNC node provide details
>per node.
>
>Cache and memory bandwidth allocation features continue to operate at
>the scope of the L3 cache.
>
>Signed-off-by: Tony Luck <tony.luck@intel.com>
>
>---
>Changes since v16: https://lore.kernel.org/all/20240312214247.91772-1-tony.luck@intel.com/
>
>Patch 1: Reinette pointed out that rdt_find_domain() no longer returns ERR_PTR()
>but one of the callers was still checking return with IS_ERR().
>
>Patch 2: Tip tree added a tracing patch. That needed s/d->id/d->hdr.id/
>
>Patch 3: Reinette: Keep the "RCU" in the kerneldoc description of the
>domain list fields after the split into separate ctrl/mon lists.
>
>Patch 4: No change
>
>Patch 5: No change
>
>Patch 6: Drop the change that divided output of the resctrl "size" file
>by the number of SNC domains per L3 cache. Now that this series
>preserves the contents of the legacy llc_occupancy files this isn't
>useful.
>
>Patch 7: NEW in this series. Add per-SNC domain monitor files while
>making the original files sum across SNC nodes.
>
>Patch 8: (formerly 7) No change
>
>Patch 9: (formerly 8) Add documentation for new per-SNC directories and files
>
>Tony Luck (9):
>  x86/resctrl: Prepare for new domain scope
>  x86/resctrl: Prepare to split rdt_domain structure
>  x86/resctrl: Prepare for different scope for control/monitor
>    operations
>  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
>  x86/resctrl: Add node-scope to the options for feature scope
>  x86/resctrl: Introduce snc_nodes_per_l3_cache
>  x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC)
>    monitoring
>  x86/resctrl: Sub NUMA Cluster detection and enable
>  x86/resctrl: Update documentation with Sub-NUMA cluster changes
>
> Documentation/arch/x86/resctrl.rst        |  17 +
> include/linux/resctrl.h                   |  89 +++--
> arch/x86/include/asm/msr-index.h          |   1 +
> arch/x86/kernel/cpu/resctrl/internal.h    |  72 ++--
> arch/x86/kernel/cpu/resctrl/core.c        | 430 ++++++++++++++++++----
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  57 +--
> arch/x86/kernel/cpu/resctrl/monitor.c     |  98 +++--
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 263 ++++++++-----
> 9 files changed, 759 insertions(+), 294 deletions(-)
>
>
>base-commit: 931be446c6cbc15691dd499957e961f4e1d56afb
>-- 
>2.44.0
>

Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>

-- 
Kind regards
Maciej Wieczór-Retman


* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-14  0:21         ` Tony Luck
@ 2024-05-14 15:08           ` Reinette Chatre
  2024-05-14 18:26             ` Luck, Tony
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-14 15:08 UTC (permalink / raw
  To: Tony Luck
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86, linux-kernel, patches

Hi Tony,

On 5/13/2024 5:21 PM, Tony Luck wrote:
> On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
>> On 5/13/2024 10:05 AM, Tony Luck wrote:
>>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
>>> Thanks for the review. Detailed comments below. But overall I'm
>>> going to split patch 7 into a bunch of smaller changes, each with
>>> a better commit message.
>>>
>>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
>>>>
>>>> (Could you please start the changelog with some context?)
>>>>
>>>>> Add a field to the rdt_resource structure to track whether monitoring
>>>>> resources are tracked by hardware at a different scope (NODE) from
>>>>> the legacy L3 scope.
>>>>
>>>> This seems to describe @mon_scope that was introduced in patch #3?
>>>
>>> Not really. Patch #3 made the change so that control and monitor
>>> functions can have different scope. That's still needed as with SNC
>>> enabled the underlying data collection is at the node level for
>>> monitoring, while control stays at the L3 cache scope.
>>>
>>> This new field describes the legacy scope of monitoring, so that
>>> resctrl can provide correctly scoped monitor files for legacy
>>> applications that aren't aware of SNC. So I'm using this both
>>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
>>> or disabled (when they are the same).
>>
>> This seems to enforce the idea that these new additions aim to be
>> generic on the surface but the only goal is to support SNC.
> 
> If you have some more ideas on how to make this more generic and
> less SNC specific I'm all ears.

It may not end up being totally generic. It should not pretend to be
when it is not. It makes the flows difficult to follow when there are
these unexpected checks/quirks in what claims to be core code.

>>>>>  	}
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>>>> +				struct rdt_mon_domain *d,
>>>>> +				struct rdt_resource *r, struct rdtgroup *prgrp)
>>>>> +{
>>>>> +	struct kernfs_node *kn, *ckn;
>>>>> +	char name[32];
>>>>> +	bool do_sum;
>>>>> +	int ret;
>>>>> +
>>>>> +	do_sum = r->mon_scope != r->mon_display_scope;
>>>>> +	sprintf(name, "mon_%s_%02d", r->name, d->display_id);
>>>>> +	kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
>>>>> +	if (!kn) {
>>>>> +		/* create the directory */
>>>>> +		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>>>>> +		if (IS_ERR(kn))
>>>>> +			return PTR_ERR(kn);
>>>>> +
>>>>> +		ret = rdtgroup_kn_set_ugid(kn);
>>>>> +		if (ret)
>>>>> +			goto out_destroy;
>>>>> +		ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
>>>>
>>>> This does not look right. If I understand correctly the private data
>>>> of these event files will have whichever mon domain came up first as
>>>> its domain id. That seems completely arbitrary and does not reflect
>>>> accurate state for this file. Since "do_sum" is essentially a "flag"
>>>> on how this file can be treated, can its "dom_id" not rather be
>>>> the "monitor scope domain id"? Could that not help to eliminate 
>>>> that per-domain "display_id"?
>>>
>>> You are correct that this should be the "monitor scope domain id" rather
>>> than the first SNC domain that appears. I'll change to use that. I don't
>>> think it helps in removing the per-domain display_id.
>>
>> Wouldn't the file metadata then be the "display_id"?
> 
> Yes. The metadata is the display_id for files that need to sum across
> SNC nodes, but the domain id for ones where no summation is needed.

Right ... and there is a "sum" flag to tell which is which?

Reinette


* RE: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-14 15:08           ` Reinette Chatre
@ 2024-05-14 18:26             ` Luck, Tony
  2024-05-14 20:30               ` Reinette Chatre
  0 siblings, 1 reply; 26+ messages in thread
From: Luck, Tony @ 2024-05-14 18:26 UTC (permalink / raw
  To: Chatre, Reinette
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

> On 5/13/2024 5:21 PM, Tony Luck wrote:
> > On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
> >> On 5/13/2024 10:05 AM, Tony Luck wrote:
> >>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
> >>> Thanks for the review. Detailed comments below. But overall I'm
> >>> going to split patch 7 into a bunch of smaller changes, each with
> >>> a better commit message.
> >>>
> >>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
> >>>>
> >>>> (Could you please start the changelog with some context?)
> >>>>
> >>>>> Add a field to the rdt_resource structure to track whether monitoring
> >>>>> resources are tracked by hardware at a different scope (NODE) from
> >>>>> the legacy L3 scope.
> >>>>
> >>>> This seems to describe @mon_scope that was introduced in patch #3?
> >>>
> >>> Not really. Patch #3 made the change so that control and monitor
> >>> functions can have different scope. That's still needed as with SNC
> >>> enabled the underlying data collection is at the node level for
> >>> monitoring, while control stays at the L3 cache scope.
> >>>
> >>> This new field describes the legacy scope of monitoring, so that
> >>> resctrl can provide correctly scoped monitor files for legacy
> >>> applications that aren't aware of SNC. So I'm using this both
> >>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
> >>> or disabled (when they are the same).
> >>
> >> This seems to enforce the idea that these new additions aim to be
> >> generic on the surface but the only goal is to support SNC.
> >
> > If you have some more ideas on how to make this more generic and
> > less SNC specific I'm all ears.
>
> It may not end up being totally generic. It should not pretend to be
> when it is not. It makes the flows difficult to follow when there are
> these unexpected checks/quirks in what claims to be core code.

Do you want some sort of warning comments in pieces of code
that are SNC specific?

>
> >>>>>         }
> >>>>> +
> >>>>> +       return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> >>>>> +                               struct rdt_mon_domain *d,
> >>>>> +                               struct rdt_resource *r, struct rdtgroup *prgrp)
> >>>>> +{
> >>>>> +       struct kernfs_node *kn, *ckn;
> >>>>> +       char name[32];
> >>>>> +       bool do_sum;
> >>>>> +       int ret;
> >>>>> +
> >>>>> +       do_sum = r->mon_scope != r->mon_display_scope;
> >>>>> +       sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> >>>>> +       kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> >>>>> +       if (!kn) {
> >>>>> +               /* create the directory */
> >>>>> +               kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> >>>>> +               if (IS_ERR(kn))
> >>>>> +                       return PTR_ERR(kn);
> >>>>> +
> >>>>> +               ret = rdtgroup_kn_set_ugid(kn);
> >>>>> +               if (ret)
> >>>>> +                       goto out_destroy;
> >>>>> +               ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
> >>>>
> >>>> This does not look right. If I understand correctly the private data
> >>>> of these event files will have whichever mon domain came up first as
> >>>> its domain id. That seems completely arbitrary and does not reflect
> >>>> accurate state for this file. Since "do_sum" is essentially a "flag"
> >>>> on how this file can be treated, can its "dom_id" not rather be
> >>>> the "monitor scope domain id"? Could that not help to eliminate
> >>>> that per-domain "display_id"?
> >>>
> >>> You are correct that this should be the "monitor scope domain id" rather
> >>> than the first SNC domain that appears. I'll change to use that. I don't
> >>> think it helps in removing the per-domain display_id.
> >>
> >> Wouldn't the file metadata then be the "display_id"?
> >
> > Yes. The metadata is the display_id for files that need to sum across
> > SNC nodes, but the domain id for ones where no summation is needed.
>
> Right ... and there is a "sum" flag to tell which is which?

Yes. sum==0 means the domid field is the one and only domain to
report for this resctrl monitor file. sum==1 means the domid field is
the display_id - all domains with this display_id must be summed to
provide the result to present to the user.

I've tried to capture that in the kerneldoc comment for union mon_data_bits.
Here's what I'm planning to include in v18 (Outlook will probably mangle
the formatting ... just imagine that the text lines up neatly):

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 49440f194253..3411557d761a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -132,14 +132,19 @@ struct mon_evt {
  *                     as kernfs private data
  * @rid:               Resource id associated with the event file
  * @evtid:             Event id associated with the event file
- * @domid:             The domain to which the event file belongs
+ * @sum:               Set when event must be summed across multiple
+ *                     domains.
+ * @domid:             When @sum is zero this is the domain to which
+ *                     the event file belongs. When sum is one this
+ *                     is the display_id of all domains to be summed
  * @u:                 Name of the bit fields struct
  */
 union mon_data_bits {
        void *priv;
        struct {
                unsigned int rid                : 10;
-               enum resctrl_event_id evtid     : 8;
+               enum resctrl_event_id evtid     : 7;
+               unsigned int sum                : 1;
                unsigned int domid              : 14;
        } u;
 };
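
As a sanity check on the bit budget above, here is a stand-alone sketch
showing that taking one bit from evtid for the new flag keeps the fields at
32 bits in total, so they still fit inside the pointer-sized private data.
The demo_* names are hypothetical, and evtid is declared as plain unsigned
int here rather than an enum bit-field to stay within portable C.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative copy of the layout: 10 + 7 + 1 + 14 bits. */
union demo_mon_data_bits {
	void *priv;
	struct {
		unsigned int rid   : 10;
		unsigned int evtid : 7;
		unsigned int sum   : 1;
		unsigned int domid : 14;
	} u;
};

_Static_assert(10 + 7 + 1 + 14 == 32, "fields must total 32 bits");
_Static_assert(sizeof(((union demo_mon_data_bits *)0)->u) <= sizeof(void *),
	       "bit fields must fit in the private data pointer");

/* Pack the four fields into the pointer-sized cookie. */
static void *demo_pack(unsigned int rid, unsigned int evtid,
		       unsigned int sum, unsigned int domid)
{
	union demo_mon_data_bits b;

	/* Reject values that would be silently truncated by the bit-fields. */
	assert(rid < (1u << 10) && evtid < (1u << 7) &&
	       sum <= 1 && domid < (1u << 14));

	b.priv = NULL;
	b.u.rid = rid;
	b.u.evtid = evtid;
	b.u.sum = sum;
	b.u.domid = domid;
	return b.priv;
}
```

Reading the cookie back is the reverse: store it in priv and read the u
fields, which is how the flag and the id would be recovered together.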

-Tony


* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-14 18:26             ` Luck, Tony
@ 2024-05-14 20:30               ` Reinette Chatre
  2024-05-14 21:53                 ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-14 20:30 UTC (permalink / raw
  To: Luck, Tony
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 5/14/2024 11:26 AM, Luck, Tony wrote:
>> On 5/13/2024 5:21 PM, Tony Luck wrote:
>>> On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
>>>> On 5/13/2024 10:05 AM, Tony Luck wrote:
>>>>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
>>>>> Thanks for the review. Detailed comments below. But overall I'm
>>>>> going to split patch 7 into a bunch of smaller changes, each with
>>>>> a better commit message.
>>>>>
>>>>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
>>>>>>
>>>>>> (Could you please start the changelog with some context?)
>>>>>>
>>>>>>> Add a field to the rdt_resource structure to track whether monitoring
>>>>>>> resources are tracked by hardware at a different scope (NODE) from
>>>>>>> the legacy L3 scope.
>>>>>>
>>>>>> This seems to describe @mon_scope that was introduced in patch #3?
>>>>>
>>>>> Not really. Patch #3 made the change so that control and monitor
>>>>> functions can have different scope. That's still needed as with SNC
>>>>> enabled the underlying data collection is at the node level for
>>>>> monitoring, while control stays at the L3 cache scope.
>>>>>
>>>>> This new field describes the legacy scope of monitoring, so that
>>>>> resctrl can provide correctly scoped monitor files for legacy
>>>>> applications that aren't aware of SNC. So I'm using this both
>>>>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
>>>>> or disabled (when they are the same).
>>>>
>>>> This seems to enforce the idea that these new additions aim to be
>>>> generic on the surface but the only goal is to support SNC.
>>>
>>> If you have some more ideas on how to make this more generic and
>>> less SNC specific I'm all ears.
>>
>> It may not end up being totally generic. It should not pretend to be
>> when it is not. It makes the flows difficult to follow when there are
>> these unexpected checks/quirks in what claims to be core code.
> 
> Do you want some sort of warning comments in pieces of code
> that are SNC specific?

I cannot think now where warnings will be appropriate but if you
find instances then please do. To start the quirks can at least be
documented. For example, "Only user of <feature> is SNC, which does
not require <custom> so simplify by <describe shortcut> ..."

> 
>>
>>>>>>>         }
>>>>>>> +
>>>>>>> +       return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>>>>>> +                               struct rdt_mon_domain *d,
>>>>>>> +                               struct rdt_resource *r, struct rdtgroup *prgrp)
>>>>>>> +{
>>>>>>> +       struct kernfs_node *kn, *ckn;
>>>>>>> +       char name[32];
>>>>>>> +       bool do_sum;
>>>>>>> +       int ret;
>>>>>>> +
>>>>>>> +       do_sum = r->mon_scope != r->mon_display_scope;
>>>>>>> +       sprintf(name, "mon_%s_%02d", r->name, d->display_id);
>>>>>>> +       kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
>>>>>>> +       if (!kn) {
>>>>>>> +               /* create the directory */
>>>>>>> +               kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>>>>>>> +               if (IS_ERR(kn))
>>>>>>> +                       return PTR_ERR(kn);
>>>>>>> +
>>>>>>> +               ret = rdtgroup_kn_set_ugid(kn);
>>>>>>> +               if (ret)
>>>>>>> +                       goto out_destroy;
>>>>>>> +               ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
>>>>>>
>>>>>> This does not look right. If I understand correctly the private data
>>>>>> of these event files will have whichever mon domain came up first as
>>>>>> its domain id. That seems completely arbitrary and does not reflect
>>>>>> accurate state for this file. Since "do_sum" is essentially a "flag"
>>>>>> on how this file can be treated, can its "dom_id" not rather be
>>>>>> the "monitor scope domain id"? Could that not help to eliminate
>>>>>> that per-domain "display_id"?
>>>>>
>>>>> You are correct that this should be the "monitor scope domain id" rather
>>>>> than the first SNC domain that appears. I'll change to use that. I don't
>>>>> think it helps in removing the per-domain display_id.
>>>>
>>>> Wouldn't the file metadata then be the "display_id"?
>>>
>>> Yes. The metadata is the display_id for files that need to sum across
>>> SNC nodes, but the domain id for ones where no summation is needed.
>>
>> Right ... and there is a "sum" flag to tell which is which?
> 
> Yes. sum==0 means the domid field is the one and only domain to
> report for this resctrl monitor file. sum==1 means the domid field is
> the display_id - all domains with this display_id must be summed to
> provide the result to present to the user.
> 
> I've tried to capture that in the kerneldoc comment for union mon_data_bits.
> Here's what I'm planning to include in v18 (Outlook will probably mangle
> the formatting ... just imagine that the text lines up neatly):
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 49440f194253..3411557d761a 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -132,14 +132,19 @@ struct mon_evt {
>   *                     as kernfs private data
>   * @rid:               Resource id associated with the event file
>   * @evtid:             Event id associated with the event file
> - * @domid:             The domain to which the event file belongs
> + * @sum:               Set when event must be summed across multiple
> + *                     domains.
> + * @domid:             When @sum is zero this is the domain to which
> + *                     the event file belongs. When sum is one this
> + *                     is the display_id of all domains to be summed

Here is where I would like to understand why it cannot just be
"When sum is one this is the domain id of the scope at which (for which?)
the events must be summed." Although, you already mentioned this will be
clear in next posting.

>   * @u:                 Name of the bit fields struct
>   */
>  union mon_data_bits {
>         void *priv;
>         struct {
>                 unsigned int rid                : 10;
> -               enum resctrl_event_id evtid     : 8;
> +               enum resctrl_event_id evtid     : 7;
> +               unsigned int sum                : 1;
>                 unsigned int domid              : 14;
>         } u;
>  };
> 
> -Tony

Reinette


* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-14 20:30               ` Reinette Chatre
@ 2024-05-14 21:53                 ` Tony Luck
  2024-05-15 16:47                   ` Reinette Chatre
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-14 21:53 UTC
  To: Reinette Chatre
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

On Tue, May 14, 2024 at 01:30:05PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/14/2024 11:26 AM, Luck, Tony wrote:
> >> On 5/13/2024 5:21 PM, Tony Luck wrote:
> >>> On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
> >>>> On 5/13/2024 10:05 AM, Tony Luck wrote:
> >>>>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
> >>>>> Thanks for the review. Detailed comments below. But overall I'm
> >>>>> going to split patch 7 into a bunch of smaller changes, each with
> >>>>> a better commit message.
> >>>>>
> >>>>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
> >>>>>>
> >>>>>> (Could you please start the changelog with some context?)
> >>>>>>
> >>>>>>> Add a field to the rdt_resource structure to track whether monitoring
> >>>>>>> resources are tracked by hardware at a different scope (NODE) from
> >>>>>>> the legacy L3 scope.
> >>>>>>
> >>>>>> This seems to describe @mon_scope that was introduced in patch #3?
> >>>>>
> >>>>> Not really. Patch #3 made the change so that control and monitor
> >>>>> functions can have different scope. That's still needed as with SNC
> >>>>> enabled the underlying data collection is at the node level for
> >>>>> monitoring, while control stays at the L3 cache scope.
> >>>>>
> >>>>> This new field describes the legacy scope of monitoring, so that
> >>>>> resctrl can provide correctly scoped monitor files for legacy
> >>>>> applications that aren't aware of SNC. So I'm using this both
> >>>>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
> >>>>> or disabled (when they are the same).
> >>>>
> >>>> This seems to enforce the idea that these new additions aim to be
> >>>> generic on the surface but the only goal is to support SNC.
> >>>
> >>> If you have some more ideas on how to make this more generic and
> >>> less SNC specific I'm all ears.
> >>
> >> It may not end up being totally generic. It should not pretend to be
> >> when it is not. It makes the flows difficult to follow when there are
> >> these unexpected checks/quirks in what claims to be core code.
> > 
> > Do you want some sort of warning comments in pieces of code
> > that are SNC specific?
> 
> I cannot think now where warnings will be appropriate but if you
> find instances then please do. To start the quirks can at least be
> documented. For example, "Only user of <feature> is SNC, which does
> not require <custom> so simplify by <describe shortcut> ..."

The main spot that triggered this line of discussion was changing the
sanity check that operations to read monitors are being done from a
CPU within the right domain. I've added a short comment on the new
check:

-       if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
+       /* Event counts can only be read from a CPU on the same L3 cache */
+       if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
                return -EINVAL;

But my change embeds the assumption that monitor events are L3 scoped.

Should it be something like this (to keep the non-SNC case generic):

	if (r->mon_scope == r->mon_display_scope) {
		if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
			return -EINVAL;
	} else {
		/*
		 * SNC: OK to read events on any CPU sharing same L3
		 * cache instance.
		 */
		if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
			return -EINVAL;
	}
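
For illustration only, here is a user-space model of that two-way check.
It is a sketch, not resctrl code: struct resource, struct mon_domain,
l3_id_of_cpu() and check_read_cpu() are hypothetical stand-ins, and the
toy topology (12 CPUs, two L3 caches of six CPUs each) is an assumption
made just so the logic can be exercised.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum scope { SCOPE_L3, SCOPE_NODE };

struct mon_domain {
	unsigned long cpu_mask;	/* bit N set: CPU N belongs to this domain */
	int display_id;		/* id of the enclosing L3 cache instance */
};

struct resource {
	enum scope mon_scope;
	enum scope mon_display_scope;
};

/* Toy topology (assumption): 12 CPUs, two L3 caches, six CPUs each. */
static int l3_id_of_cpu(int cpu)
{
	return cpu / 6;
}

/* Return 0 if @cpu may read counters for @d, -EINVAL otherwise. */
static int check_read_cpu(const struct resource *r,
			  const struct mon_domain *d, int cpu)
{
	if (r->mon_scope == r->mon_display_scope) {
		/* Non-SNC: the CPU must be a member of the domain itself. */
		if (!(d->cpu_mask & (1UL << cpu)))
			return -EINVAL;
	} else {
		/* SNC: any CPU sharing the same L3 cache instance is OK. */
		if (d->display_id != l3_id_of_cpu(cpu))
			return -EINVAL;
	}
	return 0;
}
```

With SNC enabled, a CPU outside the node but on the same L3 passes the
check. With the scopes equal, only CPUs in the domain's own mask do.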

> 
> > 
> >>
> >>>>>>>         }
> >>>>>>> +
> >>>>>>> +       return 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> >>>>>>> +                               struct rdt_mon_domain *d,
> >>>>>>> +                               struct rdt_resource *r, struct rdtgroup *prgrp)
> >>>>>>> +{
> >>>>>>> +       struct kernfs_node *kn, *ckn;
> >>>>>>> +       char name[32];
> >>>>>>> +       bool do_sum;
> >>>>>>> +       int ret;
> >>>>>>> +
> >>>>>>> +       do_sum = r->mon_scope != r->mon_display_scope;
> >>>>>>> +       sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> >>>>>>> +       kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> >>>>>>> +       if (!kn) {
> >>>>>>> +               /* create the directory */
> >>>>>>> +               kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> >>>>>>> +               if (IS_ERR(kn))
> >>>>>>> +                       return PTR_ERR(kn);
> >>>>>>> +
> >>>>>>> +               ret = rdtgroup_kn_set_ugid(kn);
> >>>>>>> +               if (ret)
> >>>>>>> +                       goto out_destroy;
> >>>>>>> +               ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
> >>>>>>
> >>>>>> This does not look right. If I understand correctly the private data
> >>>>>> of these event files will have whichever mon domain came up first as
> >>>>>> its domain id. That seems completely arbitrary and does not reflect
> >>>>>> accurate state for this file. Since "do_sum" is essentially a "flag"
> >>>>>> on how this file can be treated, can its "dom_id" not rather be
> >>>>>> the "monitor scope domain id"? Could that not help to eliminate
> >>>>>
> >>>>> You are correct that this should be the "monitor scope domain id" rather
> >>>>> than the first SNC domain that appears. I'll change to use that. I don't
> >>>>> think it helps in removing the per-domain display_id.
> >>>>
> >>>> Wouldn't the file metadata then be the "display_id"?
> >>>
> >>> Yes. The metadata is the display_id for files that need to sum across
> >>> SNC nodes, but the domain id for ones where no summation is needed.
> >>
> >> Right ... and there is a "sum" flag to tell which is which?
> > 
> > Yes. sum==0 means the domid field is the one and only domain to
> > report for this resctrl monitor file. sum==1 means the domid field is
> > the display_id - all domains with this display_id must be summed to
> > provide the result to present to the user.
> > 
> > I've tried to capture that in the kerneldoc comment for struct mon_event.
> > Here's what I'm planning to include in v18 (Outlook will probably mangle
> > the formatting ... just imagine that the text lines up neatly):
> > 
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 49440f194253..3411557d761a 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -132,14 +132,19 @@ struct mon_evt {
> >   *                     as kernfs private data
> >   * @rid:               Resource id associated with the event file
> >   * @evtid:             Event id associated with the event file
> > - * @domid:             The domain to which the event file belongs
> > + * @sum:               Set when event must be summed across multiple
> > + *                     domains.
> > + * @domid:             When @sum is zero this is the domain to which
> > + *                     the event file belongs. When sum is one this
> > + *                     is the display_id of all domains to be summed
> 
> Here is where I would like to understand why it cannot just be
> "When sum is one this is the domain id of the scope at which (for which?)
> the events must be summed." Although, you already mentioned this will be
> clear in next posting.
> 
> >   * @u:                 Name of the bit fields struct
> >   */
> >  union mon_data_bits {
> >         void *priv;
> >         struct {
> >                 unsigned int rid                : 10;
> > -               enum resctrl_event_id evtid     : 8;
> > +               enum resctrl_event_id evtid     : 7;
> > +               unsigned int sum                : 1;
> >                 unsigned int domid              : 14;
> >         } u;
> >  };
> > 
> > -Tony

Maybe an example might help. Assume an SNC system with two sockets,
three SNC nodes per socket, only supporting monitoring. The only domain
list created by resctrl is the mon_domains list on the RDT_RESOURCE_L3
resource. And it looks like this (with "display_id" abbreviated to
"dspl" to keep the picture small):


       <------ SNC NODES ON SOCKET 0 ----->   <------ SNC NODES ON SOCKET 1 ------>
----> +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
      | id = 0   | | id = 1   | | id = 2   | | id = 3   | | id = 4   | | id = 5   |
      |          | |          | |          | |          | |          | |          |
      | dspl = 0 | | dspl = 0 | | dspl = 0 | | dspl = 1 | | dspl = 1 | | dspl = 1 |
      |          | |          | |          | |          | |          | |          |
      +----------+ +----------+ +----------+ +----------+ +----------+ +----------+

Reading the per-SNC node monitor values looks just the same as the
non-SNC case. The struct rmid_read passed across the smp_call*() has
the resource, domain, event, and reading the counters is essentially
unchanged.

Reading a file to sum event counts for SNC nodes on socket 1 needs to
find each of the "struct rdt_mon_domain" that are part of socket 1.
I'm doing that with meta data in the file that says sum=1 (need to add
up something) and domid=1 (the things to be added are those with
display_id = 1). So the code reads:

	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
		if (d->display_id == rr->d->display_id) {
			... call stuff to read and sum for domain "d"
		}
	}

The display_id is "the domain id of the scope at which (for which?)
the events must be summed." in your text above.
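
As a rough user-space sketch of how the sum/domid metadata drives that
loop (not the actual resctrl code): the union below copies the proposed
bit layout but uses plain unsigned int for evtid, and struct mon_domain,
the mon_domains[] array, the per-node counts and read_event() are all
made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified copy of the proposed layout; evtid is plain unsigned here. */
union mon_data_bits {
	void *priv;
	struct {
		unsigned int rid	: 10;
		unsigned int evtid	: 7;
		unsigned int sum	: 1;
		unsigned int domid	: 14;
	} u;
};

/* Hypothetical stand-in for struct rdt_mon_domain. */
struct mon_domain {
	int id;
	int display_id;
	uint64_t count;	/* made-up event count for this SNC node */
};

/* Two sockets, three SNC nodes each, matching the picture above. */
static const struct mon_domain mon_domains[] = {
	{ 0, 0, 10 },  { 1, 0, 20 },  { 2, 0, 30 },
	{ 3, 1, 100 }, { 4, 1, 200 }, { 5, 1, 300 },
};

/* Return the value a monitor file with metadata @md would report. */
static uint64_t read_event(union mon_data_bits md)
{
	size_t n = sizeof(mon_domains) / sizeof(mon_domains[0]);
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		const struct mon_domain *d = &mon_domains[i];

		if (md.u.sum) {
			/* domid holds a display_id: add every matching node */
			if (d->display_id == (int)md.u.domid)
				total += d->count;
		} else if (d->id == (int)md.u.domid) {
			/* domid names exactly one domain */
			return d->count;
		}
	}
	return total;
}
```

With sum=1 and domid=1 the three socket 1 nodes are added together. With
sum=0 and domid=4 only the node with id 4 is reported.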

> Reinette

-Tony

* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-14 21:53                 ` Tony Luck
@ 2024-05-15 16:47                   ` Reinette Chatre
  2024-05-15 17:23                     ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Reinette Chatre @ 2024-05-15 16:47 UTC
  To: Tony Luck
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 5/14/2024 2:53 PM, Tony Luck wrote:
> On Tue, May 14, 2024 at 01:30:05PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/14/2024 11:26 AM, Luck, Tony wrote:
>>>> On 5/13/2024 5:21 PM, Tony Luck wrote:
>>>>> On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
>>>>>> On 5/13/2024 10:05 AM, Tony Luck wrote:
>>>>>>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
>>>>>>> Thanks for the review. Detailed comments below. But overall I'm
>>>>>>> going to split patch 7 into a bunch of smaller changes, each with
>>>>>>> a better commit message.
>>>>>>>
>>>>>>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
>>>>>>>>
>>>>>>>> (Could you please start the changelog with some context?)
>>>>>>>>
>>>>>>>>> Add a field to the rdt_resource structure to track whether monitoring
>>>>>>>>> resources are tracked by hardware at a different scope (NODE) from
>>>>>>>>> the legacy L3 scope.
>>>>>>>>
>>>>>>>> This seems to describe @mon_scope that was introduced in patch #3?
>>>>>>>
>>>>>>> Not really. Patch #3 made the change so that control and monitor
>>>>>>> functions can have different scope. That's still needed as with SNC
>>>>>>> enabled the underlying data collection is at the node level for
>>>>>>> monitoring, while control stays at the L3 cache scope.
>>>>>>>
>>>>>>> This new field describes the legacy scope of monitoring, so that
>>>>>>> resctrl can provide correctly scoped monitor files for legacy
>>>>>>> applications that aren't aware of SNC. So I'm using this both
>>>>>>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
>>>>>>> or disabled (when they are the same).
>>>>>>
>>>>>> This seems to enforce the idea that these new additions aim to be
>>>>>> generic on the surface but the only goal is to support SNC.
>>>>>
>>>>> If you have some more ideas on how to make this more generic and
>>>>> less SNC specific I'm all ears.
>>>>
>>>> It may not end up being totally generic. It should not pretend to be
>>>> when it is not. It makes the flows difficult to follow when there are
>>>> these unexpected checks/quirks in what claims to be core code.
>>>
>>> Do you want some sort of warning comments in pieces of code
>>> that are SNC specific?
>>
>> I cannot think now where warnings will be appropriate but if you
>> find instances then please do. To start the quirks can at least be
>> documented. For example, "Only user of <feature> is SNC, which does
>> not require <custom> so simplify by <describe shortcut> ..."
> 
> The main spot that triggered this line of discussion was changing the
> sanity check that operations to read monitors is being done from a
> CPU within the right domain. I've added a short comment on the new
> check:
> 
> -       if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> +       /* Event counts can only be read from a CPU on the same L3 cache */
> +       if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
>                 return -EINVAL;
> 
> But my change embeds the assumption that monitor events are L3 scoped.
> 
> Should it be something like this (to keep the non-SNC case generic):
> 
> 	if (r->mon_scope == r->mon_display_scope) {
> 		if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> 			return -EINVAL;

Yes, keeping this check looks good to me ...

> 	} else {
> 		/*
> 		 * SNC: OK to read events on any CPU sharing same L3
> 		 * cache instance.
> 		 */
> 		 if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
> 		 	return -EINVAL;
> 	}

... while I remain unsure about where "display_id" fits in.

> 
>>
>>>
>>>>
>>>>>>>>>         }
>>>>>>>>> +
>>>>>>>>> +       return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>>>>>>>> +                               struct rdt_mon_domain *d,
>>>>>>>>> +                               struct rdt_resource *r, struct rdtgroup *prgrp)
>>>>>>>>> +{
>>>>>>>>> +       struct kernfs_node *kn, *ckn;
>>>>>>>>> +       char name[32];
>>>>>>>>> +       bool do_sum;
>>>>>>>>> +       int ret;
>>>>>>>>> +
>>>>>>>>> +       do_sum = r->mon_scope != r->mon_display_scope;
>>>>>>>>> +       sprintf(name, "mon_%s_%02d", r->name, d->display_id);
>>>>>>>>> +       kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
>>>>>>>>> +       if (!kn) {
>>>>>>>>> +               /* create the directory */
>>>>>>>>> +               kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>>>>>>>>> +               if (IS_ERR(kn))
>>>>>>>>> +                       return PTR_ERR(kn);
>>>>>>>>> +
>>>>>>>>> +               ret = rdtgroup_kn_set_ugid(kn);
>>>>>>>>> +               if (ret)
>>>>>>>>> +                       goto out_destroy;
>>>>>>>>> +               ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
>>>>>>>>
>>>>>>>> This does not look right. If I understand correctly the private data
>>>>>>>> of these event files will have whichever mon domain came up first as
>>>>>>>> its domain id. That seems completely arbitrary and does not reflect
>>>>>>>> accurate state for this file. Since "do_sum" is essentially a "flag"
>>>>>>>> on how this file can be treated, can its "dom_id" not rather be
>>>>>>>> the "monitor scope domain id"? Could that not help to eliminate
>>>>>>>
>>>>>>> You are correct that this should be the "monitor scope domain id" rather
>>>>>>> than the first SNC domain that appears. I'll change to use that. I don't
>>>>>>> think it helps in removing the per-domain display_id.
>>>>>>
>>>>>> Wouldn't the file metadata then be the "display_id"?
>>>>>
>>>>> Yes. The metadata is the display_id for files that need to sum across
>>>>> SNC nodes, but the domain id for ones where no summation is needed.
>>>>
>>>> Right ... and there is a "sum" flag to tell which is which?
>>>
>>> Yes. sum==0 means the domid field is the one and only domain to
>>> report for this resctrl monitor file. sum==1 means the domid field is
>>> the display_id - all domains with this display_id must be summed to
>>> provide the result to present to the user.
>>>
>>> I've tried to capture that in the kerneldoc comment for struct mon_event.
>>> Here's what I'm planning to include in v18 (Outlook will probably mangle
>>> the formatting ... just imagine that the text lines up neatly):
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>> index 49440f194253..3411557d761a 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>> @@ -132,14 +132,19 @@ struct mon_evt {
>>>   *                     as kernfs private data
>>>   * @rid:               Resource id associated with the event file
>>>   * @evtid:             Event id associated with the event file
>>> - * @domid:             The domain to which the event file belongs
>>> + * @sum:               Set when event must be summed across multiple
>>> + *                     domains.
>>> + * @domid:             When @sum is zero this is the domain to which
>>> + *                     the event file belongs. When sum is one this
>>> + *                     is the display_id of all domains to be summed
>>
>> Here is where I would like to understand why it cannot just be
>> "When sum is one this is the domain id of the scope at which (for which?)
>> the events must be summed." Although, you already mentioned this will be
>> clear in next posting.
>>
>>>   * @u:                 Name of the bit fields struct
>>>   */
>>>  union mon_data_bits {
>>>         void *priv;
>>>         struct {
>>>                 unsigned int rid                : 10;
>>> -               enum resctrl_event_id evtid     : 8;
>>> +               enum resctrl_event_id evtid     : 7;
>>> +               unsigned int sum                : 1;
>>>                 unsigned int domid              : 14;
>>>         } u;
>>>  };
>>>
>>> -Tony
> 
> Maybe an example might help. Assume an SNC system with two sockets,
> three SNC nodes per socket, only supporting monitoring. The only domain
> list created by resctrl is the mon_domains list on the RDT_RESOURCE_L3
> resource. And it looks like this (with "display_id" abbreviated to
> "dspl" to keep the picture small):
> 
> 
>        <------ SNC NODES ON SOCKET 0 ----->   <------ SNC NODES ON SOCKET 1 ------>
> ----> +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
>       | id = 0   | | id = 1   | | id = 2   | | id = 3   | | id = 4   | | id = 5   |
>       |          | |          | |          | |          | |          | |          |
>       | dspl = 0 | | dspl = 0 | | dspl = 0 | | dspl = 1 | | dspl = 1 | | dspl = 1 |
>       |          | |          | |          | |          | |          | |          |
>       +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
> 
> Reading the per-SNC node monitor values looks just the same as the
> non-SNC case. The struct rmid_read passed across the smp_call*() has
> the resource, domain, event, and reading the counters is essentially
> unchanged.
> 
> Reading a file to sum event counts for SNC nodes on socket 1 needs to
> find each of the "struct rdt_mon_domain" that are part of socket 1.
> I'm doing that with meta data in the file that says sum=1 (need to add
> up something) and domid=1 (the things to be added are those with
> display_id = 1). So the code reads:
> 
> 	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> 		if (d->display_id == rr->d->display_id) {
> 			... call stuff to read and sum for domain "d"
> 		}
> 	}
> 
> The display_id is "the domain id of the scope at which (for which?)
> the events must be summed." in your text above.

My point remains that it is not clear (to me) why it is required to
carry the display_id around.

 	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
		/* determine @id of @d at rr->r->mon_display_scope */
 		if (id == domid) {
 			... call stuff to read and sum for domain "d"
 		}
 	}
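
To make that alternative concrete, a user-space sketch (under assumptions,
not the resctrl implementation) that derives the display-scope id inside
the loop instead of storing it per domain. struct mon_domain,
l3_id_from_cpu() and sum_at_display_scope() are hypothetical, and the
12-CPU, two-L3 topology is invented for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a monitor domain; no stored display_id. */
struct mon_domain {
	int first_cpu;		/* any CPU that belongs to the domain */
	uint64_t count;		/* made-up event count */
};

/* Toy lookup (assumption): 12 CPUs, six per L3; negative errno on error. */
static int l3_id_from_cpu(int cpu)
{
	if (cpu < 0 || cpu >= 12)
		return -EINVAL;
	return cpu / 6;
}

/*
 * Sum the counts of all domains whose display-scope id equals @domid.
 * The id is looked up on every iteration, so the error check has to
 * live in the loop as well. @total is only written on success.
 */
static int sum_at_display_scope(const struct mon_domain *doms, size_t n,
				int domid, uint64_t *total)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		int id = l3_id_from_cpu(doms[i].first_cpu);

		if (id < 0)
			return id;
		if (id == domid)
			sum += doms[i].count;
	}
	*total = sum;
	return 0;
}
```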

Reinette


* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-15 16:47                   ` Reinette Chatre
@ 2024-05-15 17:23                     ` Tony Luck
  2024-05-15 18:48                       ` Reinette Chatre
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2024-05-15 17:23 UTC
  To: Reinette Chatre
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

On Wed, May 15, 2024 at 09:47:28AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/14/2024 2:53 PM, Tony Luck wrote:
> > On Tue, May 14, 2024 at 01:30:05PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 5/14/2024 11:26 AM, Luck, Tony wrote:
> >>>> On 5/13/2024 5:21 PM, Tony Luck wrote:
> >>>>> On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
> >>>>>> On 5/13/2024 10:05 AM, Tony Luck wrote:
> >>>>>>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
> >>>>>>> Thanks for the review. Detailed comments below. But overall I'm
> >>>>>>> going to split patch 7 into a bunch of smaller changes, each with
> >>>>>>> a better commit message.
> >>>>>>>
> >>>>>>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
> >>>>>>>>
> >>>>>>>> (Could you please start the changelog with some context?)
> >>>>>>>>
> >>>>>>>>> Add a field to the rdt_resource structure to track whether monitoring
> >>>>>>>>> resources are tracked by hardware at a different scope (NODE) from
> >>>>>>>>> the legacy L3 scope.
> >>>>>>>>
> >>>>>>>> This seems to describe @mon_scope that was introduced in patch #3?
> >>>>>>>
> >>>>>>> Not really. Patch #3 made the change so that control and monitor
> >>>>>>> functions can have different scope. That's still needed as with SNC
> >>>>>>> enabled the underlying data collection is at the node level for
> >>>>>>> monitoring, while control stays at the L3 cache scope.
> >>>>>>>
> >>>>>>> This new field describes the legacy scope of monitoring, so that
> >>>>>>> resctrl can provide correctly scoped monitor files for legacy
> >>>>>>> applications that aren't aware of SNC. So I'm using this both
> >>>>>>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
> >>>>>>> or disabled (when they are the same).
> >>>>>>
> >>>>>> This seems to enforce the idea that these new additions aim to be
> >>>>>> generic on the surface but the only goal is to support SNC.
> >>>>>
> >>>>> If you have some more ideas on how to make this more generic and
> >>>>> less SNC specific I'm all ears.
> >>>>
> >>>> It may not end up being totally generic. It should not pretend to be
> >>>> when it is not. It makes the flows difficult to follow when there are
> >>>> these unexpected checks/quirks in what claims to be core code.
> >>>
> >>> Do you want some sort of warning comments in pieces of code
> >>> that are SNC specific?
> >>
> >> I cannot think now where warnings will be appropriate but if you
> >> find instances then please do. To start the quirks can at least be
> >> documented. For example, "Only user of <feature> is SNC, which does
> >> not require <custom> so simplify by <describe shortcut> ..."
> > 
> > The main spot that triggered this line of discussion was changing the
> > sanity check that operations to read monitors is being done from a
> > CPU within the right domain. I've added a short comment on the new
> > check:
> > 
> > -       if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> > +       /* Event counts can only be read from a CPU on the same L3 cache */
> > +       if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
> >                 return -EINVAL;
> > 
> > But my change embeds the assumption that monitor events are L3 scoped.
> > 
> > Should it be something like this (to keep the non-SNC case generic):
> > 
> > 	if (r->mon_scope == r->mon_display_scope) {
> > 		if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> > 			return -EINVAL;
> 
> Yes, keeping this check looks good to me ...
> 
> > 	} else {
> > 		/*
> > 		 * SNC: OK to read events on any CPU sharing same L3
> > 		 * cache instance.
> > 		 */
> > 		 if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
> > 		 	return -EINVAL;
> > 	}
> 
> ... while I remain unsure about where "display_id" fits in.

See below.

> > 
> >>
> >>>
> >>>>
> >>>>>>>>>         }
> >>>>>>>>> +
> >>>>>>>>> +       return 0;
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> >>>>>>>>> +                               struct rdt_mon_domain *d,
> >>>>>>>>> +                               struct rdt_resource *r, struct rdtgroup *prgrp)
> >>>>>>>>> +{
> >>>>>>>>> +       struct kernfs_node *kn, *ckn;
> >>>>>>>>> +       char name[32];
> >>>>>>>>> +       bool do_sum;
> >>>>>>>>> +       int ret;
> >>>>>>>>> +
> >>>>>>>>> +       do_sum = r->mon_scope != r->mon_display_scope;
> >>>>>>>>> +       sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> >>>>>>>>> +       kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> >>>>>>>>> +       if (!kn) {
> >>>>>>>>> +               /* create the directory */
> >>>>>>>>> +               kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> >>>>>>>>> +               if (IS_ERR(kn))
> >>>>>>>>> +                       return PTR_ERR(kn);
> >>>>>>>>> +
> >>>>>>>>> +               ret = rdtgroup_kn_set_ugid(kn);
> >>>>>>>>> +               if (ret)
> >>>>>>>>> +                       goto out_destroy;
> >>>>>>>>> +               ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
> >>>>>>>>
> >>>>>>>> This does not look right. If I understand correctly the private data
> >>>>>>>> of these event files will have whichever mon domain came up first as
> >>>>>>>> its domain id. That seems completely arbitrary and does not reflect
> >>>>>>>> accurate state for this file. Since "do_sum" is essentially a "flag"
> >>>>>>>> on how this file can be treated, can its "dom_id" not rather be
> >>>>>>>> the "monitor scope domain id"? Could that not help to eliminate
> >>>>>>>
> >>>>>>> You are correct that this should be the "monitor scope domain id" rather
> >>>>>>> than the first SNC domain that appears. I'll change to use that. I don't
> >>>>>>> think it helps in removing the per-domain display_id.
> >>>>>>
> >>>>>> Wouldn't the file metadata then be the "display_id"?
> >>>>>
> >>>>> Yes. The metadata is the display_id for files that need to sum across
> >>>>> SNC nodes, but the domain id for ones where no summation is needed.
> >>>>
> >>>> Right ... and there is a "sum" flag to tell which is which?
> >>>
> >>> Yes. sum==0 means the domid field is the one and only domain to
> >>> report for this resctrl monitor file. sum==1 means the domid field is
> >>> the display_id - all domains with this display_id must be summed to
> >>> provide the result to present to the user.
> >>>
> >>> I've tried to capture that in the kerneldoc comment for struct mon_event.
> >>> Here's what I'm planning to include in v18 (Outlook will probably mangle
> >>> the formatting ... just imagine that the text lines up neatly):
> >>>
> >>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> >>> index 49440f194253..3411557d761a 100644
> >>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> >>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> >>> @@ -132,14 +132,19 @@ struct mon_evt {
> >>>   *                     as kernfs private data
> >>>   * @rid:               Resource id associated with the event file
> >>>   * @evtid:             Event id associated with the event file
> >>> - * @domid:             The domain to which the event file belongs
> >>> + * @sum:               Set when event must be summed across multiple
> >>> + *                     domains.
> >>> + * @domid:             When @sum is zero this is the domain to which
> >>> + *                     the event file belongs. When sum is one this
> >>> + *                     is the display_id of all domains to be summed
> >>
> >> Here is where I would like to understand why it cannot just be
> >> "When sum is one this is the domain id of the scope at which (for which?)
> >> the events must be summed." Although, you already mentioned this will be
> >> clear in next posting.
> >>
> >>>   * @u:                 Name of the bit fields struct
> >>>   */
> >>>  union mon_data_bits {
> >>>         void *priv;
> >>>         struct {
> >>>                 unsigned int rid                : 10;
> >>> -               enum resctrl_event_id evtid     : 8;
> >>> +               enum resctrl_event_id evtid     : 7;
> >>> +               unsigned int sum                : 1;
> >>>                 unsigned int domid              : 14;
> >>>         } u;
> >>>  };
> >>>
> >>> -Tony
> > 
> > Maybe an example might help. Assume an SNC system with two sockets,
> > three SNC nodes per socket, only supporting monitoring. The only domain
> > list created by resctrl is the mon_domains list on the RDT_RESOURCE_L3
> > resource. And it looks like this (with "display_id" abbreviated to
> > "dspl" to keep the picture small):
> > 
> > 
> >        <------ SNC NODES ON SOCKET 0 ----->   <------ SNC NODES ON SOCKET 1 ------>
> > ----> +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
> >       | id = 0   | | id = 1   | | id = 2   | | id = 3   | | id = 4   | | id = 5   |
> >       |          | |          | |          | |          | |          | |          |
> >       | dspl = 0 | | dspl = 0 | | dspl = 0 | | dspl = 1 | | dspl = 1 | | dspl = 1 |
> >       |          | |          | |          | |          | |          | |          |
> >       +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
> > 
> > Reading the per-SNC node monitor values looks just the same as the
> > non-SNC case. The struct rmid_read passed across the smp_call*() has
> > the resource, domain, event, and reading the counters is essentially
> > unchanged.
> > 
> > Reading a file to sum event counts for SNC nodes on socket 1 needs to
> > find each of the "struct rdt_mon_domain" that are part of socket 1.
> > I'm doing that with meta data in the file that says sum=1 (need to add
> > up something) and domid=1 (the things to be added are those with
> > display_id = 1). So the code reads:
> > 
> > 	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> > 		if (d->display_id == rr->d->display_id) {
> > 			... call stuff to read and sum for domain "d"
> > 		}
> > 	}
> > 
> > The display_id is "the domain id of the scope at which (for which?)
> > the events must be summed." in your text above.
> 
> My point remains that it is not clear (to me) why it is required to
> carry the display_id around.
> 
>  	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> 		/* determine @id of @d at rr->r->mon_display_scope */
>  		if (id == domid) {
>  			... call stuff to read and sum for domain "d"
>  		}
>  	}

That "determine @id of @d at rr->r->mon_display_scope" is:

	display_id = get_domain_id_from_scope(cpumask_first(&rr->d->hdr.cpu_mask), rr->r->mon_display_scope);
	if (display_id < 0) {
		/* take some error action */
	}

So it certainly isn't *required* to carry display_id around. But doing
so makes the code simpler. I could bury the long line into a helper
macro/function. But I can't bury the error check.

I'd also need to change get_domain_id_from_scope() from "static" to
global so it can be used in other files besides core.c

Note that there are several places where I need to use display_id. I
could compute it at run time in each place, but it seems so much easier
to do it once at domain creation time.

> 
> Reinette

-Tony
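
For readers weighing the trade-off in this exchange, here is a minimal
userspace sketch (not kernel code) of the two ways to select the domains
to sum. The names mon_domain_model, l3_id_of_cpu(), sum_cached() and
sum_queried() are hypothetical stand-ins. l3_id_of_cpu() only imitates the
role that get_domain_id_from_scope() plays in the discussion above, and it
assumes CPUs are numbered contiguously within each L3 cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rdt_mon_domain; not the kernel type. */
struct mon_domain_model {
	int id;          /* domain id at SNC node scope */
	int first_cpu;   /* any one CPU that belongs to this domain */
	int display_id;  /* id at the display (L3) scope, cached at creation */
	uint64_t count;  /* pretend event count held by this domain */
};

/*
 * Hypothetical lookup of the L3 id for a CPU. Returns a negative value
 * on failure, which is the error path the caller has to handle.
 */
static int l3_id_of_cpu(int cpu, int cpus_per_l3)
{
	if (cpu < 0 || cpus_per_l3 <= 0)
		return -1;
	return cpu / cpus_per_l3;
}

/* Approach A: compare against the display_id stored in each domain. */
static uint64_t sum_cached(const struct mon_domain_model *doms, size_t n,
			   int domid)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		if (doms[i].display_id == domid)
			sum += doms[i].count;
	return sum;
}

/*
 * Approach B: derive the display-scope id at read time.
 * Returns 0 on success, -1 if any lookup fails.
 */
static int sum_queried(const struct mon_domain_model *doms, size_t n,
		       int cpus_per_l3, int domid, uint64_t *out)
{
	uint64_t sum = 0;

	assert(out != NULL);
	for (size_t i = 0; i < n; i++) {
		int id = l3_id_of_cpu(doms[i].first_cpu, cpus_per_l3);

		if (id < 0)
			return -1;
		if (id == domid)
			sum += doms[i].count;
	}
	*out = sum;
	return 0;
}
```

Both functions select the same domains for a given domid. The difference
is only where the lookup, and the error check that goes with it, lives:
once at domain creation for sum_cached(), or at every read for
sum_queried().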


* Re: [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring
  2024-05-15 17:23                     ` Tony Luck
@ 2024-05-15 18:48                       ` Reinette Chatre
  0 siblings, 0 replies; 26+ messages in thread
From: Reinette Chatre @ 2024-05-15 18:48 UTC
  To: Tony Luck
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 5/15/2024 10:23 AM, Tony Luck wrote:
> On Wed, May 15, 2024 at 09:47:28AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/14/2024 2:53 PM, Tony Luck wrote:
>>> On Tue, May 14, 2024 at 01:30:05PM -0700, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>
>>>> On 5/14/2024 11:26 AM, Luck, Tony wrote:
>>>>>> On 5/13/2024 5:21 PM, Tony Luck wrote:
>>>>>>> On Mon, May 13, 2024 at 11:53:17AM -0700, Reinette Chatre wrote:
>>>>>>>> On 5/13/2024 10:05 AM, Tony Luck wrote:
>>>>>>>>> On Fri, May 10, 2024 at 02:24:13PM -0700, Reinette Chatre wrote:
>>>>>>>>> Thanks for the review. Detailed comments below. But overall I'm
>>>>>>>>> going to split patch 7 into a bunch of smaller changes, each with
>>>>>>>>> a better commit message.
>>>>>>>>>
>>>>>>>>>> On 5/3/2024 1:33 PM, Tony Luck wrote:
>>>>>>>>>>
>>>>>>>>>> (Could you please start the changelog with some context?)
>>>>>>>>>>
>>>>>>>>>>> Add a field to the rdt_resource structure to track whether monitoring
>>>>>>>>>>> resources are tracked by hardware at a different scope (NODE) from
>>>>>>>>>>> the legacy L3 scope.
>>>>>>>>>>
>>>>>>>>>> This seems to describe @mon_scope that was introduced in patch #3?
>>>>>>>>>
>>>>>>>>> Not really. Patch #3 made the change so that control and monitor
>>>>>>>>> functions can have different scope. That's still needed as with SNC
>>>>>>>>> enabled the underlying data collection is at the node level for
>>>>>>>>> monitoring, while control stays at the L3 cache scope.
>>>>>>>>>
>>>>>>>>> This new field describes the legacy scope of monitoring, so that
>>>>>>>>> resctrl can provide correctly scoped monitor files for legacy
>>>>>>>>> applications that aren't aware of SNC. So I'm using this both
>>>>>>>>> to indicate when SNC is enabled (with mon_scope != mon_display_scope)
>>>>>>>>> or disabled (when they are the same).
>>>>>>>>
>>>>>>>> This seems to enforce the idea that these new additions aim to be
>>>>>>>> generic on the surface but the only goal is to support SNC.
>>>>>>>
>>>>>>> If you have some more ideas on how to make this more generic and
>>>>>>> less SNC specific I'm all ears.
>>>>>>
>>>>>> It may not end up being totally generic. It should not pretend to be
>>>>>> when it is not. It makes the flows difficult to follow when there are
>>>>>> these unexpected checks/quirks in what claims to be core code.
>>>>>
>>>>> Do you want some sort of warning comments in pieces of code
>>>>> that are SNC specific?
>>>>
>>>> I cannot think now where warnings will be appropriate but if you
>>>> find instances then please do. To start the quirks can at least be
>>>> documented. For example, "Only user of <feature> is SNC, which does
>>>> not require <custom> so simplify by <describe shortcut> ..."
>>>
>>> The main spot that triggered this line of discussion was changing the
>>> sanity check that operations to read monitors are being done from a
>>> CPU within the right domain. I've added a short comment on the new
>>> check:
>>>
>>> -       if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>>> +       /* Event counts can only be read from a CPU on the same L3 cache */
>>> +       if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
>>>                 return -EINVAL;
>>>
>>> But my change embeds the assumption that monitor events are L3 scoped.
>>>
>>> Should it be something like this (to keep the non-SNC case generic):
>>>
>>> 	if (r->mon_scope == r->mon_display_scope) {
>>> 		if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>>> 			return -EINVAL;
>>
>> Yes, keeping this check looks good to me ...
>>
>>> 	} else {
>>> 		/*
>>> 		 * SNC: OK to read events on any CPU sharing same L3
>>> 		 * cache instance.
>>> 		 */
>>> 		 if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(), r->mon_display_scope))
>>> 		 	return -EINVAL;
>>> 	}
>>
>> ... while I remain unsure about where "display_id" fits in.
> 
> See below.
> 
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>>>>>         }
>>>>>>>>>>> +
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>>>>>>>>>> +                               struct rdt_mon_domain *d,
>>>>>>>>>>> +                               struct rdt_resource *r, struct rdtgroup *prgrp)
>>>>>>>>>>> +{
>>>>>>>>>>> +       struct kernfs_node *kn, *ckn;
>>>>>>>>>>> +       char name[32];
>>>>>>>>>>> +       bool do_sum;
>>>>>>>>>>> +       int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +       do_sum = r->mon_scope != r->mon_display_scope;
>>>>>>>>>>> +       sprintf(name, "mon_%s_%02d", r->name, d->display_id);
>>>>>>>>>>> +       kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
>>>>>>>>>>> +       if (!kn) {
>>>>>>>>>>> +               /* create the directory */
>>>>>>>>>>> +               kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
>>>>>>>>>>> +               if (IS_ERR(kn))
>>>>>>>>>>> +                       return PTR_ERR(kn);
>>>>>>>>>>> +
>>>>>>>>>>> +               ret = rdtgroup_kn_set_ugid(kn);
>>>>>>>>>>> +               if (ret)
>>>>>>>>>>> +                       goto out_destroy;
>>>>>>>>>>> +               ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
>>>>>>>>>>
>>>>>>>>>> This does not look right. If I understand correctly the private data
>>>>>>>>>> of these event files will have whichever mon domain came up first as
>>>>>>>>>> its domain id. That seems completely arbitrary and does not reflect
>>>>>>>>>> accurate state for this file. Since "do_sum" is essentially a "flag"
>>>>>>>>>> on how this file can be treated, can its "dom_id" not rather be
>>>>>>>>>> the "monitor scope domain id"? Could that not help to eliminate
>>>>>>>>>
>>>>>>>>> You are correct that this should be the "monitor scope domain id" rather
>>>>>>>>> than the first SNC domain that appears. I'll change to use that. I don't
>>>>>>>>> think it helps in removing the per-domain display_id.
>>>>>>>>
>>>>>>>> Wouldn't the file metadata then be the "display_id"?
>>>>>>>
>>>>>>> Yes. The metadata is the display_id for files that need to sum across
>>>>>>> SNC nodes, but the domain id for ones where no summation is needed.
>>>>>>
>>>>>> Right ... and there is a "sum" flag to tell which is which?
>>>>>
>>>>> Yes. sum==0 means the domid field is the one and only domain to
>>>>> report for this resctrl monitor file. sum==1 means the domid field is
>>>>> the display_id - all domains with this display_id must be summed to
>>>>> provide the result to present to the user.
>>>>>
>>>>> I've tried to capture that in the kerneldoc comment for struct mon_event.
>>>>> Here's what I'm planning to include in v18 (Outlook will probably mangle
>>>>> the formatting ... just imagine that the text lines up neatly):
>>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>>>> index 49440f194253..3411557d761a 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>>>> @@ -132,14 +132,19 @@ struct mon_evt {
>>>>>   *                     as kernfs private data
>>>>>   * @rid:               Resource id associated with the event file
>>>>>   * @evtid:             Event id associated with the event file
>>>>> - * @domid:             The domain to which the event file belongs
>>>>> + * @sum:               Set when event must be summed across multiple
>>>>> + *                     domains.
>>>>> + * @domid:             When @sum is zero this is the domain to which
>>>>> + *                     the event file belongs. When sum is one this
>>>>> + *                     is the display_id of all domains to be summed
>>>>
>>>> Here is where I would like to understand why it cannot just be
>>>> "When sum is one this is the domain id of the scope at which (for which?)
>>>> the events must be summed." Although, you already mentioned this will be
>>>> clear in next posting.
>>>>
>>>>>   * @u:                 Name of the bit fields struct
>>>>>   */
>>>>>  union mon_data_bits {
>>>>>         void *priv;
>>>>>         struct {
>>>>>                 unsigned int rid                : 10;
>>>>> -               enum resctrl_event_id evtid     : 8;
>>>>> +               enum resctrl_event_id evtid     : 7;
>>>>> +               unsigned int sum                : 1;
>>>>>                 unsigned int domid              : 14;
>>>>>         } u;
>>>>>  };
>>>>>
>>>>> -Tony
>>>
>>> Maybe an example might help. Assume an SNC system with two sockets,
>>> three SNC nodes per socket, only supporting monitoring. The only domain
>>> list created by resctrl is the mon_domains list on the RDT_RESOURCE_L3
>>> resource. And it looks like this (with "display_id" abbreviated to
>>> "dspl" to keep the picture small):
>>>
>>>
>>>        <------ SNC NODES ON SOCKET 0 ----->   <------ SNC NODES ON SOCKET 1 ------>
>>> ----> +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
>>>       | id = 0   | | id = 1   | | id = 2   | | id = 3   | | id = 4   | | id = 5   |
>>>       |          | |          | |          | |          | |          | |          |
>>>       | dspl = 0 | | dspl = 0 | | dspl = 0 | | dspl = 1 | | dspl = 1 | | dspl = 1 |
>>>       |          | |          | |          | |          | |          | |          |
>>>       +----------+ +----------+ +----------+ +----------+ +----------+ +----------+
>>>
>>> Reading the per-SNC node monitor values looks just the same as the
>>> non-SNC case. The struct rmid_read passed across the smp_call*() has
>>> the resource, domain, event, and reading the counters is essentially
>>> unchanged.
>>>
>>> Reading a file to sum event counts for SNC nodes on socket 1 needs to
>>> find each of the "struct rdt_mon_domain" that are part of socket 1.
>>> I'm doing that with meta data in the file that says sum=1 (need to add
>>> up something) and domid=1 (the things to be added are those with
>>> display_id = 1). So the code reads:
>>>
>>> 	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
>>> 		if (d->display_id == rr->d->display_id) {
>>> 			... call stuff to read and sum for domain "d"
>>> 		}
>>> 	}
>>>
>>> The display_id is "the domain id of the scope at which (for which?)
>>> the events must be summed." in your text above.
>>
>> My point remains that it is not clear (to me) why it is required to
>> carry the display_id around.
>>
>>  	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
>> 		/* determine @id of @d at rr->r->mon_display_scope */
>>  		if (id == domid) {
>>  			... call stuff to read and sum for domain "d"
>>  		}
>>  	}
> 
> That "determine @id of @d at rr->r->mon_display_scope" is:
> 
> 	display_id = get_domain_id_from_scope(cpumask_first(rr->d->hdr.cpu_mask), rr->r->mon_display_scope);
> 	if (display_id < 0) {
> 		take some error action
> 	}
> 
> So it certainly isn't *required* to carry display_id around. But doing
> so makes the code simpler. I could bury the long line into a helper

Is "if (d->display_id == rr->d->display_id)" really "simpler"? It is
shorter I agree, but I would argue that it is much harder to understand
what the code is trying to do. The reader needs to understand what
"display_id" means, how the state is maintained, how
the values are propagated to this call site, etc. With a query like above
it should be obvious what the code does.

> macro/function. But I can't bury the error check.

If this is an error then it is a kernel bug and should be handled
appropriately.

> 
> I'd also need to change get_domain_id_from_scope() from "static" to
> global so it can be used in other files besides core.c

Is this a problem?

> Note that there are several places where I need to use display_id,
> computing it at run time in each place, but it seems so much easier to
> do it once at domain creation time.

Easier to code perhaps but I do not see how it is "easy" to understand
and maintain.

I think we have now repeated the same conversation twice. Previously you
promised that your design would be clear to me in the next version and
I have already stated twice that I am ok with that.

Reinette
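
The sum/domid encoding quoted above can be tried out in userspace. The
sketch below is an illustration only: mon_data_bits_model,
mon_priv_encode() and mon_priv_decode() are hypothetical names, and evtid
is modelled as a plain unsigned int rather than enum resctrl_event_id. The
field widths are the ones in the quoted diff: 10 + 7 + 1 + 14 = 32 bits,
so taking one bit from evtid for the sum flag keeps the total unchanged.
The sketch assumes a pointer is at least 32 bits wide.

```c
#include <assert.h>
#include <string.h>

/* Userspace model of the quoted union; not the kernel definition. */
union mon_data_bits_model {
	void *priv;
	struct {
		unsigned int rid   : 10;
		unsigned int evtid : 7;
		unsigned int sum   : 1;
		unsigned int domid : 14;
	} u;
};

/* Pack the four fields into a pointer-sized private data value. */
static void *mon_priv_encode(unsigned int rid, unsigned int evtid,
			     unsigned int sum, unsigned int domid)
{
	union mon_data_bits_model m;

	/* Reject values that would be truncated by the bit-field widths. */
	assert(rid < (1u << 10));
	assert(evtid < (1u << 7));
	assert(sum <= 1);
	assert(domid < (1u << 14));

	/* Clear every byte first so unused pointer bits are well defined. */
	memset(&m, 0, sizeof(m));
	m.u.rid = rid;
	m.u.evtid = evtid;
	m.u.sum = sum;
	m.u.domid = domid;
	return m.priv;
}

/* Recover the fields from a private data value. */
static union mon_data_bits_model mon_priv_decode(void *priv)
{
	union mon_data_bits_model m;

	memset(&m, 0, sizeof(m));
	m.priv = priv;
	return m;
}
```

With sum == 0 a reader of the decoded value treats domid as the single
domain to report. With sum == 1 it treats domid as the id shared by every
domain whose counts must be added together.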


end of thread, other threads:[~2024-05-15 18:48 UTC | newest]

Thread overview: 26+ messages
2024-05-03 20:33 [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2024-05-03 20:33 ` [PATCH v17 1/9] x86/resctrl: Prepare for new domain scope Tony Luck
2024-05-03 20:33 ` [PATCH v17 2/9] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2024-05-03 20:33 ` [PATCH v17 3/9] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2024-05-03 20:33 ` [PATCH v17 4/9] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2024-05-03 20:33 ` [PATCH v17 5/9] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2024-05-03 20:33 ` [PATCH v17 6/9] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2024-05-03 20:33 ` [PATCH v17 7/9] x86/resctrl: Add new monitor files for Sub-NUMA cluster (SNC) monitoring Tony Luck
2024-05-10 21:24   ` Reinette Chatre
2024-05-13 17:05     ` Tony Luck
2024-05-13 18:53       ` Reinette Chatre
2024-05-14  0:21         ` Tony Luck
2024-05-14 15:08           ` Reinette Chatre
2024-05-14 18:26             ` Luck, Tony
2024-05-14 20:30               ` Reinette Chatre
2024-05-14 21:53                 ` Tony Luck
2024-05-15 16:47                   ` Reinette Chatre
2024-05-15 17:23                     ` Tony Luck
2024-05-15 18:48                       ` Reinette Chatre
2024-05-03 20:33 ` [PATCH v17 8/9] x86/resctrl: Sub NUMA Cluster detection and enable Tony Luck
2024-05-10 21:24   ` Reinette Chatre
2024-05-13 17:17     ` Tony Luck
2024-05-13 18:53       ` Reinette Chatre
2024-05-14  0:28         ` Tony Luck
2024-05-03 20:33 ` [PATCH v17 9/9] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2024-05-14 15:02 ` [PATCH v17 0/9] Add support for Sub-NUMA cluster (SNC) systems Maciej Wieczor-Retman
