* [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
@ 2024-01-26  8:54 Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH Xiong Zhang
                   ` (42 more replies)
  0 siblings, 43 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

Background
===
KVM has supported vPMU for years in the form of an emulated vPMU: KVM
presents a virtual PMU to the guest, where guest accesses to the PMU are
trapped and converted into perf events. These perf events are scheduled
along with other perf events at the host level, sharing the HW resources.
In the emulated vPMU design, KVM is a client of the perf subsystem and has
no control over the HW PMU resources at the host level.

The emulated vPMU has the following drawbacks:
1. Poor performance. Guest PMU MSR accesses cause VM-exits, and some of
them require expensive host perf API calls. Once the guest PMU starts
multiplexing its counters, KVM wastes the majority of its time
re-creating/starting/releasing KVM perf events, and guest perf
performance drops dramatically.
2. A guest perf event's backing host event may be swapped out or disabled
silently. The host perf scheduler treats KVM perf events and other host
perf events equally, so they contend for HW resources, and KVM perf
events become inactive when all HW resources are owned by host perf
events. KVM cannot report this backend error to the guest; such a silent
error is a red flag for using the vPMU in production.
3. Hard to add new vPMU features. For each new vPMU feature, KVM needs to
emulate new MSRs. This involves both the perf and KVM subsystems, and in
most cases a vendor-specific perf API has to be added, which is hard to
get accepted.

The community has discussed these drawbacks for years and has reconsidered
the current emulated vPMU [1]. In the latest discussion [2], both the perf
and KVM x86 communities agreed to try a passthrough vPMU, so we worked
together with Google engineers to develop this RFC. Currently it is
implemented for Intel CPUs only; other architectures can be added later.
The complete RFC source code can be found at:
https://github.com/googleprodkernel/linux-kvm/tree/passthrough-pmu-rfc

With the passthrough vPMU, the VM directly accesses all HW PMU general
purpose counters and some of the fixed counters, so the VM has full
visibility of the x86 PMU HW. All host perf events using the x86 PMU are
stopped while the VM is running and are restarted at VM-exit. This has the
following benefits:
1. Better performance: guest accesses to the x86 PMU MSRs and rdpmc cause
no VM-exits and no host perf API calls.
2. Guest perf events exclusively own the HW resources while the guest is
running. Host perf events are stopped and give up the HW resources at
VM-entry and resume running after VM-exit.
3. Easier to enable new PMU features. KVM just needs to pass through the
new MSRs and save/restore them at VM-exit and VM-entry; no new perf API
is needed.

Note that the passthrough vPMU does satisfy the enterprise-level
requirement of secure PMU usage by intercepting guest accesses to all
event selectors. The key problem with the passthrough vPMU, however, is
that host users lose the ability to profile the guest. Users who want to
profile a guest from the host should not enable passthrough vPMU mode.
Another problem is that the NMI watchdog is no longer fully functional.
Please see the design opens for more details.

Implementation
===
To pass the host x86 PMU through to the guest, a PMU context switch is
mandatory. This RFC implements the PMU context switch at the VM-entry/exit
boundary.

At VM-entry:
1. KVM calls the perf-provided perf_guest_enter() interface; perf stops
all perf events that use the host x86 PMU.
2. KVM calls the perf-provided perf_guest_switch_to_kvm_pmi_vector()
interface; perf switches the PMI vector to a separate kvm_pmi_vector, so
that KVM handles PMIs from this point on and injects them into the guest.
3. KVM restores the guest PMU context.

To support the KVM PMU event filter feature for security, the EVENT_SELECT
and FIXED_CTR_CTRL MSRs are intercepted; all other MSRs defined in the
Architectural Performance Monitoring spec, as well as rdpmc, are passed
through, so the guest can access them without VM-exits while it is
running. When a guest counter overflows, a HW PMI is triggered on the
dedicated kvm_pmi_vector and KVM injects a virtual PMI into the guest
through the virtual local APIC. A rough sketch of how this interception
split could be configured is shown below.
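
This is illustrative only, not the RFC's actual code; it assumes the
existing VMX MSR-bitmap helper vmx_set_intercept_for_msr() and the MSR
definitions from msr-index.h:

  /* Keep selectors intercepted for the event filter, pass counters through. */
  static void example_config_pmu_msr_intercepts(struct kvm_vcpu *vcpu,
                                                int nr_gp_counters)
  {
          int i;

          for (i = 0; i < nr_gp_counters; i++) {
                  /* Counters: guest RDMSR/WRMSR cause no VM-exits. */
                  vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i,
                                            MSR_TYPE_RW, false);
                  /* Event selectors: stay intercepted. */
                  vmx_set_intercept_for_msr(vcpu, MSR_P6_EVNTSEL0 + i,
                                            MSR_TYPE_RW, true);
          }
          /* Global control is passed through; FIXED_CTR_CTRL stays intercepted. */
          vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
                                    MSR_TYPE_RW, false);
          vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR_CTRL,
                                    MSR_TYPE_RW, true);
  }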

At VM-exit:
1. KVM saves and clears the guest PMU context.
2. KVM calls the perf-provided perf_guest_switch_to_host_pmi_vector()
interface; perf switches the PMI vector back to NMI, so that the host
handles PMIs from this point on.
3. KVM calls the perf-provided perf_guest_exit() interface; perf
reschedules all perf events, and the events stopped at VM-entry are
restarted here. A simplified sketch of the whole sequence follows.
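
A minimal sketch of both directions of the PMU context switch, following
the steps above (the helper names below are placeholders, not the series'
actual function names; the mask argument mirrors the guest's virtual
LVTPC mask bit, see patches 07/08):

  static void pmu_switch_to_guest(struct kvm_vcpu *vcpu)
  {
          lockdep_assert_irqs_disabled();

          /* 1. Stop all host perf events that use the x86 PMU. */
          perf_guest_enter();
          /* 2. Deliver PMIs on the dedicated KVM vector from now on. */
          perf_guest_switch_to_kvm_pmi_vector(kvm_lapic_get_lvtpc_mask(vcpu));
          /* 3. Restore the guest PMU MSRs (counters, GLOBAL_CTRL, ...). */
  }

  static void pmu_switch_to_host(struct kvm_vcpu *vcpu)
  {
          /* 1. Save and clear the guest PMU MSRs. */
          /* 2. Deliver PMIs as host NMIs again. */
          perf_guest_switch_to_host_pmi_vector();
          /* 3. Restart the host perf events stopped at VM-entry. */
          perf_guest_exit();
  }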

Design Opens
===
We hit some design opens during this POC and are seeking input from the
community:

1. Host system-wide / QEMU events handling during VM running
   At VM-entry, all host perf events that use the host x86 PMU are
   stopped. Events with attr.exclude_guest = 1 are stopped here and
   re-started after VM-exit. Events without attr.exclude_guest = 1 are
   moved to the error state and cannot recover to the active state even
   after the guest stops running. This impacts host perf a lot and
   requires host system-wide perf events to set attr.exclude_guest = 1.

   The same requirement applies to the QEMU process's perf events.

   While a VM is running, creation of system-wide or QEMU-process perf
   events without attr.exclude_guest = 1 fails with -EBUSY; see the
   sketch below for what an exclude_guest event looks like.
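
   For illustration only (not part of the RFC): a minimal sketch of
   opening a host-only, CPU-wide cycles event with attr.exclude_guest = 1
   via perf_event_open(); such an event keeps counting on the host and is
   only stopped while a passthrough guest runs on that CPU:

   #include <linux/perf_event.h>
   #include <string.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   static int open_host_only_cycles(int cpu)
   {
           struct perf_event_attr attr;

           memset(&attr, 0, sizeof(attr));
           attr.size = sizeof(attr);
           attr.type = PERF_TYPE_HARDWARE;
           attr.config = PERF_COUNT_HW_CPU_CYCLES;
           attr.exclude_guest = 1; /* do not count while a guest is running */

           /* pid = -1, cpu = cpu: CPU-wide event; no group leader, no flags. */
           return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
   }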

2. NMI watchdog
   The perf event for the NMI watchdog is a system-wide, CPU-pinned
   event, so it is also stopped while a VM is running. It did not have
   attr.exclude_guest = 1; this RFC adds it. This still means the NMI
   watchdog loses its function while a VM is running.

   Two candidates exist for replacing the perf event used by the NMI
   watchdog:
   a. The buddy hardlockup detector [3] may not be reliable enough to
      replace the perf event.
   b. The HPET-based hardlockup detector [4] is not in the upstream
      kernel.

3. Dedicated kvm_pmi_vector
   With the emulated vPMU, the host PMI handler notifies KVM to inject a
   virtual PMI into the guest when the physical PMI belongs to a guest
   counter. If the same mechanism were used for the passthrough vPMU and
   PMI skid caused a physical PMI belonging to the guest to arrive after
   VM-exit, the host PMI handler could not tell whether the PMI belongs
   to the host or to the guest.
   So this RFC uses a dedicated kvm_pmi_vector: only PMIs belonging to
   the guest use this vector, while PMIs belonging to the host still use
   the NMI vector.

   If PMI skid is ignored (it is a particular concern on AMD), the host
   NMI vector could also be used for guest PMIs. That approach is simpler
   and does not require the x86 subsystem to reserve a dedicated
   kvm_pmi_vector, and we did not hit the skid-PMI issue on modern Intel
   processors.

4. Per-VM passthrough mode configuration
   The current RFC uses a read-only KVM module parameter,
   enable_passthrough_pmu, which decides at kvm module load time whether
   the vPMU is in passthrough mode or emulated mode.
   Do we need per-VM passthrough mode configuration? With it, an admin
   could launch some non-passthrough VMs and profile them from the host,
   although the admin still could not profile all VMs once a passthrough
   VM exists. This means mixing the passthrough vPMU and the emulated
   vPMU on one platform, which is challenging to implement. As noted in
   the commit message of commit 0011, the main challenge is that the
   passthrough vPMU and the emulated vPMU expose different vPMU features,
   which ends up with two different values for
   kvm_caps.supported_perf_cap, which is initialized at module load time.
   Supporting this would require more refactoring.

Series organization
===
0000 ~ 0003: Perf extends exclude_guest to stop perf events during
             guest running.
0004 ~ 0009: Perf interface for dedicated kvm_pmi_vector.
0010 ~ 0032: all passthrough vPMU with PMU context switch at
             VM-entry/exit boundary.
0033 ~ 0037: Intercept EVENT_SELECT and FIXED_CTR_CTRL MSRs for
             KVM PMU filter feature.
0038 ~ 0039: Add emulated instructions to guest counter.
0040 ~ 0041: Fixes for passthrough vPMU live migration and Nested VM.

Performance Data
===
Measurement method:
First step: run the workload in the guest without perf and get the
            baseline workload score.
Second step: run the workload in the guest with the perf commands and get
             the workload score under perf.
Third step: the perf overhead on the workload is (first - second) / first.
Finally: compare the perf overhead of the emulated vPMU and the
         passthrough vPMU.
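
As a worked example with made-up numbers: if the workload scores 1000
without perf and 950 with perf sampling enabled, the perf overhead is
(1000 - 950) / 1000 = 5%.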

Workload: SPECint-2017
HW platform: Sapphire Rapids, 1 socket, 56 cores, SMT off
Perf commands:
a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

Guest performance overhead:
---------------------------------------------------------------------------
| Test case          | emulated vPMU | all passthrough | passthrough with |
|                    |               |                 | event filters    |
---------------------------------------------------------------------------
| basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
---------------------------------------------------------------------------
| multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
---------------------------------------------------------------------------
Note: here "passthrough with event filters" means KVM intercepts EVENT_SELECT
and FIXED_CTR_CTRL MSRs to support KVM PMU filter feature for security, this
is current RFC implementation. In order to collect EVENT_SELECT interception
impact, we modified RFC source to passthrough all the MSRs into guest, this
is "all passthrough" in above table.

Conclusion:
1. The passthrough vPMU has much better performance than the emulated
vPMU.
2. Intercepting the EVENT_SELECT and FIXED_CTR_CTRL MSRs adds about 2%
overhead.
3. Since the PMU context switch happens at VM-exit/entry, more VM-exits
mean more vPMU overhead. This impacts not only perf but also other
benchmarks with massive numbers of VM-exits, such as fio. We will
optimize this in the second phase of the passthrough vPMU work.

Remaining Work
===
1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
2. Add more PMU features like LBR, PEBS, perf metrics.
3. vPMU live migration.

Reference
===
1. https://lore.kernel.org/lkml/2db2ebbe-e552-b974-fc77-870d958465ba@gmail.com/
2. https://lkml.kernel.org/kvm/ZRRl6y1GL-7RM63x@google.com/
3. https://lwn.net/Articles/932497/
4. https://lwn.net/Articles/924927/

Dapeng Mi (4):
  x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
  KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  KVM: x86/pmu: Clear PERF_METRICS MSR for guest

Kan Liang (2):
  perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  perf: Support guest enter/exit interfaces

Mingwei Zhang (22):
  perf: core/x86: Forbid PMI handler when guest own PMU
  perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
    x86_pmu_cap
  KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and
    propage to KVM instance
  KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
  KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
  KVM: x86/pmu: Allow RDPMC pass through
  KVM: x86/pmu: Create a function prototype to disable MSR interception
  KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR
    interception
  KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with
    perf capabilities
  KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
  KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
    context
  KVM: x86/pmu: Introduce function prototype for Intel CPU to
    save/restore PMU context
  KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid
    information leakage
  KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU
    capability
  KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  KVM: x86/pmu: Make check_pmu_event_filter() an exported function
  KVM: x86/pmu: Allow writing to event selector for GP counters if event
    is allowed
  KVM: x86/pmu: Allow writing to fixed counter selector if counter is
    exposed
  KVM: x86/pmu: Introduce PMU helper to increment counter
  KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from
    non-passthrough vPMU
  KVM: nVMX: Add nested virtualization support for passthrough PMU

Xiong Zhang (13):
  perf: Set exclude_guest onto nmi_watchdog
  perf: core/x86: Add support to register a new vector for PMI handling
  KVM: x86/pmu: Register PMI handler for passthrough PMU
  perf: x86: Add function to switch PMI handler
  perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
  KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
  KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
  KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
  KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  KVM: x86/pmu: Intercept EVENT_SELECT MSR
  KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR

 arch/x86/events/core.c                   |  38 +++++
 arch/x86/events/intel/core.c             |   8 +
 arch/x86/events/perf_event.h             |   1 +
 arch/x86/include/asm/hardirq.h           |   1 +
 arch/x86/include/asm/idtentry.h          |   1 +
 arch/x86/include/asm/irq.h               |   1 +
 arch/x86/include/asm/irq_vectors.h       |   2 +-
 arch/x86/include/asm/kvm-x86-pmu-ops.h   |   3 +
 arch/x86/include/asm/kvm_host.h          |   8 +
 arch/x86/include/asm/msr-index.h         |   1 +
 arch/x86/include/asm/perf_event.h        |   4 +
 arch/x86/include/asm/vmx.h               |   1 +
 arch/x86/kernel/idt.c                    |   1 +
 arch/x86/kernel/irq.c                    |  29 ++++
 arch/x86/kvm/cpuid.c                     |   4 +
 arch/x86/kvm/lapic.h                     |   5 +
 arch/x86/kvm/pmu.c                       | 102 ++++++++++++-
 arch/x86/kvm/pmu.h                       |  37 ++++-
 arch/x86/kvm/vmx/capabilities.h          |   1 +
 arch/x86/kvm/vmx/nested.c                |  52 +++++++
 arch/x86/kvm/vmx/pmu_intel.c             | 186 +++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.c                   | 176 +++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h                   |   3 +-
 arch/x86/kvm/x86.c                       |  37 ++++-
 arch/x86/kvm/x86.h                       |   2 +
 include/linux/perf_event.h               |  11 ++
 kernel/events/core.c                     | 179 ++++++++++++++++++++++
 kernel/watchdog_perf.c                   |   1 +
 tools/arch/x86/include/asm/irq_vectors.h |   1 +
 29 files changed, 852 insertions(+), 44 deletions(-)


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
-- 
2.34.1



* [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 17:04   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 02/41] perf: Support guest enter/exit interfaces Xiong Zhang
                   ` (41 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for version 4 and
later PMUs, which include the improvements needed for virtualization.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/intel/core.c | 6 ++++++
 include/linux/perf_event.h   | 1 +
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a08f794a0e79..cf790c37757a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4662,6 +4662,9 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
 	else
 		pmu->pmu.capabilities |= ~PERF_PMU_CAP_AUX_OUTPUT;
 
+	if (x86_pmu.version >= 4)
+		pmu->pmu.capabilities |= PERF_PMU_CAP_VPMU_PASSTHROUGH;
+
 	intel_pmu_check_event_constraints(pmu->event_constraints,
 					  pmu->num_counters,
 					  pmu->num_counters_fixed,
@@ -6137,6 +6140,9 @@ __init int intel_pmu_init(void)
 			pr_cont(" AnyThread deprecated, ");
 	}
 
+	if (version >= 4)
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_VPMU_PASSTHROUGH;
+
 	/*
 	 * Install the hw-cache-events table:
 	 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index afb028c54f33..60eff413dbba 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -291,6 +291,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_NO_EXCLUDE			0x0040
 #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
+#define PERF_PMU_CAP_VPMU_PASSTHROUGH		0x0200
 
 struct perf_output_handle;
 
-- 
2.34.1



* [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-03-20 16:40   ` Raghavendra Rao Ananta
  2024-04-11 18:06   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 03/41] perf: Set exclude_guest onto nmi_watchdog Xiong Zhang
                   ` (40 subsequent siblings)
  42 siblings, 2 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Currently, the guest and host share the PMU resources when a guest is
running. KVM has to create extra virtual events to simulate the guest's
events, which brings several issues, e.g., high overhead and poor
accuracy.

A new pass-through method is proposed to address the issue. It requires
that the PMU resources can be fully occupied by the guest while it's
running. Two new interfaces are implemented to fulfill the requirement.
The hypervisor should invoke these interfaces when entering/exiting a
guest that wants the pass-through PMU capability.

The PMU resources should only be occupied temporarily while a guest is
running. When the guest is not running, the PMU resources are still
shared among different users.

The exclude_guest event modifier is used to guarantee exclusive occupation
of the PMU resources. When a guest enters, perf enforces the exclude_guest
requirement: pre-existing !exclude_guest events are moved to the error
state, and creation of new !exclude_guest events errors out during the
period. So the PMU resources can be safely accessed by the guest directly.
https://lore.kernel.org/lkml/20231002204017.GB27267@noisy.programming.kicks-ass.net/

Not all PMUs support exclude_guest and vPMU pass-through, e.g., uncore
PMUs and SW PMUs. The guest enter/exit interfaces should only impact the
supported PMUs. Add a new PERF_PMU_CAP_VPMU_PASSTHROUGH flag to mark the
PMUs that support the feature.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |   9 ++
 kernel/events/core.c       | 174 +++++++++++++++++++++++++++++++++++++
 2 files changed, 183 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 60eff413dbba..9912d1112371 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1392,6 +1392,11 @@ static inline int is_exclusive_pmu(struct pmu *pmu)
 	return pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE;
 }
 
+static inline int has_vpmu_passthrough_cap(struct pmu *pmu)
+{
+	return pmu->capabilities & PERF_PMU_CAP_VPMU_PASSTHROUGH;
+}
+
 extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
 
 extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
@@ -1709,6 +1714,8 @@ extern void perf_event_task_tick(void);
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
+extern void perf_guest_enter(void);
+extern void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1795,6 +1802,8 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
 {
 	return 0;
 }
+static inline void perf_guest_enter(void)				{ }
+static inline void perf_guest_exit(void)				{ }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 683dc086ef10..59471eeec7e4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3803,6 +3803,8 @@ static inline void group_update_userpage(struct perf_event *group_event)
 		event_update_userpage(event);
 }
 
+static DEFINE_PER_CPU(bool, __perf_force_exclude_guest);
+
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -3814,6 +3816,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	if (!event_filter_match(event))
 		return 0;
 
+	/*
+	 * The __perf_force_exclude_guest indicates entering the guest.
+	 * No events of the passthrough PMU should be scheduled.
+	 */
+	if (__this_cpu_read(__perf_force_exclude_guest) &&
+	    has_vpmu_passthrough_cap(event->pmu))
+		return 0;
+
 	if (group_can_go_on(event, *can_add_hw)) {
 		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
@@ -5707,6 +5717,165 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
 }
 EXPORT_SYMBOL_GPL(perf_event_pause);
 
+static void __perf_force_exclude_guest_pmu(struct perf_event_pmu_context *pmu_ctx,
+					   struct perf_event *event)
+{
+	struct perf_event_context *ctx = pmu_ctx->ctx;
+	struct perf_event *sibling;
+	bool include_guest = false;
+
+	event_sched_out(event, ctx);
+	if (!event->attr.exclude_guest)
+		include_guest = true;
+	for_each_sibling_event(sibling, event) {
+		event_sched_out(sibling, ctx);
+		if (!sibling->attr.exclude_guest)
+			include_guest = true;
+	}
+	if (include_guest) {
+		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+		for_each_sibling_event(sibling, event)
+			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+	}
+}
+
+static void perf_force_exclude_guest_pmu(struct perf_event_pmu_context *pmu_ctx)
+{
+	struct perf_event *event, *tmp;
+	struct pmu *pmu = pmu_ctx->pmu;
+
+	perf_pmu_disable(pmu);
+
+	/*
+	 * Sched out all active events.
+	 * For the !exclude_guest events, they are forced to be sched out and
+	 * moved to the error state.
+	 * For the exclude_guest events, they should be scheduled out anyway
+	 * when the guest is running.
+	 */
+	list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
+		__perf_force_exclude_guest_pmu(pmu_ctx, event);
+
+	list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
+		__perf_force_exclude_guest_pmu(pmu_ctx, event);
+
+	pmu_ctx->rotate_necessary = 0;
+
+	perf_pmu_enable(pmu);
+}
+
+static void perf_force_exclude_guest_enter(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	update_context_time(ctx);
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		/*
+		 * The PMU, which doesn't have the capability of excluding guest
+		 * e.g., uncore PMU, is not impacted.
+		 */
+		if (!has_vpmu_passthrough_cap(pmu_ctx->pmu))
+			continue;
+		perf_force_exclude_guest_pmu(pmu_ctx);
+	}
+}
+
+/*
+ * When a guest enters, force all active events of the PMU, which supports
+ * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
+ * PMUs, such as uncore PMU, should not be impacted. The guest can
+ * temporarily own all counters of the PMU.
+ * During the period, all the creation of the new event of the PMU with
+ * !exclude_guest are error out.
+ */
+void perf_guest_enter(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	if (__this_cpu_read(__perf_force_exclude_guest))
+		return;
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	perf_force_exclude_guest_enter(&cpuctx->ctx);
+	if (cpuctx->task_ctx)
+		perf_force_exclude_guest_enter(cpuctx->task_ctx);
+
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+
+	__this_cpu_write(__perf_force_exclude_guest, true);
+}
+EXPORT_SYMBOL_GPL(perf_guest_enter);
+
+static void perf_force_exclude_guest_exit(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+	struct pmu *pmu;
+
+	update_context_time(ctx);
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		pmu = pmu_ctx->pmu;
+		if (!has_vpmu_passthrough_cap(pmu))
+			continue;
+
+		perf_pmu_disable(pmu);
+		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
+		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
+		perf_pmu_enable(pmu);
+	}
+}
+
+void perf_guest_exit(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	if (!__this_cpu_read(__perf_force_exclude_guest))
+		return;
+
+	__this_cpu_write(__perf_force_exclude_guest, false);
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	perf_force_exclude_guest_exit(&cpuctx->ctx);
+	if (cpuctx->task_ctx)
+		perf_force_exclude_guest_exit(cpuctx->task_ctx);
+
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+EXPORT_SYMBOL_GPL(perf_guest_exit);
+
+static inline int perf_force_exclude_guest_check(struct perf_event *event,
+						 int cpu, struct task_struct *task)
+{
+	bool *force_exclude_guest = NULL;
+
+	if (!has_vpmu_passthrough_cap(event->pmu))
+		return 0;
+
+	if (event->attr.exclude_guest)
+		return 0;
+
+	if (cpu != -1) {
+		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
+	} else if (task && (task->flags & PF_VCPU)) {
+		/*
+		 * Just need to check the running CPU in the event creation. If the
+		 * task is moved to another CPU which supports the force_exclude_guest.
+		 * The event will filtered out and be moved to the error stage. See
+		 * merge_sched_in().
+		 */
+		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
+	}
+
+	if (force_exclude_guest && *force_exclude_guest)
+		return -EBUSY;
+	return 0;
+}
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block
@@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		goto err_ns;
 	}
 
+	if (perf_force_exclude_guest_check(event, cpu, task)) {
+		err = -EBUSY;
+		goto err_pmu;
+	}
+
 	/*
 	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
 	 * events (they don't make sense as the cgroup will be different
-- 
2.34.1



* [RFC PATCH 03/41] perf: Set exclude_guest onto nmi_watchdog
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 02/41] perf: Support guest enter/exit interfaces Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 18:56   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling Xiong Zhang
                   ` (39 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

The perf event for the NMI watchdog is a per-CPU pinned, system-wide
event. If it doesn't have the exclude_guest flag, it is put into the
error state once a guest with the passthrough PMU starts, which breaks
the NMI watchdog completely.

This commit adds the exclude_guest flag to this perf event, so it is
stopped while a VM is running but continues working after VM-exit. The
NMI watchdog therefore cannot detect hardlockups while a VM is running,
which still degrades the NMI watchdog somewhat. But host perf events must
be stopped while a VM with the passthrough PMU is running, and currently
no other reliable mechanism is available to replace the perf event used
by the NMI watchdog.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 kernel/watchdog_perf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
index 8ea00c4a24b2..c8ba656ff674 100644
--- a/kernel/watchdog_perf.c
+++ b/kernel/watchdog_perf.c
@@ -88,6 +88,7 @@ static struct perf_event_attr wd_hw_attr = {
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
 	.disabled	= 1,
+	.exclude_guest  = 1,
 };
 
 /* Callback function for perf event subsystem */
-- 
2.34.1



* [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (2 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 03/41] perf: Set exclude_guest onto nmi_watchdog Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 17:10   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU Xiong Zhang
                   ` (38 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

Create a new vector in the host IDT for PMI handling within a passthrough
vPMU implementation. In addition, add a function to allow the registration
of the handler and a function to switch the PMI handler.

This is preparation work to let the KVM passthrough vPMU handle its own
PMIs without interference from the host PMU's PMI handler.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/hardirq.h           |  1 +
 arch/x86/include/asm/idtentry.h          |  1 +
 arch/x86/include/asm/irq.h               |  1 +
 arch/x86/include/asm/irq_vectors.h       |  2 +-
 arch/x86/kernel/idt.c                    |  1 +
 arch/x86/kernel/irq.c                    | 29 ++++++++++++++++++++++++
 tools/arch/x86/include/asm/irq_vectors.h |  1 +
 7 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 66837b8c67f1..c1e2c1a480bf 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -19,6 +19,7 @@ typedef struct {
 	unsigned int kvm_posted_intr_ipis;
 	unsigned int kvm_posted_intr_wakeup_ipis;
 	unsigned int kvm_posted_intr_nested_ipis;
+	unsigned int kvm_vpmu_pmis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 05fd175cec7d..d1b58366bc21 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -675,6 +675,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
+DECLARE_IDTENTRY_SYSVEC(KVM_VPMU_VECTOR,	        sysvec_kvm_vpmu_handler);
 #endif
 
 #if IS_ENABLED(CONFIG_HYPERV)
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 836c170d3087..ee268f42d04a 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -31,6 +31,7 @@ extern void fixup_irqs(void);
 
 #ifdef CONFIG_HAVE_KVM
 extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
+extern void kvm_set_vpmu_handler(void (*handler)(void));
 #endif
 
 extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 3a19904c2db6..120403572307 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -77,7 +77,7 @@
  */
 #define IRQ_WORK_VECTOR			0xf6
 
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
+#define KVM_VPMU_VECTOR			0xf5
 #define DEFERRED_ERROR_VECTOR		0xf4
 
 /* Vector on which hypervisor callbacks will be delivered */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 8857abc706e4..6944eec251f4 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
 	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
 	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
+	INTG(KVM_VPMU_VECTOR,		        asm_sysvec_kvm_vpmu_handler),
 # endif
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 11761c124545..c6cffb34191b 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+
+	seq_printf(p, "%*s: ", prec, "VPMU");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->kvm_vpmu_pmis);
+	seq_puts(p, " PT PMU PMI\n");
+
 #endif
 	return 0;
 }
@@ -293,6 +300,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
 #ifdef CONFIG_HAVE_KVM
 static void dummy_handler(void) {}
 static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
+static void (*kvm_vpmu_handler)(void) = dummy_handler;
 
 void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
 {
@@ -305,6 +313,17 @@ void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
 }
 EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
 
+void kvm_set_vpmu_handler(void (*handler)(void))
+{
+	if (handler)
+		kvm_vpmu_handler = handler;
+	else {
+		kvm_vpmu_handler = dummy_handler;
+		synchronize_rcu();
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_set_vpmu_handler);
+
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
  */
@@ -332,6 +351,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 	apic_eoi();
 	inc_irq_stat(kvm_posted_intr_nested_ipis);
 }
+
+/*
+ * Handler for KVM_VPMU_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_vpmu_handler)
+{
+	apic_eoi();
+	inc_irq_stat(kvm_vpmu_pmis);
+	kvm_vpmu_handler();
+}
 #endif
 
 
diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
index 3a19904c2db6..3773e60f1af8 100644
--- a/tools/arch/x86/include/asm/irq_vectors.h
+++ b/tools/arch/x86/include/asm/irq_vectors.h
@@ -85,6 +85,7 @@
 
 /* Vector for KVM to deliver posted interrupt IPI */
 #ifdef CONFIG_HAVE_KVM
+#define KVM_VPMU_VECTOR			0xf5
 #define POSTED_INTR_VECTOR		0xf2
 #define POSTED_INTR_WAKEUP_VECTOR	0xf1
 #define POSTED_INTR_NESTED_VECTOR	0xf0
-- 
2.34.1



* [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (3 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 19:07   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler Xiong Zhang
                   ` (37 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

Add functions to register/unregister the PMI handler at KVM module
initialization and teardown time. This allows the host PMU with
passthrough capability enabled to switch the PMI handler at PMU context
switch time.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f1..4432e736129f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10611,6 +10611,18 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
 
+void kvm_passthrough_pmu_handler(void)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	if (!vcpu) {
+		pr_warn_once("%s: no running vcpu found!\n", __func__);
+		return;
+	}
+
+	kvm_make_request(KVM_REQ_PMI, vcpu);
+}
+
 /*
  * Called within kvm->srcu read side.
  * Returns 1 to let vcpu_run() continue the guest execution loop without
@@ -13815,6 +13827,7 @@ static int __init kvm_x86_init(void)
 {
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
+	kvm_set_vpmu_handler(kvm_passthrough_pmu_handler);
 	return 0;
 }
 module_init(kvm_x86_init);
@@ -13825,5 +13838,6 @@ static void __exit kvm_x86_exit(void)
 	 * If module_init() is implemented, module_exit() must also be
 	 * implemented to allow module unload.
 	 */
+	kvm_set_vpmu_handler(NULL);
 }
 module_exit(kvm_x86_exit);
-- 
2.34.1



* [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (4 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 19:17   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW Xiong Zhang
                   ` (36 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

Add functions to switch the PMI handler, since the passthrough PMU and
the host PMU use different interrupt vectors.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c            | 15 +++++++++++++++
 arch/x86/include/asm/perf_event.h |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 40ad1425ffa2..3f87894d8c8e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -701,6 +701,21 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
 }
 EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
 
+void perf_guest_switch_to_host_pmi_vector(void)
+{
+	lockdep_assert_irqs_disabled();
+
+	apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
+
+void perf_guest_switch_to_kvm_pmi_vector(void)
+{
+	lockdep_assert_irqs_disabled();
+
+	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
+}
+EXPORT_SYMBOL_GPL(perf_guest_switch_to_kvm_pmi_vector);
 /*
  * There may be PMI landing after enabled=0. The PMI hitting could be before or
  * after disable_all.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 2618ec7c3d1d..021ab362a061 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -573,6 +573,9 @@ static inline void perf_events_lapic_init(void)	{ }
 static inline void perf_check_microcode(void) { }
 #endif
 
+extern void perf_guest_switch_to_host_pmi_vector(void);
+extern void perf_guest_switch_to_kvm_pmi_vector(void);
+
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
 extern void x86_perf_get_lbr(struct x86_pmu_lbr *lbr);
-- 
2.34.1



* [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (5 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 19:21   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 08/41] KVM: x86/pmu: Add get virtual LVTPC_MASK bit function Xiong Zhang
                   ` (35 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

When the guest clears the LVTPC_MASK bit in its PMI handler in PMU
passthrough mode, this bit should be reflected onto HW; otherwise HW
cannot generate another PMI during VM running until the bit is cleared.

This commit sets the HW LVTPC_MASK bit when the PMU vector is switched to
the KVM PMI vector.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c            | 9 +++++++--
 arch/x86/include/asm/perf_event.h | 2 +-
 arch/x86/kvm/lapic.h              | 1 -
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3f87894d8c8e..ece042cfb470 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -709,13 +709,18 @@ void perf_guest_switch_to_host_pmi_vector(void)
 }
 EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
 
-void perf_guest_switch_to_kvm_pmi_vector(void)
+void perf_guest_switch_to_kvm_pmi_vector(bool mask)
 {
 	lockdep_assert_irqs_disabled();
 
-	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
+	if (mask)
+		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR |
+			   APIC_LVT_MASKED);
+	else
+		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
 }
 EXPORT_SYMBOL_GPL(perf_guest_switch_to_kvm_pmi_vector);
+
 /*
  * There may be PMI landing after enabled=0. The PMI hitting could be before or
  * after disable_all.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 021ab362a061..180d63ba2f46 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -574,7 +574,7 @@ static inline void perf_check_microcode(void) { }
 #endif
 
 extern void perf_guest_switch_to_host_pmi_vector(void);
-extern void perf_guest_switch_to_kvm_pmi_vector(void);
+extern void perf_guest_switch_to_kvm_pmi_vector(bool mask);
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 0a0ea4b5dd8c..e30641d5ac90 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -277,5 +277,4 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
 {
 	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
 }
-
 #endif
-- 
2.34.1



* [RFC PATCH 08/41] KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (6 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 19:22   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 09/41] perf: core/x86: Forbid PMI handler when guest own PMU Xiong Zhang
                   ` (34 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

In PMU passthrough mode, the guest's virtual LVTPC_MASK bit must be
reflected onto HW; in particular, when the guest clears it, the HW bit
should be cleared as well. Otherwise the processor can't generate a PMI
until the HW mask bit is cleared.

This commit adds a function to get the virtual LVTPC_MASK bit so that it
can be applied to HW later.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
---
 arch/x86/kvm/lapic.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e30641d5ac90..dafae44325d1 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -277,4 +277,10 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
 {
 	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
 }
+
+static inline bool kvm_lapic_get_lvtpc_mask(struct kvm_vcpu *vcpu)
+{
+	return lapic_in_kernel(vcpu) &&
+	       (kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC) & APIC_LVT_MASKED);
+}
 #endif
-- 
2.34.1



* [RFC PATCH 09/41] perf: core/x86: Forbid PMI handler when guest own PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (7 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 08/41] KVM: x86/pmu: Add get virtual LVTPC_MASK bit function Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 19:26   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 10/41] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Xiong Zhang
                   ` (33 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
be held pending until EFLAGS.IF is set. In the meantime, if the logical
processor receives an NMI for any reason at all, perf_event_nmi_handler()
will be invoked. If there is any active perf event anywhere on the system,
x86_pmu_handle_irq() will be invoked, and it will clear
IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
be a mystery which counter(s) overflowed.

When the LVTPC is using the KVM PMI vector, the PMU is owned by the
guest. A host NMI lets x86_pmu_handle_irq() run, which restores the PMU
vector to NMI and clears IA32_PERF_GLOBAL_STATUS; this breaks the guest
vPMU passthrough environment.

So modify perf_event_nmi_handler() to check perf_is_in_guest_passthrough(),
and if so, simply return without calling x86_pmu_handle_irq().

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c     | 17 +++++++++++++++++
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       |  5 +++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ece042cfb470..20a5ccc641b9 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1752,6 +1752,23 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	u64 finish_clock;
 	int ret;
 
+	/*
+	 * When PMU is pass-through into guest, this handler should be forbidden from
+	 * running, the reasons are:
+	 * 1. After perf_guest_switch_to_kvm_pmi_vector() is called, and before cpu
+	 *    enter into non-root mode, NMI could happen, but x86_pmu_handle_irq()
+	 *    restore PMU to use NMI vector, which destroy KVM PMI vector setting.
+	 * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
+	 *    call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
+	 *    guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
+	 *    clear global_status MSR which has guest status now, then this destroy
+	 *    guest PMU status.
+	 * 3. After VM exit, but before KVM save guest PMU context, host NMI other
+	 *    than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
+	 *    which has guest status now, then this destroy guest PMU status.
+	 */
+	if (perf_is_in_guest_passthrough())
+		return 0;
 	/*
 	 * All PMUs/events that share this PMI handler should make sure to
 	 * increment active_events for their events.
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9912d1112371..6cfa0f5ac120 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1716,6 +1716,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 extern void perf_guest_enter(void);
 extern void perf_guest_exit(void);
+extern bool perf_is_in_guest_passthrough(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 59471eeec7e4..00ea2705444e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5848,6 +5848,11 @@ void perf_guest_exit(void)
 }
 EXPORT_SYMBOL_GPL(perf_guest_exit);
 
+bool perf_is_in_guest_passthrough(void)
+{
+	return __this_cpu_read(__perf_force_exclude_guest);
+}
+
 static inline int perf_force_exclude_guest_check(struct perf_event *event,
 						 int cpu, struct task_struct *task)
 {
-- 
2.34.1



* [RFC PATCH 10/41] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (8 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 09/41] perf: core/x86: Forbid PMI handler when guest own PMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance Xiong Zhang
                   ` (32 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Plumb the passthrough PMU capability into x86_pmu_cap in order to let any
kernel entity, such as KVM, know that the host PMU supports passthrough
PMU mode and has the implementation.
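
For context, a consumer such as KVM could query the new bit roughly like
this (illustrative snippet, not part of this patch):

	struct x86_pmu_capability cap;

	perf_get_x86_pmu_capability(&cap);
	if (cap.passthrough)
		pr_info("host PMU supports passthrough mode\n");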

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c            | 1 +
 arch/x86/events/intel/core.c      | 4 +++-
 arch/x86/events/perf_event.h      | 1 +
 arch/x86/include/asm/perf_event.h | 1 +
 4 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 20a5ccc641b9..d2b7aa5b7876 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3026,6 +3026,7 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->events_mask	= (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 	cap->pebs_ept		= x86_pmu.pebs_ept;
+	cap->passthrough	= !!(x86_pmu.flags & PMU_FL_PASSTHROUGH);
 }
 EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index cf790c37757a..727ee64bb566 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6140,8 +6140,10 @@ __init int intel_pmu_init(void)
 			pr_cont(" AnyThread deprecated, ");
 	}
 
-	if (version >= 4)
+	if (version >= 4) {
+		x86_pmu.flags |= PMU_FL_PASSTHROUGH;
 		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_VPMU_PASSTHROUGH;
+	}
 
 	/*
 	 * Install the hw-cache-events table:
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53dd5d495ba6..39c58a3f5a6b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1012,6 +1012,7 @@ do {									\
 #define PMU_FL_INSTR_LATENCY	0x80 /* Support Instruction Latency in PEBS Memory Info Record */
 #define PMU_FL_MEM_LOADS_AUX	0x100 /* Require an auxiliary event for the complete memory info */
 #define PMU_FL_RETIRE_LATENCY	0x200 /* Support Retire Latency in PEBS */
+#define PMU_FL_PASSTHROUGH	0x400 /* Support passthrough mode */
 
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 180d63ba2f46..400727b27634 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -254,6 +254,7 @@ struct x86_pmu_capability {
 	unsigned int	events_mask;
 	int		events_mask_len;
 	unsigned int	pebs_ept	:1;
+	unsigned int	passthrough	:1;
 };
 
 /*
-- 
2.34.1



* [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (9 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 10/41] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 20:54   ` Sean Christopherson
  2024-04-11 21:03   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 12/41] KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs Xiong Zhang
                   ` (31 subsequent siblings)
  42 siblings, 2 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Mingwei Zhang <mizhang@google.com>

Introduce enable_passthrough_pmu as a read-only KVM kernel module
parameter. This variable is true only when all of the following conditions
are satisfied:
 - it is set to true when the module is loaded.
 - enable_pmu is true.
 - KVM is running on an Intel CPU.
 - the CPU supports PerfMon v4.
 - the host PMU supports passthrough mode.

The value is always read-only because the passthrough PMU currently does
not support features like LBR and PEBS, while the emulated PMU does.
Allowing it to change at runtime would end up with two different values for
kvm_caps.supported_perf_cap, which is initialized at module load time, and
maintaining two different perf capabilities would add complexity. Further,
there is not enough motivation to support running the two types of PMU
implementations at the same time, although it is possible/feasible in
reality.

Finally, always propagate enable_passthrough_pmu and perf_capabilities into
kvm->arch for each KVM instance.
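
The gating described above boils down to roughly the following check (a
minimal illustrative sketch with a hypothetical helper name, not code from
this patch; the actual logic lives in kvm_init_pmu_capability() in the diff
below):

	/* Sketch: when does the passthrough vPMU stay enabled? */
	static bool passthrough_pmu_usable(bool is_intel)
	{
		return enable_passthrough_pmu &&	/* module parameter  */
		       enable_pmu &&			/* PMU virt enabled  */
		       is_intel &&			/* Intel CPUs only   */
		       kvm_pmu_cap.version >= 4 &&	/* needs PerfMon v4  */
		       kvm_pmu_cap.passthrough;		/* host PMU support  */
	}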

Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h              | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |  5 +++--
 arch/x86/kvm/x86.c              |  9 +++++++++
 arch/x86/kvm/x86.h              |  1 +
 5 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7036982332e..f2e73e6830a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1371,6 +1371,7 @@ struct kvm_arch {
 
 	bool bus_lock_detection_enabled;
 	bool enable_pmu;
+	bool enable_passthrough_pmu;
 
 	u32 notify_window;
 	u32 notify_vmexit_flags;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 1d64113de488..51011603c799 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 			enable_pmu = false;
 	}
 
+	/* Pass-through vPMU is only supported in Intel CPUs. */
+	if (!is_intel)
+		enable_passthrough_pmu = false;
+
+	/*
+	 * Pass-through vPMU requires at least PerfMon version 4 because the
+	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
+	 * for counter emulation as well as PMU context switch.  In addition, it
+	 * requires host PMU support on passthrough mode. Disable pass-through
+	 * vPMU if any condition fails.
+	 */
+	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
+		enable_passthrough_pmu = false;
+
 	if (!enable_pmu) {
 		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
 		return;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1..e4610b80e519 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7835,13 +7835,14 @@ static u64 vmx_get_perf_capabilities(void)
 	if (boot_cpu_has(X86_FEATURE_PDCM))
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
 
-	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
+	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
+	    !enable_passthrough_pmu) {
 		x86_perf_get_lbr(&lbr);
 		if (lbr.nr)
 			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
 	}
 
-	if (vmx_pebs_supported()) {
+	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
 		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
 		if ((perf_cap & PERF_CAP_PEBS_FORMAT) < 4)
 			perf_cap &= ~PERF_CAP_PEBS_BASELINE;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4432e736129f..074452aa700d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -193,6 +193,11 @@ bool __read_mostly enable_pmu = true;
 EXPORT_SYMBOL_GPL(enable_pmu);
 module_param(enable_pmu, bool, 0444);
 
+/* Enable/disable PMU virtualization */
+bool __read_mostly enable_passthrough_pmu = true;
+EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
+module_param(enable_passthrough_pmu, bool, 0444);
+
 bool __read_mostly eager_page_split = true;
 module_param(eager_page_split, bool, 0644);
 
@@ -6553,6 +6558,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		mutex_lock(&kvm->lock);
 		if (!kvm->created_vcpus) {
 			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
+			/* Disable passthrough PMU if enable_pmu is false. */
+			if (!kvm->arch.enable_pmu)
+				kvm->arch.enable_passthrough_pmu = false;
 			r = 0;
 		}
 		mutex_unlock(&kvm->lock);
@@ -12480,6 +12488,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
 	kvm->arch.guest_can_read_msr_platform_info = true;
 	kvm->arch.enable_pmu = enable_pmu;
+	kvm->arch.enable_passthrough_pmu = enable_passthrough_pmu;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5184fde1dc54..38b73e98eae9 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -329,6 +329,7 @@ extern u64 host_arch_capabilities;
 extern struct kvm_caps kvm_caps;
 
 extern bool enable_pmu;
+extern bool enable_passthrough_pmu;
 
 /*
  * Get a filtered version of KVM's supported XCR0 that strips out dynamic
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 12/41] KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (10 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propagate to KVM instance Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 20:57   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 13/41] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Xiong Zhang
                   ` (30 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Mingwei Zhang <mizhang@google.com>

Plumb the passthrough PMU setting from kvm->arch into kvm_pmu for each
vCPU created. Note that enabling the PMU is decided by the VMM when it sets
the CPUID bits exposed to the guest VM, so propagate the per-vCPU enabling
in intel_pmu_refresh().

Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/pmu.c              |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 10 ++++++++--
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f2e73e6830a3..ede45c923089 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,6 +575,8 @@ struct kvm_pmu {
 	 * redundant check before cleanup if guest don't use vPMU at all.
 	 */
 	u8 event_count;
+
+	bool passthrough;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 9ae07db6f0f6..1853739a59bf 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -665,6 +665,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
 	static_call(kvm_x86_pmu_init)(vcpu);
 	pmu->event_count = 0;
 	pmu->need_cleanup = false;
+	pmu->passthrough = false;
 	kvm_pmu_refresh(vcpu);
 }
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 820d3e1f6b4f..15cc107ed573 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -517,14 +517,20 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 		return;
 
 	entry = kvm_find_cpuid_entry(vcpu, 0xa);
-	if (!entry || !vcpu->kvm->arch.enable_pmu)
+	if (!entry || !vcpu->kvm->arch.enable_pmu) {
+		pmu->passthrough = false;
 		return;
+	}
 	eax.full = entry->eax;
 	edx.full = entry->edx;
 
 	pmu->version = eax.split.version_id;
-	if (!pmu->version)
+	if (!pmu->version) {
+		pmu->passthrough = false;
 		return;
+	}
+
+	pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu;
 
 	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
 					 kvm_pmu_cap.num_counters_gp);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 13/41] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (11 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 12/41] KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 14/41] KVM: x86/pmu: Allow RDPMC pass through Xiong Zhang
                   ` (29 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Add a helper to check if passthrough PMU is enabled for convenience as it
is vendor neutral.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 51011603c799..28beae0f9209 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -267,6 +267,11 @@ static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc)
 	return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
 }
 
+static inline bool is_passthrough_pmu_enabled(struct kvm_vcpu *vcpu)
+{
+	return vcpu_to_pmu(vcpu)->passthrough;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 14/41] KVM: x86/pmu: Allow RDPMC pass through
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (12 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 13/41] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Xiong Zhang
                   ` (28 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Mingwei Zhang <mizhang@google.com>

Clear RDPMC_EXITING in the VMCS CPU-based execution controls to allow the
rdpmc instruction to proceed without a VM-exit. This improves performance
for the passthrough PMU. Clear the bit in vmx_vcpu_after_set_cpuid() when
the guest enables the PMU and the passthrough PMU is allowed.

Passing RDPMC through allows the guest to read several PMU counters that
are not exposed to it, such as fixed counter 3 and IA32_PERF_METRICS.

To cope with this issue, these registers will be cleared in later commits
when context switching to the VM guest.
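
From the guest's point of view, what this enables is a direct counter read
with no VM-exit; a minimal guest-side sketch (illustrative only, not part
of this patch):

	/* Read general-purpose counter 'idx' directly via RDPMC. */
	static inline unsigned long long guest_read_gp_counter(unsigned int idx)
	{
		unsigned int lo, hi;

		asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
		return lo | ((unsigned long long)hi << 32);
	}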

Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e4610b80e519..33cb69ff0804 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7819,6 +7819,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		vmx->msr_ia32_feature_control_valid_bits &=
 			~FEAT_CTL_SGX_LC_ENABLED;
 
+	if (is_passthrough_pmu_enabled(&vmx->vcpu))
+		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
+
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (13 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 14/41] KVM: x86/pmu: Allow RDPMC pass through Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:21   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 16/41] KVM: x86/pmu: Create a function prototype to disable MSR interception Xiong Zhang
                   ` (27 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

In PMU passthrough mode, there are three requirements for managing
IA32_PERF_GLOBAL_CTRL:
 - the guest IA32_PERF_GLOBAL_CTRL MSR must be saved at VM-exit.
 - the IA32_PERF_GLOBAL_CTRL MSR must be cleared at VM-exit to prevent any
   counter from running within the KVM run loop.
 - the guest IA32_PERF_GLOBAL_CTRL MSR must be restored at VM-entry.

Introduce the vmx_set_perf_global_ctrl() function to set up the automatic
switching of IA32_PERF_GLOBAL_CTRL and invoke it after the VMM finishes
setting up the CPUID bits.
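
Conceptually, the intended effect around the VM boundary is the following
(illustrative sketch only; guest_global_ctrl is just a placeholder for the
cached guest value, and the patch below implements this with VM-entry/exit
controls and the VMCS MSR auto-load/store lists rather than explicit MSR
accesses):

	/* VM-entry: restore the guest's value. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, guest_global_ctrl);
	/* VM-exit: save the guest's value, then stop all counters. */
	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, guest_global_ctrl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);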

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/vmx.h |  1 +
 arch/x86/kvm/vmx/vmx.c     | 89 ++++++++++++++++++++++++++++++++------
 arch/x86/kvm/vmx/vmx.h     |  3 +-
 3 files changed, 78 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..f574e7b429a3 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -104,6 +104,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL      0x40000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 33cb69ff0804..8ab266e1e2a7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4387,6 +4387,74 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
 	return pin_based_exec_ctrl;
 }
 
+static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
+{
+	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
+	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
+	int i;
+
+	/*
+	 * PERF_GLOBAL_CTRL is toggled dynamically in emulated vPMU.
+	 */
+	if (cpu_has_perf_global_ctrl_bug() ||
+	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
+		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
+	}
+
+	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
+		/*
+		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
+		 */
+		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
+			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
+		else {
+			i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest,
+						       MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i < 0) {
+				i = vmx->msr_autoload.guest.nr++;
+				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT,
+					     vmx->msr_autoload.guest.nr);
+			}
+			vmx->msr_autoload.guest.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+			vmx->msr_autoload.guest.val[i].value = 0;
+		}
+		/*
+		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
+		 */
+		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
+			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
+		else {
+			i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.host,
+							MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i < 0) {
+				i = vmx->msr_autoload.host.nr++;
+				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT,
+					     vmx->msr_autoload.host.nr);
+			}
+			vmx->msr_autoload.host.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+			vmx->msr_autoload.host.val[i].value = 0;
+		}
+		/*
+		 * Setup auto save guest PERF_GLOBAL_CTRL msr at vm exit
+		 */
+		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
+			i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest,
+							MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i < 0) {
+				i = vmx->msr_autostore.guest.nr++;
+				vmcs_write32(VM_EXIT_MSR_STORE_COUNT,
+					     vmx->msr_autostore.guest.nr);
+			}
+			vmx->msr_autostore.guest.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+		}
+	}
+
+	vm_entry_controls_set(vmx, vmentry_ctrl);
+	vm_exit_controls_set(vmx, vmexit_ctrl);
+}
+
 static u32 vmx_vmentry_ctrl(void)
 {
 	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
@@ -4394,15 +4462,9 @@ static u32 vmx_vmentry_ctrl(void)
 	if (vmx_pt_mode_is_system())
 		vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
 				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
-	/*
-	 * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
-	 */
-	vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
-			  VM_ENTRY_LOAD_IA32_EFER |
-			  VM_ENTRY_IA32E_MODE);
 
-	if (cpu_has_perf_global_ctrl_bug())
-		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+	/* IA32e mode, and loading of EFER is toggled dynamically. */
+	vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_IA32E_MODE);
 
 	return vmentry_ctrl;
 }
@@ -4422,12 +4484,8 @@ static u32 vmx_vmexit_ctrl(void)
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
 				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
 
-	if (cpu_has_perf_global_ctrl_bug())
-		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
-
-	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
-	return vmexit_ctrl &
-		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+	/* Loading of EFER is toggled dynamically */
+	return vmexit_ctrl & ~VM_EXIT_LOAD_IA32_EFER;
 }
 
 static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -4765,6 +4823,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write64(VM_FUNCTION_CONTROL, 0);
 
 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
+	vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autostore.guest.val));
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
 	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
@@ -7822,6 +7881,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	if (is_passthrough_pmu_enabled(&vmx->vcpu))
 		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
 
+	vmx_set_perf_global_ctrl(vmx);
+
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
 }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..c89db35e1de8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -502,7 +502,8 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_LOAD_IA32_EFER |					\
 	       VM_EXIT_CLEAR_BNDCFGS |					\
 	       VM_EXIT_PT_CONCEAL_PIP |					\
-	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
+	       VM_EXIT_CLEAR_IA32_RTIT_CTL |                            \
+	       VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 16/41] KVM: x86/pmu: Create a function prototype to disable MSR interception
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (14 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 17/41] KVM: x86/pmu: Implement pmu function for Intel CPU " Xiong Zhang
                   ` (26 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Add one extra PMU function prototype to kvm_pmu_ops to disable PMU MSR
interception.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
 arch/x86/kvm/cpuid.c                   | 4 ++++
 arch/x86/kvm/pmu.c                     | 5 +++++
 arch/x86/kvm/pmu.h                     | 2 ++
 4 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 6c98f4bb4228..a2acf0afee5d 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -25,6 +25,7 @@ KVM_X86_PMU_OP(init)
 KVM_X86_PMU_OP(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
+KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dda6fc4cfae8..ab9e47ba8b6a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -366,6 +366,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
 
 	kvm_pmu_refresh(vcpu);
+
+	if (is_passthrough_pmu_enabled(vcpu))
+		kvm_pmu_passthrough_pmu_msrs(vcpu);
+
 	vcpu->arch.cr4_guest_rsvd_bits =
 	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 1853739a59bf..d83746f93392 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -893,3 +893,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 	kfree(filter);
 	return r;
 }
+
+void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+	static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 28beae0f9209..d575808c7258 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -33,6 +33,7 @@ struct kvm_pmu_ops {
 	void (*reset)(struct kvm_vcpu *vcpu);
 	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
 	void (*cleanup)(struct kvm_vcpu *vcpu);
+	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
@@ -286,6 +287,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id);
+void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 17/41] KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR interception
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (15 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 16/41] KVM: x86/pmu: Create a function prototype to disable MSR interception Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities Xiong Zhang
                   ` (25 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Disable interception of the PMU MSRs defined in the Architectural
Performance Monitoring chapter of the SDM, so that the guest can access
them without VM-exits.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 15cc107ed573..7f6cabb2c378 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -794,6 +794,25 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
 	}
 }
 
+void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+	int i;
+
+	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_gp_counters; i++) {
+		vmx_set_intercept_for_msr(vcpu, MSR_ARCH_PERFMON_EVENTSEL0 + i, MSR_TYPE_RW, false);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, false);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, false);
+	}
+
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR_CTRL, MSR_TYPE_RW, false);
+	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++)
+		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, false);
+
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_STATUS, MSR_TYPE_RW, false);
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, MSR_TYPE_RW, false);
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, false);
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.hw_event_available = intel_hw_event_available,
 	.pmc_idx_to_pmc = intel_pmc_idx_to_pmc,
@@ -808,6 +827,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.reset = intel_pmu_reset,
 	.deliver_pmi = intel_pmu_deliver_pmi,
 	.cleanup = intel_pmu_cleanup,
+	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (16 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 17/41] KVM: x86/pmu: Implement pmu function for Intel CPU " Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:23   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 19/41] KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU Xiong Zhang
                   ` (24 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Mingwei Zhang <mizhang@google.com>

Intercept the full-width GP counter MSRs in the passthrough PMU if the
guest does not have the capability to write in full width. In addition,
opportunistically add a warning if a write to a non-full-width counter MSR
is ever intercepted, in which case it is a clear mistake.
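
For reference, whether the guest can use the full-width counter aliases is
determined by the PMU_CAP_FW_WRITES bit in the guest's virtual
IA32_PERF_CAPABILITIES; roughly the check performed by the existing
fw_writes_is_enabled() helper used in the diff below (shown here as an
illustrative sketch):

	static inline bool fw_writes_is_enabled(struct kvm_vcpu *vcpu)
	{
		return (vcpu->arch.perf_capabilities & PMU_CAP_FW_WRITES) != 0;
	}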

Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 7f6cabb2c378..49df154fbb5b 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -429,6 +429,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
+			if (is_passthrough_pmu_enabled(vcpu) &&
+			    !(msr & MSR_PMC_FULL_WIDTH_BIT) &&
+			    !msr_info->host_initiated) {
+				pr_warn_once("passthrough PMU never intercepts non-full-width PMU counters\n");
+				return 1;
+			}
+
 			if ((msr & MSR_PMC_FULL_WIDTH_BIT) &&
 			    (data & ~pmu->counter_bitmask[KVM_PMC_GP]))
 				return 1;
@@ -801,7 +808,8 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_gp_counters; i++) {
 		vmx_set_intercept_for_msr(vcpu, MSR_ARCH_PERFMON_EVENTSEL0 + i, MSR_TYPE_RW, false);
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, false);
-		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, false);
+		if (fw_writes_is_enabled(vcpu))
+			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, false);
 	}
 
 	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR_CTRL, MSR_TYPE_RW, false);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 19/41] KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (17 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 20/41] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Xiong Zhang
                   ` (23 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Whitelist the PMU MSRs in is_valid_passthrough_msr() to avoid warnings in
the kernel log. In addition, add a comment above
vmx_possible_passthrough_msrs() to note that interception of the PMU MSRs
is handled separately in intel_passthrough_pmu_msrs().

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8ab266e1e2a7..349954f90fe9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -158,7 +158,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
 
 /*
  * List of MSRs that can be directly passed to the guest.
- * In addition to these x2apic and PT MSRs are handled specially.
+ * In addition to these x2apic, PMU and PT MSRs are handled specially.
  */
 static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_IA32_SPEC_CTRL,
@@ -698,6 +698,15 @@ static bool is_valid_passthrough_msr(u32 msr)
 	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
+	case MSR_ARCH_PERFMON_EVENTSEL0 ... MSR_ARCH_PERFMON_EVENTSEL0 + 7:
+	case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
+	case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+	case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		/* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
 		return true;
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 20/41] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (18 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 19/41] KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 21/41] KVM: x86/pmu: Introduce function prototype for Intel CPU to " Xiong Zhang
                   ` (22 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Extend kvm_pmu_ops with two optional callbacks, save_pmu_context() and
restore_pmu_context(), to allow vendor code to perform the PMU context
switch.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h |  2 ++
 arch/x86/kvm/pmu.c                     | 14 ++++++++++++++
 arch/x86/kvm/pmu.h                     |  4 ++++
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index a2acf0afee5d..ee201ac95f57 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -26,6 +26,8 @@ KVM_X86_PMU_OP(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
 KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
+KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
+KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d83746f93392..9d737f5b96bf 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -898,3 +898,17 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 {
 	static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
 }
+
+void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
+{
+	lockdep_assert_irqs_disabled();
+
+	static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
+}
+
+void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
+{
+	lockdep_assert_irqs_disabled();
+
+	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index d575808c7258..a4c0b2e2c24b 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -34,6 +34,8 @@ struct kvm_pmu_ops {
 	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
 	void (*cleanup)(struct kvm_vcpu *vcpu);
 	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
+	void (*save_pmu_context)(struct kvm_vcpu *vcpu);
+	void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
@@ -288,6 +290,8 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id);
 void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
+void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu);
+void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 21/41] KVM: x86/pmu: Introduce function prototype for Intel CPU to save/restore PMU context
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (19 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 20/41] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 22/41] x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU Xiong Zhang
                   ` (21 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Add empty save/restore PMU context callbacks for Intel CPUs and wire them
up in intel_pmu_ops; the actual implementation comes in a later patch.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 49df154fbb5b..0d58fe7d243e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -821,6 +821,14 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, false);
 }
 
+static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
+{
+}
+
+static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
+{
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.hw_event_available = intel_hw_event_available,
 	.pmc_idx_to_pmc = intel_pmc_idx_to_pmc,
@@ -836,6 +844,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.deliver_pmi = intel_pmu_deliver_pmi,
 	.cleanup = intel_pmu_cleanup,
 	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
+	.save_pmu_context = intel_save_pmu_context,
+	.restore_pmu_context = intel_restore_pmu_context,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 22/41] x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (20 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 21/41] KVM: x86/pmu: Introduce function prototype for Intel CPU to " Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Xiong Zhang
                   ` (20 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Add the additional PMU MSR MSR_CORE_PERF_GLOBAL_STATUS_SET to allow the
passthrough PMU to set bits in the otherwise read-only MSR
IA32_PERF_GLOBAL_STATUS.
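
Illustrative use (the restore path later in this series relies on exactly
this): writing a bit to MSR_CORE_PERF_GLOBAL_STATUS_SET sets the
corresponding bit in the read-only IA32_PERF_GLOBAL_STATUS, which is what
allows guest overflow status to be reinstated on context switch in:

	/* Re-assert the guest's pending overflow bits (sketch). */
	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);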

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/msr-index.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..270f4f420801 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1059,6 +1059,7 @@
 #define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e
 #define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL	0x00000390
+#define MSR_CORE_PERF_GLOBAL_STATUS_SET 0x00000391
 
 #define MSR_PERF_METRICS		0x00000329
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (21 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 22/41] x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:26   ` Sean Christopherson
  2024-04-11 21:44   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage Xiong Zhang
                   ` (19 subsequent siblings)
  42 siblings, 2 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Implement the save/restore of PMU state for the passthrough PMU on Intel
CPUs. In passthrough mode, KVM exclusively owns the PMU HW while control
flow is within the scope of the passthrough PMU. Thus, KVM needs to save
the host PMU state and gain full HW PMU ownership. Conversely, the host
regains ownership of the PMU HW from KVM when control flow leaves the
scope of the passthrough PMU.

Implement the PMU context switches for Intel CPUs and opportunistically use
rdpmcl() instead of rdmsrl() when reading counters since the former has
lower latency on Intel CPUs.
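
For context, rdpmcl() takes the RDPMC index encoding rather than an MSR
address; a sketch of how the save path below addresses the counters:

	rdpmcl(i, pmc->counter);				/* GP counter i    */
	rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);	/* fixed counter i */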

Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0d58fe7d243e..f79bebe7093d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 
 static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	u32 i;
+
+	if (pmu->version != 2) {
+		pr_warn("only PerfMon v2 is supported for passthrough PMU");
+		return;
+	}
+
+	/* Global ctrl register is already saved at VM-exit. */
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
+	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
+	if (pmu->global_status)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmc = &pmu->gp_counters[i];
+		rdpmcl(i, pmc->counter);
+		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
+		/*
+		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
+		 * information leakage and to avoid this guest GP counter being
+		 * accidentally enabled when the host later enables global ctrl.
+		 */
+		if (pmc->eventsel)
+			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
+		if (pmc->counter)
+			wrmsrl(MSR_IA32_PMC0 + i, 0);
+	}
+
+	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+	/*
+	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
+	 * to avoid these guest fixed counters being accidentally enabled
+	 * when the host later enables global ctrl.
+	 */
+	if (pmu->fixed_ctr_ctrl)
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = &pmu->fixed_counters[i];
+		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
+		if (pmc->counter)
+			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
+	}
 }
 
 static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	u64 global_status;
+	int i;
+
+	if (pmu->version != 2) {
+		pr_warn("only PerfMon v2 is supported for passthrough PMU");
+		return;
+	}
+
+	/* Clear host global_ctrl and global_status MSR if non-zero. */
+	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
+	if (global_status)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
+
+	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmc = &pmu->gp_counters[i];
+		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
+		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
+	}
+
+	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = &pmu->fixed_counters[i];
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
+	}
 }
 
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (22 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:36   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 25/41] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Xiong Zhang
                   ` (18 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Zero out unexposed counters/selectors: even though KVM intercepts all
accesses to unexposed PMU MSRs, it does pass through the RDPMC instruction,
which allows the guest to read all GP and fixed counters. So, zero out the
unexposed counter values, which might contain information critical to the
host.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f79bebe7093d..4b4da7f17895 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -895,11 +895,27 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
 		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
 	}
 
+	/*
+	 * Zero out unexposed GP counters/selectors to avoid information leakage
+	 * since passthrough PMU does not intercept RDPMC.
+	 */
+	for (i = pmu->nr_arch_gp_counters; i < kvm_pmu_cap.num_counters_gp; i++) {
+		wrmsrl(MSR_IA32_PMC0 + i, 0);
+		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
+	}
+
 	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
 		pmc = &pmu->fixed_counters[i];
 		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
 	}
+
+	/*
+	 * Zero out unexposed fixed counters to avoid information leakage
+	 * since passthrough PMU does not intercept RDPMC.
+	 */
+	for (i = pmu->nr_arch_fixed_counters; i < kvm_pmu_cap.num_counters_fixed; i++)
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
 }
 
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 25/41] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (23 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 26/41] KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU capability Xiong Zhang
                   ` (17 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Define the macro PMU_CAP_PERF_METRICS to represent bit 15 of the
MSR_IA32_PERF_CAPABILITIES MSR. This bit indicates whether the perf
metrics feature is available.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/vmx/capabilities.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..d8317552b634 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
 #define PT_MODE_HOST_GUEST	1
 
 #define PMU_CAP_FW_WRITES	(1ULL << 13)
+#define PMU_CAP_PERF_METRICS	BIT_ULL(15)
 #define PMU_CAP_LBR_FMT		0x3f
 
 struct nested_vmx_msrs {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 26/41] KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU capability
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (24 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 25/41] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:49   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest Xiong Zhang
                   ` (16 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Add a host_perf_cap field in kvm_caps to record the host PMU capability.
This helps KVM recognize the PMU capability differences between host and
guest, which improves performance of the PMU context switch. In particular,
KVM will need to zero out all MSRs that the guest PMU does not use but the
host PMU does. Having the host PMU feature set cached in host_perf_cap in
the kvm_caps structure saves a rdmsrl() of the IA32_PERF_CAPABILITIES MSR
on each PMU context switch. In addition, this is a more convenient approach
than opening another API on the host perf subsystem side.
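
A typical consumer looks like the following (sketch; the next patch in the
series adds this kind of check on the PMU restore path):

	/* Only touch PERF_METRICS if the host PMU actually has it. */
	if (kvm_caps.host_perf_cap & PMU_CAP_PERF_METRICS)
		wrmsrl(MSR_PERF_METRICS, 0);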

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 +++++++++--------
 arch/x86/kvm/x86.h     |  1 +
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 349954f90fe9..50100954cd92 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7896,32 +7896,33 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vmx_update_exception_bitmap(vcpu);
 }
 
-static u64 vmx_get_perf_capabilities(void)
+static void vmx_get_perf_capabilities(void)
 {
 	u64 perf_cap = PMU_CAP_FW_WRITES;
 	struct x86_pmu_lbr lbr;
-	u64 host_perf_cap = 0;
+
+	kvm_caps.host_perf_cap = 0;
 
 	if (!enable_pmu)
-		return 0;
+		return;
 
 	if (boot_cpu_has(X86_FEATURE_PDCM))
-		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
+		rdmsrl(MSR_IA32_PERF_CAPABILITIES, kvm_caps.host_perf_cap);
 
 	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
 	    !enable_passthrough_pmu) {
 		x86_perf_get_lbr(&lbr);
 		if (lbr.nr)
-			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
+			perf_cap |= kvm_caps.host_perf_cap & PMU_CAP_LBR_FMT;
 	}
 
 	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
-		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
+		perf_cap |= kvm_caps.host_perf_cap & PERF_CAP_PEBS_MASK;
 		if ((perf_cap & PERF_CAP_PEBS_FORMAT) < 4)
 			perf_cap &= ~PERF_CAP_PEBS_BASELINE;
 	}
 
-	return perf_cap;
+	kvm_caps.supported_perf_cap = perf_cap;
 }
 
 static __init void vmx_set_cpu_caps(void)
@@ -7946,7 +7947,7 @@ static __init void vmx_set_cpu_caps(void)
 
 	if (!enable_pmu)
 		kvm_cpu_cap_clear(X86_FEATURE_PDCM);
-	kvm_caps.supported_perf_cap = vmx_get_perf_capabilities();
+	vmx_get_perf_capabilities();
 
 	if (!enable_sgx) {
 		kvm_cpu_cap_clear(X86_FEATURE_SGX);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 38b73e98eae9..a29eb0469d7e 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -28,6 +28,7 @@ struct kvm_caps {
 	u64 supported_mce_cap;
 	u64 supported_xcr0;
 	u64 supported_xss;
+	u64 host_perf_cap;
 	u64 supported_perf_cap;
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (25 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 26/41] KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU capability Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:50   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Xiong Zhang
                   ` (15 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Since the perf topdown metrics feature is not yet supported for guests,
clear the PERF_METRICS MSR when restoring the guest PMU context.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4b4da7f17895..ad0434646a29 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -916,6 +916,10 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
 	 */
 	for (i = pmu->nr_arch_fixed_counters; i < kvm_pmu_cap.num_counters_fixed; i++)
 		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
+
+	/* Clear PERF_METRICS MSR since guest topdown metrics is not supported yet. */
+	if (kvm_caps.host_perf_cap & PMU_CAP_PERF_METRICS)
+		wrmsrl(MSR_PERF_METRICS, 0);
 }
 
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (26 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:54   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 29/41] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Xiong Zhang
                   ` (14 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

In PMU passthrough mode, use the global_ctrl field in struct kvm_pmu as the
cached value of the guest's IA32_PERF_GLOBAL_CTRL. This is convenient for
KVM to set and get the value from the host side. In addition, load and save
the value across the VM-entry/exit boundary in the following way:

 - At VM-exit, if the processor supports
   VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, read the guest
   IA32_PERF_GLOBAL_CTRL from the GUEST_IA32_PERF_GLOBAL_CTRL VMCS field,
   else read it from the VM-exit MSR-store array in the VMCS. The value is
   then assigned to global_ctrl.

 - At VM-entry, if the processor supports
   VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, write global_ctrl to the
   GUEST_IA32_PERF_GLOBAL_CTRL VMCS field, else write it to the VM-entry
   MSR-load array in the VMCS.

Implement the above logic in two helper functions and invoke them around
the VM-entry/exit boundary.

Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 51 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 50100954cd92..a9623351eafe 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7193,7 +7193,7 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
-static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+static void __atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 {
 	int i, nr_msrs;
 	struct perf_guest_switch_msr *msrs;
@@ -7216,6 +7216,52 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 					msrs[i].host, false);
 }
 
+static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
+	int i;
+
+	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
+		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
+	} else {
+		i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest,
+						MSR_CORE_PERF_GLOBAL_CTRL);
+		if (i < 0)
+			return;
+		pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
+	}
+}
+
+static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+	u64 global_ctrl = vcpu_to_pmu(&vmx->vcpu)->global_ctrl;
+	int i;
+
+	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
+		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
+	} else {
+		i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest,
+						MSR_CORE_PERF_GLOBAL_CTRL);
+		if (i < 0)
+			return;
+
+		vmx->msr_autoload.guest.val[i].value = global_ctrl;
+	}
+}
+
+static void __atomic_switch_perf_msrs_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+	load_perf_global_ctrl_in_passthrough_pmu(vmx);
+}
+
+static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+{
+	if (is_passthrough_pmu_enabled(&vmx->vcpu))
+		__atomic_switch_perf_msrs_in_passthrough_pmu(vmx);
+	else
+		__atomic_switch_perf_msrs(vmx);
+}
+
 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7314,6 +7360,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr2 = native_read_cr2();
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		save_perf_global_ctrl_in_passthrough_pmu(vmx);
+
 	vmx->idt_vectoring_info = 0;
 
 	vmx_enable_fb_clear(vmx);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 29/41] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (27 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 30/41] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Xiong Zhang
                   ` (13 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Exclude the existing vLBR logic from the passthrough PMU because it does not
support LBR-related MSRs. To avoid any side effects, do not call vLBR-related
code in either vcpu_enter_guest() or the PMI injection function.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 13 ++++++++-----
 arch/x86/kvm/vmx/vmx.c       |  2 +-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ad0434646a29..9bbd5084a766 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -688,13 +688,16 @@ static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
 
 static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
-	u8 version = vcpu_to_pmu(vcpu)->version;
+	u8 version;
 
-	if (!intel_pmu_lbr_is_enabled(vcpu))
-		return;
+	if (!is_passthrough_pmu_enabled(vcpu)) {
+		if (!intel_pmu_lbr_is_enabled(vcpu))
+			return;
 
-	if (version > 1 && version < 4)
-		intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+		version = vcpu_to_pmu(vcpu)->version;
+		if (version > 1 && version < 4)
+			intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+	}
 }
 
 static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a9623351eafe..d28afa87be70 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7469,7 +7469,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	pt_guest_enter(vmx);
 
 	atomic_switch_perf_msrs(vmx);
-	if (intel_pmu_lbr_is_enabled(vcpu))
+	if (!is_passthrough_pmu_enabled(&vmx->vcpu) && intel_pmu_lbr_is_enabled(vcpu))
 		vmx_passthrough_lbr_msrs(vcpu);
 
 	if (enable_preemption_timer)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 30/41] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (28 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 29/41] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 31/41] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Xiong Zhang
                   ` (12 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

Switch the PMI handler at the KVM context switch boundary because KVM uses a
separate maskable interrupt vector, rather than the host PMU's NMI, to process
its own PMIs.  So invoke the perf APIs that switch the PMI vector at these
boundaries.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 9d737f5b96bf..cd559fd74f65 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -904,11 +904,15 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
 	lockdep_assert_irqs_disabled();
 
 	static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
+
+	perf_guest_switch_to_host_pmi_vector();
 }
 
 void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
 {
 	lockdep_assert_irqs_disabled();
 
+	perf_guest_switch_to_kvm_pmi_vector(kvm_lapic_get_lvtpc_mask(vcpu));
+
 	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 31/41] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (29 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 30/41] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 32/41] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Xiong Zhang
                   ` (11 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

The perf subsystem should stop all host-level perf events when entering the
passthrough PMU and restart them when leaving it. So invoke the corresponding
perf APIs in the PMU context switch functions.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cd559fd74f65..afc9f7eb3a6b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -906,12 +906,16 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
 	static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
 
 	perf_guest_switch_to_host_pmi_vector();
+
+	perf_guest_exit();
 }
 
 void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
 {
 	lockdep_assert_irqs_disabled();
 
+	perf_guest_enter();
+
 	perf_guest_switch_to_kvm_pmi_vector(kvm_lapic_get_lvtpc_mask(vcpu));
 
 	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 32/41] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (30 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 31/41] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 33/41] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Xiong Zhang
                   ` (10 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

Add the PMU context switch calls at the VM entry/exit boundary.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 074452aa700d..fe7da1a16c3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10898,6 +10898,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		kvm_pmu_restore_pmu_context(vcpu);
+
 	guest_timing_enter_irqoff();
 
 	for (;;) {
@@ -10926,6 +10929,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		++vcpu->stat.exits;
 	}
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		kvm_pmu_save_pmu_context(vcpu);
+
 	/*
 	 * Do this here before restoring debug registers on the host.  And
 	 * since we do this before handling the vmexit, a DR access vmexit
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 33/41] KVM: x86/pmu: Make check_pmu_event_filter() an exported function
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (31 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 32/41] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 34/41] KVM: x86/pmu: Intercept EVENT_SELECT MSR Xiong Zhang
                   ` (9 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Export check_pmu_event_filter() so that it is usable by vendor modules like
kvm_intel. This is needed because the passthrough PMU intercepts guest writes
to event selectors and does the event filter checking directly inside the
vendor-specific set_msr() instead of deferring it to the KVM_REQ_PMU handler.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 3 ++-
 arch/x86/kvm/pmu.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index afc9f7eb3a6b..e7ad97734705 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -356,7 +356,7 @@ static bool is_fixed_event_allowed(struct kvm_x86_pmu_event_filter *filter,
 	return true;
 }
 
-static bool check_pmu_event_filter(struct kvm_pmc *pmc)
+bool check_pmu_event_filter(struct kvm_pmc *pmc)
 {
 	struct kvm_x86_pmu_event_filter *filter;
 	struct kvm *kvm = pmc->vcpu->kvm;
@@ -370,6 +370,7 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
 
 	return is_fixed_event_allowed(filter, pmc->idx);
 }
+EXPORT_SYMBOL_GPL(check_pmu_event_filter);
 
 static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
 {
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index a4c0b2e2c24b..6f44fe056368 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -292,6 +292,7 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id);
 void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
 void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu);
 void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu);
+bool check_pmu_event_filter(struct kvm_pmc *pmc);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 34/41] KVM: x86/pmu: Intercept EVENT_SELECT MSR
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (32 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 33/41] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:55   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 35/41] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Xiong Zhang
                   ` (8 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

Event selectors for GP counters are still intercepted for the purpose of
security, i.e., preventing the guest from using disallowed events to steal
information or take advantage of any CPU errata.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 1 -
 arch/x86/kvm/vmx/vmx.c       | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9bbd5084a766..621922005184 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -809,7 +809,6 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 	int i;
 
 	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_gp_counters; i++) {
-		vmx_set_intercept_for_msr(vcpu, MSR_ARCH_PERFMON_EVENTSEL0 + i, MSR_TYPE_RW, false);
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, false);
 		if (fw_writes_is_enabled(vcpu))
 			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, false);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d28afa87be70..1a518800d154 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -698,7 +698,6 @@ static bool is_valid_passthrough_msr(u32 msr)
 	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
-	case MSR_ARCH_PERFMON_EVENTSEL0 ... MSR_ARCH_PERFMON_EVENTSEL0 + 7:
 	case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
 	case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 35/41] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (33 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 34/41] KVM: x86/pmu: Intercept EVENT_SELECT MSR Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 36/41] KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR Xiong Zhang
                   ` (7 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Mingwei Zhang <mizhang@google.com>

Only allow writing to the event selector if the event is allowed by the
filter. Since the passthrough PMU implementation does the PMU context switch at
the VM entry/exit boundary, even if the event selector value passes the check
it cannot be written directly to HW, because the PMU HW is owned by the host at
that moment. Because of that, introduce eventsel_hw to cache the value; it is
written to HW just before VM entry.

Note that regardless of whether an event value is allowed, the value is cached
in pmc->eventsel and the guest can always read the cached value back. This
behavior is consistent with the HW CPU design.
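
For illustration, the guest-visible behavior is roughly the following (a
sketch only, assuming the host's event filter denies the chosen event; the
event encoding 0xc0 is just an example, not part of this patch):

	/* Illustrative guest-side sketch; assumes the host filter denies
	 * this event, so eventsel_hw stays 0 on the host side.
	 */
	static void guest_probe_filtered_event(void)
	{
		u64 sel, cnt;

		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0,
		       ARCH_PERFMON_EVENTSEL_ENABLE | ARCH_PERFMON_EVENTSEL_USR | 0xc0);
		rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0, sel);	/* reads back the written value */
		rdmsrl(MSR_IA32_PERFCTR0, cnt);			/* never advances: eventsel_hw == 0 */
	}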

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 18 ++++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ede45c923089..fd1c69371dbf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -503,6 +503,7 @@ struct kvm_pmc {
 	u64 counter;
 	u64 prev_counter;
 	u64 eventsel;
+	u64 eventsel_hw;
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
 	/*
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 621922005184..92c5baed8d36 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -458,7 +458,18 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (data & reserved_bits)
 				return 1;
 
-			if (data != pmc->eventsel) {
+			if (is_passthrough_pmu_enabled(vcpu)) {
+				pmc->eventsel = data;
+				if (!check_pmu_event_filter(pmc)) {
+					/* When the guest requests an invalid
+					 * event, stop the counter by clearing
+					 * the event selector MSR.
+					 */
+					pmc->eventsel_hw = 0;
+					return 0;
+				}
+				pmc->eventsel_hw = data;
+			} else if (data != pmc->eventsel) {
 				pmc->eventsel = data;
 				kvm_pmu_request_counter_reprogram(pmc);
 			}
@@ -843,13 +854,12 @@ static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
 		pmc = &pmu->gp_counters[i];
 		rdpmcl(i, pmc->counter);
-		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
 		/*
 		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
 		 * leakage and also avoid this guest GP counter get accidentally
 		 * enabled during host running when host enable global ctrl.
 		 */
-		if (pmc->eventsel)
+		if (pmc->eventsel_hw)
 			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
 		if (pmc->counter)
 			wrmsrl(MSR_IA32_PMC0 + i, 0);
@@ -894,7 +904,7 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
 		pmc = &pmu->gp_counters[i];
 		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
-		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
+		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel_hw);
 	}
 
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 36/41] KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (34 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 35/41] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 21:56   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Xiong Zhang
                   ` (6 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang, Xiong Zhang

From: Xiong Zhang <xiong.y.zhang@intel.com>

The fixed counter control MSR is still intercepted for the purpose of
security, i.e., preventing the guest from using disallowed fixed counters
to steal information or take advantage of any CPU errata.

Signed-off-by: Xiong Zhang  <xiong.y.zhang@intel.com>
Signed-off-by: Mingwei Zhang  <mizhang@google.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 1 -
 arch/x86/kvm/vmx/vmx.c       | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 92c5baed8d36..713c2a7c7f07 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -825,7 +825,6 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, false);
 	}
 
-	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR_CTRL, MSR_TYPE_RW, false);
 	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++)
 		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, false);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1a518800d154..7c4e1feb589b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -700,7 +700,6 @@ static bool is_valid_passthrough_msr(u32 msr)
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
 	case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
 	case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
-	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 	case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:
 	case MSR_CORE_PERF_GLOBAL_STATUS:
 	case MSR_CORE_PERF_GLOBAL_CTRL:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (35 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 36/41] KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 22:03   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 38/41] KVM: x86/pmu: Introduce PMU helper to increment counter Xiong Zhang
                   ` (5 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Allow writing to the fixed counter selector if the counter is exposed. If a
fixed counter is filtered out, that counter won't be enabled in HW.

Since the passthrough PMU implements the context switch at the VM entry/exit
boundary, the guest value cannot be written directly to HW while the HW PMU is
owned by the host. Introduce a new field, fixed_ctr_ctrl_hw, in kvm_pmu to
cache the guest value; it is written to HW at PMU context restore.

Since the passthrough PMU intercepts writes to the fixed counter selector,
there is no need to read the value at PMU context save, but still clear the
fixed counter control MSR and counters when switching out to the host PMU.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 28 ++++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fd1c69371dbf..b02688ed74f7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -527,6 +527,7 @@ struct kvm_pmu {
 	unsigned nr_arch_fixed_counters;
 	unsigned available_event_types;
 	u64 fixed_ctr_ctrl;
+	u64 fixed_ctr_ctrl_hw;
 	u64 fixed_ctr_ctrl_mask;
 	u64 global_ctrl;
 	u64 global_status;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 713c2a7c7f07..93cfb86c1292 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -68,6 +68,25 @@ static int fixed_pmc_events[] = {
 	[2] = PSEUDO_ARCH_REFERENCE_CYCLES,
 };
 
+static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64 data)
+{
+	struct kvm_pmc *pmc;
+	u64 new_data = 0;
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
+		if (check_pmu_event_filter(pmc)) {
+			pmc->current_config = fixed_ctrl_field(data, i);
+			new_data |= intel_fixed_bits_by_idx(i, pmc->current_config);
+		} else {
+			pmc->counter = 0;
+		}
+	}
+	pmu->fixed_ctr_ctrl_hw = new_data;
+	pmu->fixed_ctr_ctrl = data;
+}
+
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
 	struct kvm_pmc *pmc;
@@ -401,7 +420,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data & pmu->fixed_ctr_ctrl_mask)
 			return 1;
 
-		if (pmu->fixed_ctr_ctrl != data)
+		if (is_passthrough_pmu_enabled(vcpu))
+			reprogram_fixed_counters_in_passthrough_pmu(pmu, data);
+		else if (pmu->fixed_ctr_ctrl != data)
 			reprogram_fixed_counters(pmu, data);
 		break;
 	case MSR_IA32_PEBS_ENABLE:
@@ -864,13 +885,12 @@ static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
 			wrmsrl(MSR_IA32_PMC0 + i, 0);
 	}
 
-	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
 	/*
 	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
 	 * also avoid these guest fixed counters get accidentially enabled
 	 * during host running when host enable global ctrl.
 	 */
-	if (pmu->fixed_ctr_ctrl)
+	if (pmu->fixed_ctr_ctrl_hw)
 		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
 		pmc = &pmu->fixed_counters[i];
@@ -915,7 +935,7 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
 		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
 	}
 
-	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
 		pmc = &pmu->fixed_counters[i];
 		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 38/41] KVM: x86/pmu: Introduce PMU helper to increment counter
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (36 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-01-26  8:54 ` [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Xiong Zhang
                   ` (4 subsequent siblings)
  42 siblings, 0 replies; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Introduce a PMU helper to increment a counter for the passthrough PMU, because
it can conveniently return the overflow condition instead of deferring the
overflow check to KVM_REQ_PMU as in the original implementation. In addition,
this helper function hides architecture details.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.c              | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b02688ed74f7..869de0d81055 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -501,6 +501,7 @@ struct kvm_pmc {
 	bool is_paused;
 	bool intr;
 	u64 counter;
+	u64 emulated_counter;
 	u64 prev_counter;
 	u64 eventsel;
 	u64 eventsel_hw;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e7ad97734705..7b0bac1ac4bf 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -434,6 +434,21 @@ static void reprogram_counter(struct kvm_pmc *pmc)
 	pmc->prev_counter = 0;
 }
 
+static bool kvm_passthrough_pmu_incr_counter(struct kvm_pmc *pmc)
+{
+	if (!pmc->emulated_counter)
+		return false;
+
+	pmc->counter += pmc->emulated_counter;
+	pmc->emulated_counter = 0;
+	pmc->counter &= pmc_bitmask(pmc);
+
+	if (!pmc->counter)
+		return true;
+
+	return false;
+}
+
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (37 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 38/41] KVM: x86/pmu: Introduce PMU helper to increment counter Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 23:12   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU Xiong Zhang
                   ` (3 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Implement emulated counter increment for the passthrough PMU under KVM_REQ_PMU.
Defer the counter increment to the KVM_REQ_PMU handler because counter
increment requests come from kvm_pmu_trigger_event(), which can be triggered
either within the KVM_RUN inner loop or outside of it. This means the counter
increment could happen before or after the PMU context switch.

So processing the counter increments in one place keeps the implementation
simple.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/pmu.c              | 52 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/pmu.h              |  1 +
 arch/x86/kvm/x86.c              |  8 +++--
 4 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 869de0d81055..9080319751de 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -532,6 +532,7 @@ struct kvm_pmu {
 	u64 fixed_ctr_ctrl_mask;
 	u64 global_ctrl;
 	u64 global_status;
+	u64 synthesized_overflow;
 	u64 counter_bitmask[2];
 	u64 global_ctrl_mask;
 	u64 global_status_mask;
@@ -550,6 +551,7 @@ struct kvm_pmu {
 		atomic64_t __reprogram_pmi;
 	};
 	DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
+	DECLARE_BITMAP(incremented_pmc_idx, X86_PMC_IDX_MAX);
 	DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
 
 	u64 ds_area;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7b0bac1ac4bf..9e62e96fe48a 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -449,6 +449,26 @@ static bool kvm_passthrough_pmu_incr_counter(struct kvm_pmc *pmc)
 	return false;
 }
 
+void kvm_passthrough_pmu_handle_event(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	int bit;
+
+	for_each_set_bit(bit, pmu->incremented_pmc_idx, X86_PMC_IDX_MAX) {
+		struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);
+
+		if (kvm_passthrough_pmu_incr_counter(pmc)) {
+			__set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->synthesized_overflow);
+
+			if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
+				kvm_make_request(KVM_REQ_PMI, vcpu);
+		}
+	}
+	bitmap_zero(pmu->incremented_pmc_idx, X86_PMC_IDX_MAX);
+	pmu->global_status |= pmu->synthesized_overflow;
+	pmu->synthesized_overflow = 0;
+}
+
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -748,7 +768,29 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc)
 	return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
 }
 
-void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
+static void __kvm_passthrough_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	int i;
+
+	for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
+		pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);
+
+		if (!pmc || !pmc_speculative_in_use(pmc) ||
+		    !check_pmu_event_filter(pmc))
+			continue;
+
+		/* Ignore checks for edge detect, pin control, invert and CMASK bits */
+		if (eventsel_match_perf_hw_id(pmc, perf_hw_id) && cpl_is_matched(pmc)) {
+			pmc->emulated_counter += 1;
+			__set_bit(pmc->idx, pmu->incremented_pmc_idx);
+			kvm_make_request(KVM_REQ_PMU, vcpu);
+		}
+	}
+}
+
+static void __kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
@@ -765,6 +807,14 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
 			kvm_pmu_incr_counter(pmc);
 	}
 }
+
+void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id)
+{
+	if (is_passthrough_pmu_enabled(vcpu))
+		__kvm_passthrough_pmu_trigger_event(vcpu, perf_hw_id);
+	else
+		__kvm_pmu_trigger_event(vcpu, perf_hw_id);
+}
 EXPORT_SYMBOL_GPL(kvm_pmu_trigger_event);
 
 static bool is_masked_filter_valid(const struct kvm_x86_pmu_event_filter *filter)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 6f44fe056368..0fc37a06fe48 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -277,6 +277,7 @@ static inline bool is_passthrough_pmu_enabled(struct kvm_vcpu *vcpu)
 
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
+void kvm_passthrough_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fe7da1a16c3b..1bbf312cbd73 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10726,8 +10726,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
 			record_steal_time(vcpu);
-		if (kvm_check_request(KVM_REQ_PMU, vcpu))
-			kvm_pmu_handle_event(vcpu);
+		if (kvm_check_request(KVM_REQ_PMU, vcpu)) {
+			if (is_passthrough_pmu_enabled(vcpu))
+				kvm_passthrough_pmu_handle_event(vcpu);
+			else
+				kvm_pmu_handle_event(vcpu);
+		}
 		if (kvm_check_request(KVM_REQ_PMI, vcpu))
 			kvm_pmu_deliver_pmi(vcpu);
 #ifdef CONFIG_KVM_SMM
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (38 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 23:18   ` Sean Christopherson
  2024-01-26  8:54 ` [RFC PATCH 41/41] KVM: nVMX: Add nested virtualization support for passthrough PMU Xiong Zhang
                   ` (2 subsequent siblings)
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Separate passthrough PMU logic from non-passthrough vPMU code. There are
two places in the passthrough vPMU where set/get_msr() may call into the
existing non-passthrough vPMU code: 1) setting/getting counters; 2) setting the
global_ctrl MSR.

In the former case, the non-passthrough vPMU calls into
pmc_{read,write}_counter(), which wires into the perf API. Update these
functions to avoid the perf API invocation.

The second case is where a global_ctrl MSR write invokes reprogram_counters(),
which invokes the non-passthrough PMU logic. So use the pmu->passthrough flag
to guard that call.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c |  4 +++-
 arch/x86/kvm/pmu.h | 10 +++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 9e62e96fe48a..de653a67ba93 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -652,7 +652,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (pmu->global_ctrl != data) {
 			diff = pmu->global_ctrl ^ data;
 			pmu->global_ctrl = data;
-			reprogram_counters(pmu, diff);
+			/* Passthrough vPMU never reprogram counters. */
+			if (!pmu->passthrough)
+				reprogram_counters(pmu, diff);
 		}
 		break;
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 0fc37a06fe48..ab8d4a8e58a8 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -70,6 +70,9 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 	u64 counter, enabled, running;
 
 	counter = pmc->counter;
+	if (pmc_to_pmu(pmc)->passthrough)
+		return counter & pmc_bitmask(pmc);
+
 	if (pmc->perf_event && !pmc->is_paused)
 		counter += perf_event_read_value(pmc->perf_event,
 						 &enabled, &running);
@@ -79,7 +82,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 
 static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
 {
-	pmc->counter += val - pmc_read_counter(pmc);
+	/* In passthrough PMU, counter value is the actual value in HW. */
+	if (pmc_to_pmu(pmc)->passthrough)
+		pmc->counter = val;
+	else
+		pmc->counter += val - pmc_read_counter(pmc);
+
 	pmc->counter &= pmc_bitmask(pmc);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [RFC PATCH 41/41] KVM: nVMX: Add nested virtualization support for passthrough PMU
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (39 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU Xiong Zhang
@ 2024-01-26  8:54 ` Xiong Zhang
  2024-04-11 23:21   ` Sean Christopherson
  2024-04-11 17:03 ` [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Sean Christopherson
  2024-04-11 23:25 ` Sean Christopherson
  42 siblings, 1 reply; 181+ messages in thread
From: Xiong Zhang @ 2024-01-26  8:54 UTC (permalink / raw)
  To: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson
  Cc: kvm, linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, xiong.y.zhang

From: Mingwei Zhang <mizhang@google.com>

Add nested virtualization support for the passthrough PMU by combining the MSR
interception bitmaps of vmcs01 and vmcs12. Readers may argue that even without
this patch, nested virtualization works for the passthrough PMU because L1 will
see PerfMon v2 and will have to use the legacy vPMU implementation if it is
Linux. However, any assumption made about L1 may be invalid, e.g., L1 may not
even be Linux.

If both L0 and L1 pass through PMU MSRs, the correct behavior is to let MSR
accesses from L2 directly touch the HW MSRs, since both L0 and L1 pass through
the access.

However, in the current implementation, without adding anything for nested,
KVM always sets the MSR interception bits in vmcs02. As a result, L0 will
emulate all MSR reads/writes for L2, leading to errors, since the current
passthrough vPMU never implements set_msr() and get_msr() for any counter
access except accesses from the VMM side.

So fix the issue by setting up the correct MSR interception for PMU MSRs.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/nested.c | 52 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff7..95e1c78152da 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -561,6 +561,55 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
 						   msr_bitmap_l0, msr);
 }
 
+/* Pass PMU MSRs to nested VM if L0 and L1 are set to passthrough. */
+static void nested_vmx_set_passthru_pmu_intercept_for_msr(struct kvm_vcpu *vcpu,
+							  unsigned long *msr_bitmap_l1,
+							  unsigned long *msr_bitmap_l0)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_ARCH_PERFMON_EVENTSEL0 + i,
+						 MSR_TYPE_RW);
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_IA32_PERFCTR0 + i,
+						 MSR_TYPE_RW);
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_IA32_PMC0 + i,
+						 MSR_TYPE_RW);
+	}
+
+	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++) {
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_CORE_PERF_FIXED_CTR0 + i,
+						 MSR_TYPE_RW);
+	}
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_FIXED_CTR_CTRL,
+					 MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_GLOBAL_STATUS,
+					 MSR_TYPE_RW);
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_GLOBAL_CTRL,
+					 MSR_TYPE_RW);
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+					 MSR_TYPE_RW);
+}
+
 /*
  * Merge L0's and L1's MSR bitmap, return false to indicate that
  * we do not use the hardware.
@@ -660,6 +709,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		nested_vmx_set_passthru_pmu_intercept_for_msr(vcpu, msr_bitmap_l1, msr_bitmap_l0);
+
 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
 
 	vmx->nested.force_msr_bitmap_recalc = false;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-01-26  8:54 ` [RFC PATCH 02/41] perf: Support guest enter/exit interfaces Xiong Zhang
@ 2024-03-20 16:40   ` Raghavendra Rao Ananta
  2024-03-20 17:12     ` Liang, Kan
  2024-04-11 18:06   ` Sean Christopherson
  1 sibling, 1 reply; 181+ messages in thread
From: Raghavendra Rao Ananta @ 2024-03-20 16:40 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Kan Liang

Hi Kan,

>
> +static void __perf_force_exclude_guest_pmu(struct perf_event_pmu_context *pmu_ctx,
> +                                          struct perf_event *event)
> +{
> +       struct perf_event_context *ctx = pmu_ctx->ctx;
> +       struct perf_event *sibling;
> +       bool include_guest = false;
> +
> +       event_sched_out(event, ctx);
> +       if (!event->attr.exclude_guest)
> +               include_guest = true;
> +       for_each_sibling_event(sibling, event) {
> +               event_sched_out(sibling, ctx);
> +               if (!sibling->attr.exclude_guest)
> +                       include_guest = true;
> +       }
> +       if (include_guest) {
> +               perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> +               for_each_sibling_event(sibling, event)
> +                       perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> +       }
Does the perf core revert the PERF_EVENT_STATE_ERROR state somewhere
from the perf_guest_exit() path, or is it expected to remain in this
state?
IIUC, in the perf_guest_exit() path, when we land into
merge_sched_in(), we never schedule the event back if event->state <=
PERF_EVENT_STATE_OFF.

Thank you.
Raghavendra

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-03-20 16:40   ` Raghavendra Rao Ananta
@ 2024-03-20 17:12     ` Liang, Kan
  0 siblings, 0 replies; 181+ messages in thread
From: Liang, Kan @ 2024-03-20 17:12 UTC (permalink / raw)
  To: Raghavendra Rao Ananta, Xiong Zhang
  Cc: seanjc, pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024-03-20 12:40 p.m., Raghavendra Rao Ananta wrote:
> Hi Kan,
> 
>>
>> +static void __perf_force_exclude_guest_pmu(struct perf_event_pmu_context *pmu_ctx,
>> +                                          struct perf_event *event)
>> +{
>> +       struct perf_event_context *ctx = pmu_ctx->ctx;
>> +       struct perf_event *sibling;
>> +       bool include_guest = false;
>> +
>> +       event_sched_out(event, ctx);
>> +       if (!event->attr.exclude_guest)
>> +               include_guest = true;
>> +       for_each_sibling_event(sibling, event) {
>> +               event_sched_out(sibling, ctx);
>> +               if (!sibling->attr.exclude_guest)
>> +                       include_guest = true;
>> +       }
>> +       if (include_guest) {
>> +               perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>> +               for_each_sibling_event(sibling, event)
>> +                       perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>> +       }
> Does the perf core revert the PERF_EVENT_STATE_ERROR state somewhere
> from the perf_guest_exit() path, or is it expected to remain in this
> state?
> IIUC, in the perf_guest_exit() path, when we land into
> merge_sched_in(), we never schedule the event back if event->state <=
> PERF_EVENT_STATE_OFF.
> 

Perf doesn't revert an event in the ERROR state. The user asked to profile
both guest and host, but the pass-through mode doesn't allow profiling of the
guest. So the event has to error out and remain in the ERROR state.
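
For example, a host event that should keep counting across guest runs would be
opened with exclude_guest set, so it is only stopped/restarted around
VM-entry/exit instead of being put into the ERROR state (a minimal sketch, not
taken from the patch set; the helper name is mine):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Illustrative helper, not from the patch set. */
	static int open_host_only_cycles(void)
	{
		struct perf_event_attr attr = {
			.size          = sizeof(attr),
			.type          = PERF_TYPE_HARDWARE,
			.config        = PERF_COUNT_HW_CPU_CYCLES,
			.exclude_guest = 1,	/* stopped/restarted around guests, never ERROR */
		};

		/* system-wide event pinned to CPU 0: pid == -1, cpu == 0 */
		return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	}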

Thanks,
Kan


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (40 preceding siblings ...)
  2024-01-26  8:54 ` [RFC PATCH 41/41] KVM: nVMX: Add nested virtualization support for passthrough PMU Xiong Zhang
@ 2024-04-11 17:03 ` Sean Christopherson
  2024-04-12  2:19   ` Zhang, Xiong Y
  2024-04-18 20:46   ` Mingwei Zhang
  2024-04-11 23:25 ` Sean Christopherson
  42 siblings, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 17:03 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

<bikeshed>

I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
emulates the control plane (controls and event selectors), while the data is
fully passed through (counters).

</bikeshed>

On Fri, Jan 26, 2024, Xiong Zhang wrote:

> 1. host system wide / QEMU events handling during VM running
>    At VM-entry, all the host perf events which use host x86 PMU will be
>    stopped. These events with attr.exclude_guest = 1 will be stopped here
>    and re-started after vm-exit. These events without attr.exclude_guest=1
>    will be in the error state, and they cannot recover into the active state
>    even if the guest stops running. This impacts host perf a lot and requires
>    host system-wide perf events to have attr.exclude_guest=1.
> 
>    This requires QEMU process perf events to have attr.exclude_guest=1 also.
> 
>    During VM running, perf event creation for system-wide and QEMU-process
>    events without attr.exclude_guest=1 fails with -EBUSY.
> 
> 2. NMI watchdog
>    The perf event for the NMI watchdog is a system-wide CPU-pinned event; it
>    will also be stopped during VM running, but it doesn't have
>    attr.exclude_guest=1, so we add that in this RFC. This still means the NMI
>    watchdog loses its function during VM running.
> 
>    Two candidates exist for replacing the perf event of the NMI watchdog:
>    a. The buddy hardlockup detector [3] may not be reliable enough to
>       replace the perf event.
>    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.

I think the simplest solution is to allow mediated PMU usage if and only if
the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
problem to solve.
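
Something along these lines at module init, perhaps (purely a sketch of the
idea; "perf_nmi_watchdog_active()" is a made-up name for whatever predicate
the watchdog code ends up exposing, not an existing helper):

	/* Sketch: gate the mediated PMU on the NMI watchdog at module load.
	 * perf_nmi_watchdog_active() is a hypothetical predicate.
	 */
	if (enable_passthrough_pmu && perf_nmi_watchdog_active()) {
		pr_warn("kvm: mediated PMU requires the NMI watchdog to be disabled\n");
		enable_passthrough_pmu = false;
	}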

> 3. Dedicated kvm_pmi_vector
>    In the emulated vPMU, the host PMI handler notifies KVM to inject a
>    virtual PMI into the guest when a physical PMI belongs to a guest counter.
>    If the same mechanism is used in the passthrough vPMU and PMI skid exists,
>    causing a physical PMI that belongs to the guest to arrive after VM-exit,
>    then the host PMI handler couldn't tell whether the PMI belongs to the
>    host or to the guest.
>    So this RFC uses a dedicated kvm_pmi_vector; only PMIs belonging to the
>    guest use this vector. PMIs belonging to the host still use the NMI
>    vector.
> 
>    Without considering PMI skid, especially on AMD, the host NMI vector
>    could be used for guest PMIs too; this method is simpler and doesn't

I don't see how multiplexing NMIs between guest and host is simpler.  At best,
the complexity is a wash, just in different locations, and I highly doubt it's
a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
LVTPC.

E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.

>    need the x86 subsystem to reserve a dedicated kvm_pmi_vector, and we
>    didn't hit the skid PMI issue on modern Intel processors.
> 
> 4. per-VM passthrough mode configuration
>    The current RFC uses a read-only KVM module parameter,
>    enable_passthrough_pmu, which decides whether the vPMU is in passthrough
>    or emulated mode at kvm module load time.
>    Do we need the capability of per-VM passthrough mode configuration?
>    So an admin can launch some non-passthrough VMs and profile these
>    non-passthrough VMs on the host, but the admin still cannot profile all
>    the VMs once a passthrough VM exists. This means passthrough vPMU and
>    emulated vPMU would mix on one platform, which is challenging to implement.
>    As noted in the commit message of commit 0011, the main challenge is that
>    passthrough vPMU and emulated vPMU have different vPMU features; this
>    ends up with two different values for kvm_cap.supported_perf_cap, which
>    is initialized at module load time. Supporting this requires more
>    refactoring.

I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
existing vPMU support entirely, but that's probably not realistic, at least not
in the near future.

> Remaining Work
> ===
> 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.

Before this gets out of its "RFC" phase, I would at least like line of sight to
a more optimized switch.  I 100% agree that starting with a conservative
implementation is the way to go, and the kernel absolutely needs to be able to
profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
guest PMU loaded for the entirety of KVM_RUN isn't a viable option.

But I also don't want to get into a situation where we can't figure out a clean,
robust way to do the optimized context switch without needing (another) massive
rewrite.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-01-26  8:54 ` [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH Xiong Zhang
@ 2024-04-11 17:04   ` Sean Christopherson
  2024-04-11 17:21     ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 17:04 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Kan Liang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
> and later PMUs

Why?  I get that this is an RFC, but it's not at all obvious to me why this needs to
take a dependency on v4+.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling
  2024-01-26  8:54 ` [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling Xiong Zhang
@ 2024-04-11 17:10   ` Sean Christopherson
  2024-04-11 19:05     ` Sean Christopherson
  2024-04-12  3:56     ` Zhang, Xiong Y
  0 siblings, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 17:10 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> Create a new vector in the host IDT for PMI handling within a passthrough
> vPMU implementation. In addition, add a function to allow the registration
> of the handler and a function to switch the PMI handler.
> 
> This is the preparation work to support KVM passthrough vPMU to handle its
> own PMIs without interference from PMI handler of the host PMU.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/include/asm/hardirq.h           |  1 +
>  arch/x86/include/asm/idtentry.h          |  1 +
>  arch/x86/include/asm/irq.h               |  1 +
>  arch/x86/include/asm/irq_vectors.h       |  2 +-
>  arch/x86/kernel/idt.c                    |  1 +
>  arch/x86/kernel/irq.c                    | 29 ++++++++++++++++++++++++
>  tools/arch/x86/include/asm/irq_vectors.h |  1 +
>  7 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
> index 66837b8c67f1..c1e2c1a480bf 100644
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -19,6 +19,7 @@ typedef struct {
>  	unsigned int kvm_posted_intr_ipis;
>  	unsigned int kvm_posted_intr_wakeup_ipis;
>  	unsigned int kvm_posted_intr_nested_ipis;
> +	unsigned int kvm_vpmu_pmis;

Somewhat off topic, does anyone actually ever use these particular stats?  If the
desire is to track _all_ IRQs, why not have an array and bump the counts in common
code?

>  #endif
>  	unsigned int x86_platform_ipis;	/* arch dependent */
>  	unsigned int apic_perf_irqs;
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 05fd175cec7d..d1b58366bc21 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -675,6 +675,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
> +DECLARE_IDTENTRY_SYSVEC(KVM_VPMU_VECTOR,	        sysvec_kvm_vpmu_handler);

I vote for KVM_VIRTUAL_PMI_VECTOR.  I don't see any reason to abbreviate "virtual",
and the vector is for a Performance Monitoring Interrupt.

>  #endif
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
> index 836c170d3087..ee268f42d04a 100644
> --- a/arch/x86/include/asm/irq.h
> +++ b/arch/x86/include/asm/irq.h
> @@ -31,6 +31,7 @@ extern void fixup_irqs(void);
>  
>  #ifdef CONFIG_HAVE_KVM
>  extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
> +extern void kvm_set_vpmu_handler(void (*handler)(void));

virtual_pmi_handler()

>  #endif
>  
>  extern void (*x86_platform_ipi_callback)(void);
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 3a19904c2db6..120403572307 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -77,7 +77,7 @@
>   */
>  #define IRQ_WORK_VECTOR			0xf6
>  
> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
> +#define KVM_VPMU_VECTOR			0xf5

This should be inside

	#ifdef CONFIG_HAVE_KVM

no?

>  #define DEFERRED_ERROR_VECTOR		0xf4
>  
>  /* Vector on which hypervisor callbacks will be delivered */
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index 8857abc706e4..6944eec251f4 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
>  	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>  	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>  	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
> +	INTG(KVM_VPMU_VECTOR,		        asm_sysvec_kvm_vpmu_handler),

kvm_virtual_pmi_handler

> @@ -332,6 +351,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>  	apic_eoi();
>  	inc_irq_stat(kvm_posted_intr_nested_ipis);
>  }
> +
> +/*
> + * Handler for KVM_PT_PMU_VECTOR.

Heh, not sure where the PT part came from...

> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_vpmu_handler)
> +{
> +	apic_eoi();
> +	inc_irq_stat(kvm_vpmu_pmis);
> +	kvm_vpmu_handler();
> +}
>  #endif
>  
>  
> diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
> index 3a19904c2db6..3773e60f1af8 100644
> --- a/tools/arch/x86/include/asm/irq_vectors.h
> +++ b/tools/arch/x86/include/asm/irq_vectors.h
> @@ -85,6 +85,7 @@
>  
>  /* Vector for KVM to deliver posted interrupt IPI */
>  #ifdef CONFIG_HAVE_KVM
> +#define KVM_VPMU_VECTOR			0xf5

Heh, and your copy+paste is out of date.

>  #define POSTED_INTR_VECTOR		0xf2
>  #define POSTED_INTR_WAKEUP_VECTOR	0xf1
>  #define POSTED_INTR_NESTED_VECTOR	0xf0
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 17:04   ` Sean Christopherson
@ 2024-04-11 17:21     ` Liang, Kan
  2024-04-11 17:24       ` Jim Mattson
  0 siblings, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-11 17:21 UTC (permalink / raw)
  To: Sean Christopherson, Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024-04-11 1:04 p.m., Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
>> and later PMUs
> 
> Why?  I get that is an RFC, but it's not at all obvious to me why this needs to
> take a dependency on v4+.

The IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs are introduced in v4. They
are used in the save/restore of PMU state. Please see PATCH 23/41.
So it's limited to v4+ for now.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 17:21     ` Liang, Kan
@ 2024-04-11 17:24       ` Jim Mattson
  2024-04-11 17:46         ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 17:24 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, mizhang,
	kan.liang, zhenyuw, dapeng1.mi, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao

On Thu, Apr 11, 2024 at 10:21 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2024-04-11 1:04 p.m., Sean Christopherson wrote:
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
> >> and later PMUs
> >
> > Why?  I get that is an RFC, but it's not at all obvious to me why this needs to
> > take a dependency on v4+.
>
> The IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs are introduced in v4. They
> are used in the save/restore of PMU state. Please see PATCH 23/41.
> So it's limited to v4+ for now.

Prior to version 4, semi-passthrough is possible, but
IA32_PERF_GLOBAL_STATUS has to be intercepted and emulated, since it
is non-trivial to set bits in this MSR.
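
For reference, a minimal sketch of that interception on the KVM read path
(this just mirrors the existing emulated-vPMU handling; it is not part of
this series):

	case MSR_CORE_PERF_GLOBAL_STATUS:
		/*
		 * Pre-v4 there is no IA32_PERF_GLOBAL_STATUS_SET, so the
		 * guest's view of overflow state has to come from a
		 * KVM-maintained shadow value rather than the hardware MSR.
		 */
		msr_info->data = pmu->global_status;
		break;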

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 17:24       ` Jim Mattson
@ 2024-04-11 17:46         ` Sean Christopherson
  2024-04-11 19:13           ` Liang, Kan
  2024-04-11 19:32           ` Sean Christopherson
  0 siblings, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 17:46 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Kan Liang, Xiong Zhang, pbonzini, peterz, mizhang, kan.liang,
	zhenyuw, dapeng1.mi, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Thu, Apr 11, 2024, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 10:21 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> > On 2024-04-11 1:04 p.m., Sean Christopherson wrote:
> > > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > >> From: Kan Liang <kan.liang@linux.intel.com>
> > >>
> > >> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
> > >> and later PMUs
> > >
> > > Why?  I get that is an RFC, but it's not at all obvious to me why this needs to
> > > take a dependency on v4+.
> >
> > The IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs are introduced in v4. They
> > are used in the save/restore of PMU state. Please see PATCH 23/41.
> > So it's limited to v4+ for now.
> 
> Prior to version 4, semi-passthrough is possible, but IA32_PERF_GLOBAL_STATUS
> has to be intercepted and emulated, since it is non-trivial to set bits in
> this MSR.

Ah, then this _perf_ capability should be PERF_PMU_CAP_WRITABLE_GLOBAL_STATUS or
so, especially since it's introduced in advance of the KVM side of things.  Then
whether or not to support a mediated PMU becomes a KVM decision, e.g. intercepting
accesses to IA32_PERF_GLOBAL_STATUS doesn't seem like a complete deal breaker
(or maybe it is, I now see the comment about it being used to do the context switch).

And peeking ahead, IIUC perf effectively _forces_ a passthrough model when
has_vpmu_passthrough_cap() is true, which is wrong.  There needs to be a user/admin
opt-in (or opt-out) to that behavior, at a kernel/perf level, not just at a KVM
level.  Hmm, or is perf relying on KVM to do the right thing?  I.e. relying on
KVM to do perf_guest_{enter,exit}() if and only if the PMU can support the
passthrough model.

If that's the case, most of the has_vpmu_passthrough_cap() checks are gratuitous
and confusing, e.g. just WARN if KVM (or some other module) tries to trigger a
PMU context switch when it's not supported by perf.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-01-26  8:54 ` [RFC PATCH 02/41] perf: Support guest enter/exit interfaces Xiong Zhang
  2024-03-20 16:40   ` Raghavendra Rao Ananta
@ 2024-04-11 18:06   ` Sean Christopherson
  2024-04-11 19:53     ` Liang, Kan
  1 sibling, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 18:06 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Kan Liang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 683dc086ef10..59471eeec7e4 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3803,6 +3803,8 @@ static inline void group_update_userpage(struct perf_event *group_event)
>  		event_update_userpage(event);
>  }
>  
> +static DEFINE_PER_CPU(bool, __perf_force_exclude_guest);
> +
>  static int merge_sched_in(struct perf_event *event, void *data)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> @@ -3814,6 +3816,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  	if (!event_filter_match(event))
>  		return 0;
>  
> +	/*
> +	 * The __perf_force_exclude_guest indicates entering the guest.
> +	 * No events of the passthrough PMU should be scheduled.
> +	 */
> +	if (__this_cpu_read(__perf_force_exclude_guest) &&
> +	    has_vpmu_passthrough_cap(event->pmu))

As mentioned in the previous reply, I think perf should WARN and reject any attempt
to trigger a "passthrough" context switch if such a switch isn't supported by
perf, not silently let it go through and then skip things later.

> +		return 0;
> +
>  	if (group_can_go_on(event, *can_add_hw)) {
>  		if (!group_sched_in(event, ctx))
>  			list_add_tail(&event->active_list, get_event_list(event));

...

> +/*
> + * When a guest enters, force all active events of the PMU, which supports
> + * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
> + * PMUs, such as uncore PMU, should not be impacted. The guest can
> + * temporarily own all counters of the PMU.
> + * During the period, all the creation of the new event of the PMU with
> + * !exclude_guest are error out.
> + */
> +void perf_guest_enter(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (__this_cpu_read(__perf_force_exclude_guest))

This should be a WARN_ON_ONCE, no?

> +		return;
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	perf_force_exclude_guest_enter(&cpuctx->ctx);
> +	if (cpuctx->task_ctx)
> +		perf_force_exclude_guest_enter(cpuctx->task_ctx);
> +
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +
> +	__this_cpu_write(__perf_force_exclude_guest, true);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_enter);
> +
> +static void perf_force_exclude_guest_exit(struct perf_event_context *ctx)
> +{
> +	struct perf_event_pmu_context *pmu_ctx;
> +	struct pmu *pmu;
> +
> +	update_context_time(ctx);
> +	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> +		pmu = pmu_ctx->pmu;
> +		if (!has_vpmu_passthrough_cap(pmu))
> +			continue;

I don't see how we can sanely support a CPU that doesn't support writable
PERF_GLOBAL_STATUS across all PMUs.

> +
> +		perf_pmu_disable(pmu);
> +		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
> +		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
> +		perf_pmu_enable(pmu);
> +	}
> +}
> +
> +void perf_guest_exit(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (!__this_cpu_read(__perf_force_exclude_guest))

WARN_ON_ONCE here too?

> +		return;
> +
> +	__this_cpu_write(__perf_force_exclude_guest, false);
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	perf_force_exclude_guest_exit(&cpuctx->ctx);
> +	if (cpuctx->task_ctx)
> +		perf_force_exclude_guest_exit(cpuctx->task_ctx);
> +
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_exit);
> +
> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
> +						 int cpu, struct task_struct *task)
> +{
> +	bool *force_exclude_guest = NULL;
> +
> +	if (!has_vpmu_passthrough_cap(event->pmu))
> +		return 0;
> +
> +	if (event->attr.exclude_guest)
> +		return 0;
> +
> +	if (cpu != -1) {
> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
> +	} else if (task && (task->flags & PF_VCPU)) {
> +		/*
> +		 * Just need to check the running CPU in the event creation. If the
> +		 * task is moved to another CPU which supports the force_exclude_guest.
> +		 * The event will filtered out and be moved to the error stage. See
> +		 * merge_sched_in().
> +		 */
> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
> +	}

These checks are extremely racy, I don't see how this can possibly do the
right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
time slices).

Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
per-CPU context mutex.

> +	if (force_exclude_guest && *force_exclude_guest)
> +		return -EBUSY;
> +	return 0;
> +}
> +
>  /*
>   * Holding the top-level event's child_mutex means that any
>   * descendant process that has inherited this event will block
> @@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  		goto err_ns;
>  	}
>  
> +	if (perf_force_exclude_guest_check(event, cpu, task)) {

This should be:

	err = perf_force_exclude_guest_check(event, cpu, task);
	if (err)
		goto err_pmu;

i.e. shouldn't effectively ignore/override the return result.

> +		err = -EBUSY;
> +		goto err_pmu;
> +	}
> +
>  	/*
>  	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
>  	 * events (they don't make sense as the cgroup will be different
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 03/41] perf: Set exclude_guest onto nmi_watchdog
  2024-01-26  8:54 ` [RFC PATCH 03/41] perf: Set exclude_guest onto nmi_watchdog Xiong Zhang
@ 2024-04-11 18:56   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 18:56 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> The perf event for NMI watchdog is per cpu pinned system wide event,
> if such event doesn't have exclude_guest flag, it will be put into
> error state once guest with passthrough PMU starts, this breaks
> NMI watchdog function totally.
> 
> This commit adds exclude_guest flag for this perf event, so this perf
> event is stopped during VM running, but it will continue working after
> VM exit. In this way the NMI watchdog can not detect hardlockups during
> VM running, it still breaks NMI watchdog function a bit. But host perf
> event must be stopped during VM with passthrough PMU running, current
> no other reliable method can be used to replace perf event for NMI
> watchdog.

As mentioned in the cover letter, I think this is backwards, and mediated PMU
support should be disallowed if kernel-priority things like the watchdog are in
use.

Doubly so because this patch affects _everything_, not just systems with VMs
that have a mediated PMU.

> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  kernel/watchdog_perf.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
> index 8ea00c4a24b2..c8ba656ff674 100644
> --- a/kernel/watchdog_perf.c
> +++ b/kernel/watchdog_perf.c
> @@ -88,6 +88,7 @@ static struct perf_event_attr wd_hw_attr = {
>  	.size		= sizeof(struct perf_event_attr),
>  	.pinned		= 1,
>  	.disabled	= 1,
> +	.exclude_guest  = 1,
>  };
>  
>  /* Callback function for perf event subsystem */
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling
  2024-04-11 17:10   ` Sean Christopherson
@ 2024-04-11 19:05     ` Sean Christopherson
  2024-04-12  3:56     ` Zhang, Xiong Y
  1 sibling, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:05 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> > index 05fd175cec7d..d1b58366bc21 100644
> > --- a/arch/x86/include/asm/idtentry.h
> > +++ b/arch/x86/include/asm/idtentry.h
> > @@ -675,6 +675,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
> >  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
> >  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
> >  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
> > +DECLARE_IDTENTRY_SYSVEC(KVM_VPMU_VECTOR,	        sysvec_kvm_vpmu_handler);
> 
> I vote for KVM_VIRTUAL_PMI_VECTOR.  I don't see any reason to abbreviate "virtual",
> and the vector is for a Performance Monitoring Interrupt.

Actually, I vote for KVM_GUEST_PMI_VECTOR.  The IRQ/PMI itself isn't virtual, it
is quite literally the vector that is used for PMIs in KVM guests.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU
  2024-01-26  8:54 ` [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU Xiong Zhang
@ 2024-04-11 19:07   ` Sean Christopherson
  2024-04-12  5:44     ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:07 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> Add function to register/unregister PMI handler at KVM module
> initialization and destroy time. This allows the host PMU with passthough
> capability enabled switch PMI handler at PMU context switch time.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c924075f6f1..4432e736129f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10611,6 +10611,18 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
>  
> +void kvm_passthrough_pmu_handler(void)

s/pmu/pmi, and this needs a verb.  Maybe kvm_handle_guest_pmi()?  Definitely
open to other names.

> +{
> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +
> +	if (!vcpu) {
> +		pr_warn_once("%s: no running vcpu found!\n", __func__);

Unless I misunderstand the code, this can/should be a full WARN_ON_ONCE.  If a
PMI skids all the way past vcpu_put(), we've got big problems.
 
> +		return;
> +	}
> +
> +	kvm_make_request(KVM_REQ_PMI, vcpu);
> +}
> +
>  /*
>   * Called within kvm->srcu read side.
>   * Returns 1 to let vcpu_run() continue the guest execution loop without
> @@ -13815,6 +13827,7 @@ static int __init kvm_x86_init(void)
>  {
>  	kvm_mmu_x86_module_init();
>  	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
> +	kvm_set_vpmu_handler(kvm_passthrough_pmu_handler);

Hmm, a few patches late, but the "kvm" scope is weird.  This calls a core x86
function, not a KVM function.

And to reduce exports and copy+paste, what about something like this?

void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
{
	if (!handler)
		handler = dummy_handler;

	if (vector == POSTED_INTR_WAKEUP_VECTOR)
		kvm_posted_intr_wakeup_handler = handler;
	else if (vector == KVM_GUEST_PMI_VECTOR)
		kvm_guest_pmi_handler = handler;
	else
		WARN_ON_ONCE(1);

	if (handler == dummy_handler)
		synchronize_rcu();
}
EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
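
KVM's module init/exit would then boil down to (sketch, using the names
suggested above):

	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, kvm_handle_guest_pmi);
	...
	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, NULL);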

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 17:46         ` Sean Christopherson
@ 2024-04-11 19:13           ` Liang, Kan
  2024-04-11 20:43             ` Sean Christopherson
  2024-04-11 19:32           ` Sean Christopherson
  1 sibling, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-11 19:13 UTC (permalink / raw)
  To: Sean Christopherson, Jim Mattson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024-04-11 1:46 p.m., Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Jim Mattson wrote:
>> On Thu, Apr 11, 2024 at 10:21 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>> On 2024-04-11 1:04 p.m., Sean Christopherson wrote:
>>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
>>>>> and later PMUs
>>>>
>>>> Why?  I get that is an RFC, but it's not at all obvious to me why this needs to
>>>> take a dependency on v4+.
>>>
>>> The IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs are introduced in v4. They
>>> are used in the save/restore of PMU state. Please see PATCH 23/41.
>>> So it's limited to v4+ for now.
>>
>> Prior to version 4, semi-passthrough is possible, but IA32_PERF_GLOBAL_STATUS
>> has to be intercepted and emulated, since it is non-trivial to set bits in
>> this MSR.
> 
> Ah, then this _perf_ capability should be PERF_PMU_CAP_WRITABLE_GLOBAL_STATUS or
> so, especially since it's introduced in advance of the KVM side of things.  Then
> whether or not to support a mediated PMU becomes a KVM decision, e.g. intercepting
> accesses to IA32_PERF_GLOBAL_STATUS doesn't seem like a complete deal breaker
> (or maybe it is, I now see the comment about it being used to do the context switch).

The PERF_PMU_CAP_VPMU_PASSTHROUGH is to indicate whether the PMU has the
capability to support passthrough mode. It's used to distinguish it from
other PMUs, e.g., the uncore PMU. It's only because the current RFC
utilizes the IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs that I have to limit
it to V4+.

I agree that it should be a KVM decision, not perf. The v4 check should
be removed.

Regarding the PERF_PMU_CAP_WRITABLE_GLOBAL_STATUS, I think perf already
passes the x86_pmu.version to KVM. Maybe KVM can add an internal flag to
track it, so a PERF_PMU_CAP_ bit can be saved?

> 
> And peeking ahead, IIUC perf effectively _forces_ a passthrough model when
> has_vpmu_passthrough_cap() is true, which is wrong.  There needs to be a user/admin
> opt-in (or opt-out) to that behavior, at a kernel/perf level, not just at a KVM
> level.  Hmm, or is perf relying on KVM to do the right thing?  I.e. relying on
> KVM to do perf_guest_{enter,exit}() if and only if the PMU can support the
> passthrough model.
>

Yes, perf relies on KVM to tell if a guest is entering the passthrough mode.

> If that's the case, most of the has_vpmu_passthrough_cap() checks are gratuitous
> and confusing, e.g. just WARN if KVM (or some other module) tries to trigger a
> PMU context switch when it's not supported by perf.

If only non-supported PMUs are running in the host, perf wouldn't
do any context switch. The guest can feel free to use the core PMU. We
should not WARN for this case.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler
  2024-01-26  8:54 ` [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler Xiong Zhang
@ 2024-04-11 19:17   ` Sean Christopherson
  2024-04-11 19:34     ` Sean Christopherson
  2024-04-12  5:57     ` Zhang, Xiong Y
  0 siblings, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:17 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> Add function to switch PMI handler since passthrough PMU and host PMU will
> use different interrupt vectors.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/events/core.c            | 15 +++++++++++++++
>  arch/x86/include/asm/perf_event.h |  3 +++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 40ad1425ffa2..3f87894d8c8e 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -701,6 +701,21 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
>  }
>  EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>  
> +void perf_guest_switch_to_host_pmi_vector(void)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	apic_write(APIC_LVTPC, APIC_DM_NMI);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
> +
> +void perf_guest_switch_to_kvm_pmi_vector(void)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_switch_to_kvm_pmi_vector);

Why slice and dice the context switch if it's all in perf?  Just do this in
perf_guest_enter().  

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
  2024-01-26  8:54 ` [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW Xiong Zhang
@ 2024-04-11 19:21   ` Sean Christopherson
  2024-04-12  6:17     ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:21 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> When guest clear LVTPC_MASK bit in guest PMI handler at PMU passthrough
> mode, this bit should be reflected onto HW, otherwise HW couldn't generate
> PMI again during VM running until it is cleared.

This fixes a bug in the previous patch, i.e. this should not be a standalone
patch.

> 
> This commit set HW LVTPC_MASK bit at PMU vecctor switching to KVM PMI
> vector.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/events/core.c            | 9 +++++++--
>  arch/x86/include/asm/perf_event.h | 2 +-
>  arch/x86/kvm/lapic.h              | 1 -
>  3 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 3f87894d8c8e..ece042cfb470 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -709,13 +709,18 @@ void perf_guest_switch_to_host_pmi_vector(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
>  
> -void perf_guest_switch_to_kvm_pmi_vector(void)
> +void perf_guest_switch_to_kvm_pmi_vector(bool mask)
>  {
>  	lockdep_assert_irqs_disabled();
>  
> -	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
> +	if (mask)
> +		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR |
> +			   APIC_LVT_MASKED);
> +	else
> +		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
>  }

Or more simply:

void perf_guest_enter(u32 guest_lvtpc)
{
	...

	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR |
			       (guest_lvtpc & APIC_LVT_MASKED));
}

and then on the KVM side:

	perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));

because an in-kernel APIC should be a hard requirement for the mediated PMU.
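
E.g. enforce it when the vCPU's PMU is configured, something like (sketch;
the "passthrough" field name is the one used in this series):

	/* the mediated PMU requires the in-kernel local APIC */
	if (!lapic_in_kernel(vcpu))
		pmu->passthrough = false;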

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 08/41] KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
  2024-01-26  8:54 ` [RFC PATCH 08/41] KVM: x86/pmu: Add get virtual LVTPC_MASK bit function Xiong Zhang
@ 2024-04-11 19:22   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:22 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> On PMU passthrough mode, guest virtual LVTPC_MASK bit must be reflected
> onto HW, especially when guest clear it, the HW bit should be cleared also.
> Otherwise processor can't generate PMI until the HW mask bit is cleared.
> 
> This commit add a function to get virtual LVTPC_MASK bit, so that

No "This commit", "This patch", or any other variation.  Please read through:

  Documentation/process/maintainer-tip.rst 
  Documentation/process/maintainer-kvm-x86.rst

> it can be set onto HW later.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> ---
>  arch/x86/kvm/lapic.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index e30641d5ac90..dafae44325d1 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -277,4 +277,10 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
>  {
>  	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
>  }
> +
> +static inline bool kvm_lapic_get_lvtpc_mask(struct kvm_vcpu *vcpu)
> +{
> +	return lapic_in_kernel(vcpu) &&
> +	       (kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC) & APIC_LVT_MASKED);
> +}

As suggested in the previous patch, I'm pretty sure we can safely omit this
helper.

>  #endif
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 09/41] perf: core/x86: Forbid PMI handler when guest own PMU
  2024-01-26  8:54 ` [RFC PATCH 09/41] perf: core/x86: Forbid PMI handler when guest own PMU Xiong Zhang
@ 2024-04-11 19:26   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:26 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> +	/*
> +	 * When PMU is pass-through into guest, this handler should be forbidden from
> +	 * running, the reasons are:
> +	 * 1. After perf_guest_switch_to_kvm_pmi_vector() is called, and before cpu
> +	 *    enter into non-root mode, NMI could happen, but x86_pmu_handle_irq()
> +	 *    restore PMU to use NMI vector, which destroy KVM PMI vector setting.
> +	 * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
> +	 *    call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
> +	 *    guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
> +	 *    clear global_status MSR which has guest status now, then this destroy
> +	 *    guest PMU status.
> +	 * 3. After VM exit, but before KVM save guest PMU context, host NMI other
> +	 *    than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
> +	 *    which has guest status now, then this destroy guest PMU status.
> +	 */
> +	if (perf_is_in_guest_passthrough())

Maybe a name more along the lines of:

	if (perf_is_guest_context_loaded())

because that makes it more obvious that the NMI _can't_ belong to the host PMU.

For that matter, I would also rename __perf_force_exclude_guest to
perf_guest_context_loaded (or "active" if that's better).  The boolean tracks
the state (guest vs. host context loaded/active), whereas forcing perf events
to exclude_guest is an action based on that state.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 17:46         ` Sean Christopherson
  2024-04-11 19:13           ` Liang, Kan
@ 2024-04-11 19:32           ` Sean Christopherson
  1 sibling, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:32 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Kan Liang, Xiong Zhang, pbonzini, peterz, mizhang, kan.liang,
	zhenyuw, dapeng1.mi, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Thu, Apr 11, 2024, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Jim Mattson wrote:
> > On Thu, Apr 11, 2024 at 10:21 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> > > On 2024-04-11 1:04 p.m., Sean Christopherson wrote:
> > > > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > >> From: Kan Liang <kan.liang@linux.intel.com>
> > > >>
> > > >> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
> > > >> and later PMUs
> > > >
> > > > Why?  I get that is an RFC, but it's not at all obvious to me why this needs to
> > > > take a dependency on v4+.
> > >
> > > The IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs are introduced in v4. They
> > > are used in the save/restore of PMU state. Please see PATCH 23/41.
> > > So it's limited to v4+ for now.
> > 
> > Prior to version 4, semi-passthrough is possible, but IA32_PERF_GLOBAL_STATUS
> > has to be intercepted and emulated, since it is non-trivial to set bits in
> > this MSR.
> 
> Ah, then this _perf_ capability should be PERF_PMU_CAP_WRITABLE_GLOBAL_STATUS or

And now I see that the capabilities are arch agnostic, whereas GLOBAL_STATUS
obviously is not.  Unless a writable GLOBAL_STATUS is a hard requirement for perf
to be able to support a mediated PMU, this capability probably doesn't need to
exist, e.g. KVM can check for a writable GLOBAL_STATUS just as easily as perf
(or perf can stuff x86_pmu_capability.writable_global_status directly).

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler
  2024-04-11 19:17   ` Sean Christopherson
@ 2024-04-11 19:34     ` Sean Christopherson
  2024-04-12  6:03       ` Zhang, Xiong Y
  2024-04-12  5:57     ` Zhang, Xiong Y
  1 sibling, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 19:34 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > From: Xiong Zhang <xiong.y.zhang@intel.com>
> > 
> > Add function to switch PMI handler since passthrough PMU and host PMU will
> > use different interrupt vectors.
> > 
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/events/core.c            | 15 +++++++++++++++
> >  arch/x86/include/asm/perf_event.h |  3 +++
> >  2 files changed, 18 insertions(+)
> > 
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index 40ad1425ffa2..3f87894d8c8e 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -701,6 +701,21 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
> >  }
> >  EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
> >  
> > +void perf_guest_switch_to_host_pmi_vector(void)
> > +{
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	apic_write(APIC_LVTPC, APIC_DM_NMI);
> > +}
> > +EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
> > +
> > +void perf_guest_switch_to_kvm_pmi_vector(void)
> > +{
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
> > +}
> > +EXPORT_SYMBOL_GPL(perf_guest_switch_to_kvm_pmi_vector);
> 
> Why slice and dice the context switch if it's all in perf?  Just do this in
> perf_guest_enter().  

Ah, because perf_guest_enter() isn't x86-specific.

That can be solved by having the exported APIs be arch specific, e.g.
x86_perf_guest_enter(), and making perf_guest_enter() a perf-internal API.

That has the advantage of making it impossible to call perf_guest_enter() on an
unsupported architecture (modulo perf bugs).
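
Roughly (sketch only, folding in the LVTPC suggestion from the other reply;
the names are the ones proposed in this thread, not existing APIs):

void x86_perf_guest_enter(u32 guest_lvtpc)
{
	lockdep_assert_irqs_disabled();

	perf_guest_enter();		/* now perf-internal, not exported */

	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
			       (guest_lvtpc & APIC_LVT_MASKED));
}
EXPORT_SYMBOL_GPL(x86_perf_guest_enter);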

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-11 18:06   ` Sean Christopherson
@ 2024-04-11 19:53     ` Liang, Kan
  2024-04-12 19:17       ` Sean Christopherson
  2024-04-26  4:09       ` Zhang, Xiong Y
  0 siblings, 2 replies; 181+ messages in thread
From: Liang, Kan @ 2024-04-11 19:53 UTC (permalink / raw)
  To: Sean Christopherson, Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024-04-11 2:06 p.m., Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 683dc086ef10..59471eeec7e4 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -3803,6 +3803,8 @@ static inline void group_update_userpage(struct perf_event *group_event)
>>  		event_update_userpage(event);
>>  }
>>  
>> +static DEFINE_PER_CPU(bool, __perf_force_exclude_guest);
>> +
>>  static int merge_sched_in(struct perf_event *event, void *data)
>>  {
>>  	struct perf_event_context *ctx = event->ctx;
>> @@ -3814,6 +3816,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  	if (!event_filter_match(event))
>>  		return 0;
>>  
>> +	/*
>> +	 * The __perf_force_exclude_guest indicates entering the guest.
>> +	 * No events of the passthrough PMU should be scheduled.
>> +	 */
>> +	if (__this_cpu_read(__perf_force_exclude_guest) &&
>> +	    has_vpmu_passthrough_cap(event->pmu))
> 
> As mentioned in the previous reply, I think perf should WARN and reject any attempt
> to trigger a "passthrough" context switch if such a switch isn't supported by
> perf, not silently let it go through and then skip things later.

perf supports many PMUs. The core PMU is one of them. Only the core PMU
supports "passthrough", and will do the "passthrough" context switch if
there are active events.
Other PMUs should not be impacted. The "passthrough" context
switch should be transparent to them.

This check is to reject an existing host event in the schedule stage. If a
"passthrough" guest is running, perf should reject any existing host
events of the "passthrough"-capable PMU.

> 
>> +		return 0;
>> +
>>  	if (group_can_go_on(event, *can_add_hw)) {
>>  		if (!group_sched_in(event, ctx))
>>  			list_add_tail(&event->active_list, get_event_list(event));
> 
> ...
> 
>> +/*
>> + * When a guest enters, force all active events of the PMU, which supports
>> + * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
>> + * PMUs, such as uncore PMU, should not be impacted. The guest can
>> + * temporarily own all counters of the PMU.
>> + * During the period, all the creation of the new event of the PMU with
>> + * !exclude_guest are error out.
>> + */
>> +void perf_guest_enter(void)
>> +{
>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>> +
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	if (__this_cpu_read(__perf_force_exclude_guest))
> 
> This should be a WARN_ON_ONCE, no?

To debug the improper behavior of KVM?
I guess yes.

> 
>> +		return;
>> +
>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +	perf_force_exclude_guest_enter(&cpuctx->ctx);
>> +	if (cpuctx->task_ctx)
>> +		perf_force_exclude_guest_enter(cpuctx->task_ctx);
>> +
>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +
>> +	__this_cpu_write(__perf_force_exclude_guest, true);
>> +}
>> +EXPORT_SYMBOL_GPL(perf_guest_enter);
>> +
>> +static void perf_force_exclude_guest_exit(struct perf_event_context *ctx)
>> +{
>> +	struct perf_event_pmu_context *pmu_ctx;
>> +	struct pmu *pmu;
>> +
>> +	update_context_time(ctx);
>> +	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> +		pmu = pmu_ctx->pmu;
>> +		if (!has_vpmu_passthrough_cap(pmu))
>> +			continue;
> 
> I don't see how we can sanely support a CPU that doesn't support writable
> PERF_GLOBAL_STATUS across all PMUs.

Only the core PMU has PERF_GLOBAL_STATUS. Other PMUs, e.g., the uncore PMU,
aren't impacted by the MSR, so it should simply be ignored for them.

> 
>> +
>> +		perf_pmu_disable(pmu);
>> +		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
>> +		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
>> +		perf_pmu_enable(pmu);
>> +	}
>> +}
>> +
>> +void perf_guest_exit(void)
>> +{
>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>> +
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	if (!__this_cpu_read(__perf_force_exclude_guest))
> 
> WARN_ON_ONCE here too?
> 
>> +		return;
>> +
>> +	__this_cpu_write(__perf_force_exclude_guest, false);
>> +
>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +	perf_force_exclude_guest_exit(&cpuctx->ctx);
>> +	if (cpuctx->task_ctx)
>> +		perf_force_exclude_guest_exit(cpuctx->task_ctx);
>> +
>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +}
>> +EXPORT_SYMBOL_GPL(perf_guest_exit);
>> +
>> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
>> +						 int cpu, struct task_struct *task)
>> +{
>> +	bool *force_exclude_guest = NULL;
>> +
>> +	if (!has_vpmu_passthrough_cap(event->pmu))
>> +		return 0;
>> +
>> +	if (event->attr.exclude_guest)
>> +		return 0;
>> +
>> +	if (cpu != -1) {
>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
>> +	} else if (task && (task->flags & PF_VCPU)) {
>> +		/*
>> +		 * Just need to check the running CPU in the event creation. If the
>> +		 * task is moved to another CPU which supports the force_exclude_guest.
>> +		 * The event will filtered out and be moved to the error stage. See
>> +		 * merge_sched_in().
>> +		 */
>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
>> +	}
> 
> These checks are extremely racy, I don't see how this can possibly do the
> right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
> to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
> include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
> time slices).
>

This is to reject the creation of a !exclude_guest event from the host perf
tool while a "passthrough" guest is running.
Could you please suggest a way to detect that via the struct task_struct?


> Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
> perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
> per-CPU context mutex.

Do you mean that the perf_guest_enter() check could happen right
after the perf_force_exclude_guest_check()?
It's possible. In that case, the event can still be created. It will be
treated as an existing event and handled in merge_sched_in(). It will
never be scheduled when a guest is running.

The perf_force_exclude_guest_check() is to make sure most of the cases
are rejected at creation time. The remaining corner cases will be
rejected in the schedule stage.

> 
>> +	if (force_exclude_guest && *force_exclude_guest)
>> +		return -EBUSY;
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Holding the top-level event's child_mutex means that any
>>   * descendant process that has inherited this event will block
>> @@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>  		goto err_ns;
>>  	}
>>  
>> +	if (perf_force_exclude_guest_check(event, cpu, task)) {
> 
> This should be:
> 
> 	err = perf_force_exclude_guest_check(event, cpu, task);
> 	if (err)
> 		goto err_pmu;
> 
> i.e. shouldn't effectively ignore/override the return result.
>

Sure.

Thanks,
Kan

>> +		err = -EBUSY;
>> +		goto err_pmu;
>> +	}
>> +
>>  	/*
>>  	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
>>  	 * events (they don't make sense as the cgroup will be different
>> -- 
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 19:13           ` Liang, Kan
@ 2024-04-11 20:43             ` Sean Christopherson
  2024-04-11 21:04               ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 20:43 UTC (permalink / raw)
  To: Kan Liang
  Cc: Jim Mattson, Xiong Zhang, pbonzini, peterz, mizhang, kan.liang,
	zhenyuw, dapeng1.mi, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Thu, Apr 11, 2024, Kan Liang wrote:
> On 2024-04-11 1:46 p.m., Sean Christopherson wrote:
> > On Thu, Apr 11, 2024, Jim Mattson wrote:
> >> On Thu, Apr 11, 2024 at 10:21 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>> On 2024-04-11 1:04 p.m., Sean Christopherson wrote:
> >>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>
> >>>>> Define and apply the PERF_PMU_CAP_VPMU_PASSTHROUGH flag for the version 4
> >>>>> and later PMUs
> >>>>
> >>>> Why?  I get that is an RFC, but it's not at all obvious to me why this needs to
> >>>> take a dependency on v4+.
> >>>
> >>> The IA32_PERF_GLOBAL_STATUS_RESET/SET MSRs are introduced in v4. They
> >>> are used in the save/restore of PMU state. Please see PATCH 23/41.
> >>> So it's limited to v4+ for now.
> >>
> >> Prior to version 4, semi-passthrough is possible, but IA32_PERF_GLOBAL_STATUS
> >> has to be intercepted and emulated, since it is non-trivial to set bits in
> >> this MSR.
> > 
> > Ah, then this _perf_ capability should be PERF_PMU_CAP_WRITABLE_GLOBAL_STATUS or
> > so, especially since it's introduced in advance of the KVM side of things.  Then
> > whether or not to support a mediated PMU becomes a KVM decision, e.g. intercepting
> > accesses to IA32_PERF_GLOBAL_STATUS doesn't seem like a complete deal breaker
> > (or maybe it is, I now see the comment about it being used to do the context switch).
> 
> The PERF_PMU_CAP_VPMU_PASSTHROUGH is to indicate whether the PMU has the
> capability to support passthrough mode. It's used to distinguish it from
> other PMUs, e.g., the uncore PMU.

Ah, the changelog blurb about SW/uncore PMUs finally clicked.

> Regarding the PERF_PMU_CAP_WRITABLE_GLOBAL_STATUS, I think perf already
> passes the x86_pmu.version to KVM. Maybe KVM can add an internal flag to
> track it, so a PERF_PMU_CAP_ bit can be saved?

Yeah, I think that's totally fine.  At some point, KVM is going to need to know
that GLOBAL_STATUS is writable if PMU.version >= 4, e.g. to correctly emulate
guest accesses, so I don't see any reason to bury that logic in perf.
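
I.e. nothing more than a derived flag on the KVM side, e.g. (hypothetical
flag name, set from kvm_init_pmu_capability()):

	/* GLOBAL_STATUS_{SET,RESET} only exist on v4+ */
	kvm_pmu_has_writable_global_status = kvm_pmu_cap.version >= 4;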

> > And peeking ahead, IIUC perf effectively _forces_ a passthrough model when
> > has_vpmu_passthrough_cap() is true, which is wrong.  There needs to be a user/admin
> > opt-in (or opt-out) to that behavior, at a kernel/perf level, not just at a KVM
> > level.  Hmm, or is perf relying on KVM to do the right thing?  I.e. relying on
> > KVM to do perf_guest_{enter,exit}() if and only if the PMU can support the
> > passthrough model.
> >
> 
> Yes, perf relies on KVM to tell if a guest is entering the passthrough mode.
> 
> > If that's the case, most of the has_vpmu_passthrough_cap() checks are gratuitous
> > and confusing, e.g. just WARN if KVM (or some other module) tries to trigger a
> > PMU context switch when it's not supported by perf.
> 
> If only non-supported PMUs are running in the host, perf wouldn't
> do any context switch. The guest can feel free to use the core PMU. We
> should not WARN for this case.

I'm struggling to wrap my head around this.  If there is no supported PMU in the
host, how can there be a core PMU for the guest to use?  KVM virtualizes a PMU
if and only if kvm_init_pmu_capability() reports a compatible PMU, and IIUC that
reporting is done based on the core PMU.

Specifically, I want to ensure we don't screw up passing through PMU MSR access,
e.g. because KVM thinks perf will context switch those MSRs, but perf doesn't
because perf doesn't think the relevant PMU supports a mediated/passthrough mode.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance
  2024-01-26  8:54 ` [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance Xiong Zhang
@ 2024-04-11 20:54   ` Sean Christopherson
  2024-04-11 21:03   ` Sean Christopherson
  1 sibling, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 20:54 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> Finally, always propagate enable_passthrough_pmu and perf_capabilities into
> kvm->arch for each KVM instance.

Why?

arch.enable_passthrough_pmu is simply "arch.enable_pmu && enable_passthrough_pmu",
I don't see any reason to cache that information on a per-VM basis.  Blech, it's
also cached in vcpu->pmu.passthrough, which is even more complexity that doesn't
add any value.

E.g. code that is reachable iff the VM/vCPU has a PMU can simply check the module
param.  And if we commit to that model (all or nothing), then we can probably
end up with cleaner code overall because we bifurcate everything at a module
level, e.g. even use static_call() if we had reason to.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 12/41] KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
  2024-01-26  8:54 ` [RFC PATCH 12/41] KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs Xiong Zhang
@ 2024-04-11 20:57   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 20:57 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Plumb through passthrough PMU setting from kvm->arch into kvm_pmu on each
> vcpu created. Note that enabling PMU is decided by VMM when it sets the
> CPUID bits exposed to guest VM. So plumb through the enabling for each pmu
> in intel_pmu_refresh().

As stated in the previous patch, even the most naive implementation can be:

static inline bool is_passthrough_pmu_enabled(struct kvm_vcpu *vcpu)
{
	return enable_passthrough_pmu && vcpu_to_pmu(vcpu)->version;
}

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance
  2024-01-26  8:54 ` [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance Xiong Zhang
  2024-04-11 20:54   ` Sean Christopherson
@ 2024-04-11 21:03   ` Sean Christopherson
  1 sibling, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:03 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4432e736129f..074452aa700d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -193,6 +193,11 @@ bool __read_mostly enable_pmu = true;
>  EXPORT_SYMBOL_GPL(enable_pmu);
>  module_param(enable_pmu, bool, 0444);
>  
> +/* Enable/disable PMU virtualization */

Heh, copy+paste fail.  Just omit a comment, it's pretty self-explanatory.

> +bool __read_mostly enable_passthrough_pmu = true;
> +EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
> +module_param(enable_passthrough_pmu, bool, 0444);

Almost forgot.  Two things:

 1. KVM should not enable the passthrough/mediated PMU by default until it has
    reached feature parity with the existing PMU, because otherwise we are
    essentially breaking userspace.  And if for some reason the passthrough PMU
    *can't* reach feature parity, then (a) that's super interesting, and (b) we
    need a more explicit/deliberate transition plan.

 2. The module param absolutely must not be exposed to userspace until all patches
    are in place.  The easiest way to do that without creating dependency hell is
    to simply not create the module param.

I.e. this patch should do _only_

bool __read_mostly enable_passthrough_pmu;
EXPORT_SYMBOL_GPL(enable_passthrough_pmu);

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  2024-04-11 20:43             ` Sean Christopherson
@ 2024-04-11 21:04               ` Liang, Kan
  0 siblings, 0 replies; 181+ messages in thread
From: Liang, Kan @ 2024-04-11 21:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Xiong Zhang, pbonzini, peterz, mizhang, kan.liang,
	zhenyuw, dapeng1.mi, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-11 4:43 p.m., Sean Christopherson wrote:
>>> And peeking ahead, IIUC perf effectively _forces_ a passthrough model when
>>> has_vpmu_passthrough_cap() is true, which is wrong.  There needs to be a user/admin
>>> opt-in (or opt-out) to that behavior, at a kernel/perf level, not just at a KVM
>>> level.  Hmm, or is perf relying on KVM to do the right thing?  I.e. relying on
>>> KVM to do perf_guest_{enter,exit}() if and only if the PMU can support the
>>> passthrough model.
>>>
>> Yes, perf relies on KVM to tell if a guest is entering the passthrough mode.
>>
>>> If that's the case, most of the has_vpmu_passthrough_cap() checks are gratuitous
>>> and confusing, e.g. just WARN if KVM (or some other module) tries to trigger a
>>> PMU context switch when it's not supported by perf.
>> If only non-supported PMUs are running in the host, perf wouldn't
>> do any context switch. The guest can feel free to use the core PMU. We
>> should not WARN for this case.
> I'm struggling to wrap my head around this.  If there is no supported PMU in the
> host, how can there be a core PMU for the guest to use?  KVM virtualizes a PMU
> if and only if kvm_init_pmu_capability() reports a compatible PMU, and IIUC that
> reporting is done based on the core PMU.
> 
> Specifically, I want to ensure we don't screw up passing through PMU MSR access,
> e.g. because KVM thinks perf will context switch those MSRs, but perf doesn't

Perf only context switches the MSRs of the PMU with the
PERF_PMU_CAP_VPMU_PASSTHROUGH flag. (Only the core PMU for this RFC).

For other PMUs without the PERF_PMU_CAP_VPMU_PASSTHROUGH, perf does
nothing in perf_guest_enter/exit().

KVM can rely on the flag to decide whether to enable the passthrough
mode for the PMU.

Thanks,
Kan


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-01-26  8:54 ` [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Xiong Zhang
@ 2024-04-11 21:21   ` Sean Christopherson
  2024-04-11 22:30     ` Jim Mattson
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:21 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> +	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		/*
> +		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
> +		 */
> +		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
> +		else {
> +			i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest,
> +						       MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = vmx->msr_autoload.guest.nr++;
> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT,
> +					     vmx->msr_autoload.guest.nr);
> +			}
> +			vmx->msr_autoload.guest.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			vmx->msr_autoload.guest.val[i].value = 0;

Eww, no.   Just make cpu_has_load_perf_global_ctrl() and VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL
hard requirements for enabling passthrough mode.  And then have clear_atomic_switch_msr()
yell if KVM tries to disable loading MSR_CORE_PERF_GLOBAL_CTRL.
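
I.e. something along these lines in hardware_setup() (sketch; the exit
control name is as referenced above and may not match the final #define):

	/* mediated PMU requires VMCS-based PERF_GLOBAL_CTRL switching */
	if (!cpu_has_load_perf_global_ctrl() ||
	    !(vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL))
		enable_passthrough_pmu = false;

plus a WARN_ON_ONCE(enable_passthrough_pmu) under the MSR_CORE_PERF_GLOBAL_CTRL
case in clear_atomic_switch_msr().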

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities
  2024-01-26  8:54 ` [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities Xiong Zhang
@ 2024-04-11 21:23   ` Sean Christopherson
  2024-04-11 21:50     ` Jim Mattson
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:23 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Intercept full-width GP counter MSRs in passthrough PMU if guest does not
> have the capability to write in full-width. In addition, opportunistically
> add a warning if non-full-width counter MSRs are also intercepted, in which
> case it is a clear mistake.
> 
> Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/pmu_intel.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 7f6cabb2c378..49df154fbb5b 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -429,6 +429,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	default:
>  		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>  		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
> +			if (is_passthrough_pmu_enabled(vcpu) &&
> +			    !(msr & MSR_PMC_FULL_WIDTH_BIT) &&
> +			    !msr_info->host_initiated) {
> +				pr_warn_once("passthrough PMU never intercepts non-full-width PMU counters\n");
> +				return 1;

This is broken, KVM must be prepared to handle WRMSR (and RDMSR and RDPMC) that
come in through the emulator.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-01-26  8:54 ` [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Xiong Zhang
@ 2024-04-11 21:26   ` Sean Christopherson
  2024-04-13  2:29     ` Mi, Dapeng
  2024-04-11 21:44   ` Sean Christopherson
  1 sibling, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:26 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
>  static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	u32 i;
> +
> +	if (pmu->version != 2) {
> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> +		return;
> +	}
> +
> +	/* Global ctrl register is already saved at VM-exit. */
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> +	if (pmu->global_status)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		pmc = &pmu->gp_counters[i];
> +		rdpmcl(i, pmc->counter);
> +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> +		/*
> +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> +		 * leakage and also avoid this guest GP counter get accidentally
> +		 * enabled during host running when host enable global ctrl.
> +		 */
> +		if (pmc->eventsel)
> +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> +		if (pmc->counter)
> +			wrmsrl(MSR_IA32_PMC0 + i, 0);
> +	}
> +
> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +	/*
> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> +	 * also avoid these guest fixed counters get accidentially enabled
> +	 * during host running when host enable global ctrl.
> +	 */
> +	if (pmu->fixed_ctr_ctrl)
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = &pmu->fixed_counters[i];
> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> +		if (pmc->counter)
> +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> +	}

For the next RFC, please make sure that it includes AMD support.  Mostly because
I'm pretty sure all of this code can be in common x86.  The fixed counters are ugly,
but pmu->nr_arch_fixed_counters is guaranteed to be '0' on AMD, so it's _just_ ugly,
i.e. not functionally problematic. 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage
  2024-01-26  8:54 ` [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage Xiong Zhang
@ 2024-04-11 21:36   ` Sean Christopherson
  2024-04-11 21:56     ` Jim Mattson
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:36 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Zero out unexposed counters/selectors because even though KVM intercepts
> all accesses to unexposed PMU MSRs, it does pass through RDPMC instruction
> which allows guest to read all GP counters and fixed counters. So, zero out
> unexposed counter values which might contain critical information for the
> host.

This belongs in the previous patch, it's effectively a bug fix.  I appreciate
the push for finer granularity, but introducing a blatant bug and then immediately
fixing it goes too far.

> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index f79bebe7093d..4b4da7f17895 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -895,11 +895,27 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>  		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>  	}
>  
> +	/*
> +	 * Zero out unexposed GP counters/selectors to avoid information leakage
> +	 * since passthrough PMU does not intercept RDPMC.

Zeroing the selectors is unnecessary.  KVM still intercepts MSR_CORE_PERF_GLOBAL_CTRL,
so just ensure the PMCs that aren't exposed to the guest are never globally enabled.

> +	 */
> +	for (i = pmu->nr_arch_gp_counters; i < kvm_pmu_cap.num_counters_gp; i++) {
> +		wrmsrl(MSR_IA32_PMC0 + i, 0);
> +		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> +	}
> +
>  	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>  	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>  		pmc = &pmu->fixed_counters[i];
>  		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>  	}
> +
> +	/*
> +	 * Zero out unexposed fixed counters to avoid information leakage
> +	 * since passthrough PMU does not intercept RDPMC.

I would call out that RDPMC interception is all or nothing, i.e. KVM can't
selectively intercept _some_ PMCs, and the MSR bitmaps don't apply to RDPMC.

> +	 */
> +	for (i = pmu->nr_arch_fixed_counters; i < kvm_pmu_cap.num_counters_fixed; i++)
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>  }
>  
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-01-26  8:54 ` [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Xiong Zhang
  2024-04-11 21:26   ` Sean Christopherson
@ 2024-04-11 21:44   ` Sean Christopherson
  2024-04-11 22:19     ` Jim Mattson
  2024-04-13  3:03     ` Mi, Dapeng
  1 sibling, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:44 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Implement the save/restore of PMU state for passthrough PMU in Intel. In
> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> and gains the full HW PMU ownership. On the contrary, host regains the
> ownership of PMU HW from KVM when control flow leaves the scope of
> passthrough PMU.
> 
> Implement PMU context switches for Intel CPUs and opportunistically use
> rdpmcl() instead of rdmsrl() when reading counters since the former has
> lower latency in Intel CPUs.
> 
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 0d58fe7d243e..f79bebe7093d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>  
>  static void intel_save_pmu_context(struct kvm_vcpu *vcpu)

I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().

>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	u32 i;
> +
> +	if (pmu->version != 2) {
> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> +		return;
> +	}
> +
> +	/* Global ctrl register is already saved at VM-exit. */
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> +	if (pmu->global_status)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		pmc = &pmu->gp_counters[i];
> +		rdpmcl(i, pmc->counter);
> +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> +		/*
> +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> +		 * leakage and also avoid this guest GP counter get accidentally
> +		 * enabled during host running when host enable global ctrl.
> +		 */
> +		if (pmc->eventsel)
> +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> +		if (pmc->counter)
> +			wrmsrl(MSR_IA32_PMC0 + i, 0);

This doesn't make much sense.  The kernel already has full access to the guest,
I don't see what is gained by zeroing out the MSRs just to hide them from perf.

Similarly, if perf enables a counter in PERF_GLOBAL_CTRL without first restoring
the event selector, we gots problems.

Same thing for the fixed counters below.  Can't this just be?

	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
		rdpmcl(i, pmu->gp_counters[i].counter);

	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
		       pmu->fixed_counters[i].counter);

> +	}
> +
> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +	/*
> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> +	 * also avoid these guest fixed counters get accidentially enabled
> +	 * during host running when host enable global ctrl.
> +	 */
> +	if (pmu->fixed_ctr_ctrl)
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = &pmu->fixed_counters[i];
> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> +		if (pmc->counter)
> +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> +	}
>  }
>  
>  static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	u64 global_status;
> +	int i;
> +
> +	if (pmu->version != 2) {
> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> +		return;
> +	}
> +
> +	/* Clear host global_ctrl and global_status MSR if non-zero. */
> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?

> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> +	if (global_status)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);

This seems especially silly, isn't the full MSR being written below?  Or am I
misunderstanding how these things work?

> +	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		pmc = &pmu->gp_counters[i];
> +		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> +		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> +	}
> +
> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = &pmu->fixed_counters[i];
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> +	}
>  }
>  
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 26/41] KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU capability
  2024-01-26  8:54 ` [RFC PATCH 26/41] KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU capability Xiong Zhang
@ 2024-04-11 21:49   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:49 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Add host_perf_cap field in kvm_caps to record host PMU capability. This
> helps KVM recognize the PMU capability difference between host and guest.
> This awareness improves performance in PMU context switch. In particular,
> KVM will need to zero out all MSRs that the guest PMU does not use but the
> host PMU does use. Having the host PMU feature set cached in host_perf_cap in
> the kvm_caps structure saves an rdmsrl() of the IA32_PERF_CAPABILITIES MSR on
> each PMU context switch. In addition, this is a more convenient approach than
> opening another API on the host perf subsystem side.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 17 +++++++++--------
>  arch/x86/kvm/x86.h     |  1 +
>  2 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 349954f90fe9..50100954cd92 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7896,32 +7896,33 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vmx_update_exception_bitmap(vcpu);
>  }
>  
> -static u64 vmx_get_perf_capabilities(void)
> +static void vmx_get_perf_capabilities(void)
>  {
>  	u64 perf_cap = PMU_CAP_FW_WRITES;
>  	struct x86_pmu_lbr lbr;
> -	u64 host_perf_cap = 0;
> +
> +	kvm_caps.host_perf_cap = 0;
>  
>  	if (!enable_pmu)
> -		return 0;
> +		return;
>  
>  	if (boot_cpu_has(X86_FEATURE_PDCM))
> -		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
> +		rdmsrl(MSR_IA32_PERF_CAPABILITIES, kvm_caps.host_perf_cap);

I would strongly prefer KVM snapshot the host's MSR_IA32_PERF_CAPABILITIES, if
the CPU has PDCM, i.e. not leave it zero if the PMU is disabled.
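
I.e. something like this (sketch, simply reordering the code above):

	kvm_caps.host_perf_cap = 0;
	if (boot_cpu_has(X86_FEATURE_PDCM))
		rdmsrl(MSR_IA32_PERF_CAPABILITIES, kvm_caps.host_perf_cap);

	if (!enable_pmu)
		return;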

>  
>  	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
>  	    !enable_passthrough_pmu) {
>  		x86_perf_get_lbr(&lbr);
>  		if (lbr.nr)
> -			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
> +			perf_cap |= kvm_caps.host_perf_cap & PMU_CAP_LBR_FMT;
>  	}
>  
>  	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
> -		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
> +		perf_cap |= kvm_caps.host_perf_cap & PERF_CAP_PEBS_MASK;
>  		if ((perf_cap & PERF_CAP_PEBS_FORMAT) < 4)
>  			perf_cap &= ~PERF_CAP_PEBS_BASELINE;
>  	}
>  
> -	return perf_cap;
> +	kvm_caps.supported_perf_cap = perf_cap;
>  }
>  
>  static __init void vmx_set_cpu_caps(void)
> @@ -7946,7 +7947,7 @@ static __init void vmx_set_cpu_caps(void)
>  
>  	if (!enable_pmu)
>  		kvm_cpu_cap_clear(X86_FEATURE_PDCM);
> -	kvm_caps.supported_perf_cap = vmx_get_perf_capabilities();
> +	vmx_get_perf_capabilities();
>  
>  	if (!enable_sgx) {
>  		kvm_cpu_cap_clear(X86_FEATURE_SGX);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 38b73e98eae9..a29eb0469d7e 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -28,6 +28,7 @@ struct kvm_caps {
>  	u64 supported_mce_cap;
>  	u64 supported_xcr0;
>  	u64 supported_xss;
> +	u64 host_perf_cap;
>  	u64 supported_perf_cap;

This is confusing, host_perf_cap doesn't track "capabilities" so much as it tracks
a raw host value.  Luckily, I have a series that I am going to post this week
that adds another struct for tracking host values, e.g. host_xss, host_efer, etc.

>  };
>  
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest
  2024-01-26  8:54 ` [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest Xiong Zhang
@ 2024-04-11 21:50   ` Sean Christopherson
  2024-04-13  3:29     ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:50 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Since the perf topdown metrics feature is not supported yet, clear the
> PERF_METRICS MSR for the guest.

Please rewrite with --verbose, I have no idea what MSR_PERF_METRICS is, and thus no
clue why it needs to be zeroed when loading guest context, e.g. it's not passed
through, so why does it matter?

> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/pmu_intel.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 4b4da7f17895..ad0434646a29 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -916,6 +916,10 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>  	 */
>  	for (i = pmu->nr_arch_fixed_counters; i < kvm_pmu_cap.num_counters_fixed; i++)
>  		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> +
> +	/* Clear PERF_METRICS MSR since guest topdown metrics is not supported yet. */
> +	if (kvm_caps.host_perf_cap & PMU_CAP_PERF_METRICS)
> +		wrmsrl(MSR_PERF_METRICS, 0);
>  }
>  
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities
  2024-04-11 21:23   ` Sean Christopherson
@ 2024-04-11 21:50     ` Jim Mattson
  2024-04-12 16:01       ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 21:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > From: Mingwei Zhang <mizhang@google.com>
> >
> > Intercept full-width GP counter MSRs in passthrough PMU if guest does not
> > have the capability to write in full-width. In addition, opportunistically
> > add a warning if non-full-width counter MSRs are also intercepted, in which
> > case it is a clear mistake.
> >
> > Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/vmx/pmu_intel.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index 7f6cabb2c378..49df154fbb5b 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -429,6 +429,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >       default:
> >               if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
> >                   (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
> > +                     if (is_passthrough_pmu_enabled(vcpu) &&
> > +                         !(msr & MSR_PMC_FULL_WIDTH_BIT) &&
> > +                         !msr_info->host_initiated) {
> > +                             pr_warn_once("passthrough PMU never intercepts non-full-width PMU counters\n");
> > +                             return 1;
>
> This is broken, KVM must be prepared to handle WRMSR (and RDMSR and RDPMC) that
> come in through the emulator.

Don't tell me that we are still supporting CPUs that don't have
"unrestricted guest"! Sigh.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-01-26  8:54 ` [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Xiong Zhang
@ 2024-04-11 21:54   ` Sean Christopherson
  2024-04-11 22:10     ` Jim Mattson
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:54 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> +	int i;
> +
> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> +	} else {
> +		i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest,
> +						MSR_CORE_PERF_GLOBAL_CTRL);
> +		if (i < 0)
> +			return;
> +		pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;

As before, NAK to using the MSR load/store lists unless there's a *really* good
reason I'm missing.

And we should consider adding VCPU_EXREG_GLOBAL_CTRL so that we can defer the
VMREAD until KVM actually tries to access the guest value.
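
Rough sketch of what that deferral could look like (VCPU_EXREG_GLOBAL_CTRL and
the helper name are made up for illustration):

	static u64 vmx_get_guest_perf_global_ctrl(struct kvm_vcpu *vcpu)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

		if (!kvm_register_is_available(vcpu, VCPU_EXREG_GLOBAL_CTRL)) {
			pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
			kvm_register_mark_available(vcpu, VCPU_EXREG_GLOBAL_CTRL);
		}
		return pmu->global_ctrl;
	}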

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 34/41] KVM: x86/pmu: Intercept EVENT_SELECT MSR
  2024-01-26  8:54 ` [RFC PATCH 34/41] KVM: x86/pmu: Intercept EVENT_SELECT MSR Xiong Zhang
@ 2024-04-11 21:55   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:55 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> Event selectors for GP counters are still intercepted for the purpose of
> security, i.e., preventing guest from using unallowed events to steal
> information or take advantages of any CPU errata.

Heh, so then they shouldn't have been passed through in the first place.

> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/pmu_intel.c | 1 -
>  arch/x86/kvm/vmx/vmx.c       | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 9bbd5084a766..621922005184 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -809,7 +809,6 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>  	int i;
>  
>  	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_gp_counters; i++) {
> -		vmx_set_intercept_for_msr(vcpu, MSR_ARCH_PERFMON_EVENTSEL0 + i, MSR_TYPE_RW, false);
>  		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, false);
>  		if (fw_writes_is_enabled(vcpu))
>  			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, false);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d28afa87be70..1a518800d154 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -698,7 +698,6 @@ static bool is_valid_passthrough_msr(u32 msr)
>  	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
>  	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>  		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> -	case MSR_ARCH_PERFMON_EVENTSEL0 ... MSR_ARCH_PERFMON_EVENTSEL0 + 7:
>  	case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
>  	case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
>  	case MSR_CORE_PERF_FIXED_CTR_CTRL:
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 36/41] KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR
  2024-01-26  8:54 ` [RFC PATCH 36/41] KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR Xiong Zhang
@ 2024-04-11 21:56   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 21:56 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@intel.com>
> 
> The fixed counter control MSR is still intercepted for the purpose of
> security, i.e., preventing the guest from using disallowed fixed counters
> to steal information or take advantage of any CPU errata.

Same comments as earlier patches.  Don't introduce bugs and then immediately fix
said bugs.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage
  2024-04-11 21:36   ` Sean Christopherson
@ 2024-04-11 21:56     ` Jim Mattson
  0 siblings, 0 replies; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 21:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 11, 2024 at 2:36 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > From: Mingwei Zhang <mizhang@google.com>
> >
> > Zero out unexposed counters/selectors because even though KVM intercepts
> > all accesses to unexposed PMU MSRs, it does pass through RDPMC instruction
> > which allows guest to read all GP counters and fixed counters. So, zero out
> > unexposed counter values which might contain critical information for the
> > host.
>
> This belongs in the previous patch, it's effectively a bug fix.  I appreciate
> the push for finer granularity, but introducing a blatant bug and then immediately
> fixing it goes too far.
>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index f79bebe7093d..4b4da7f17895 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -895,11 +895,27 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >               wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> >       }
> >
> > +     /*
> > +      * Zero out unexposed GP counters/selectors to avoid information leakage
> > +      * since passthrough PMU does not intercept RDPMC.
>
> Zeroing the selectors is unnecessary.  KVM still intercepts MSR_CORE_PERF_GLOBAL_CTRL,
> so just ensure the PMCs that aren't exposed to the guest are never globally enabled.
>
> > +      */
> > +     for (i = pmu->nr_arch_gp_counters; i < kvm_pmu_cap.num_counters_gp; i++) {
> > +             wrmsrl(MSR_IA32_PMC0 + i, 0);
> > +             wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> > +     }
> > +
> >       wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >       for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >               pmc = &pmu->fixed_counters[i];
> >               wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> >       }
> > +
> > +     /*
> > +      * Zero out unexposed fixed counters to avoid information leakage
> > +      * since passthrough PMU does not intercept RDPMC.
>
> I would call out that RDPMC interception is all or nothing, i.e. KVM can't
> selectively intercept _some_ PMCs, and the MSR bitmaps don't apply to RDPMC.

Yes. RDPMC must be intercepted, unless KVM knows that all possible
PMCs are passthrough. It must also be intercepted if
enable_vmware_backdoor is true.
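
As a rough sketch of the resulting policy (fragment only, field names follow
this RFC):

	/* Intercept RDPMC unless every HW PMC is visible to the guest. */
	exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
	if (is_passthrough_pmu_enabled(vcpu) && !enable_vmware_backdoor &&
	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);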

> > +      */
> > +     for (i = pmu->nr_arch_fixed_counters; i < kvm_pmu_cap.num_counters_fixed; i++)
> > +             wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> >  }
> >
> >  struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > --
> > 2.34.1
> >

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed
  2024-01-26  8:54 ` [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Xiong Zhang
@ 2024-04-11 22:03   ` Sean Christopherson
  2024-04-13  4:12     ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 22:03 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Allow writing to the fixed counter selector if the counter is exposed. If
> this fixed counter is filtered out, the counter won't be enabled in HW.
> 
> Since the passthrough PMU implements the context switch at the VM-Enter/Exit
> boundary, the guest value cannot be directly written to HW because the HW PMU
> is owned by the host. Introduce a new field, fixed_ctr_ctrl_hw, in kvm_pmu to
> cache the guest value, which will be assigned to HW at PMU context restore.
> 
> Since the passthrough PMU intercepts writes to the fixed counter selector,
> there is no need to read the value at PMU context save, but the fixed counter
> ctrl MSR and counters still need to be cleared when switching out to the host
> PMU.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/vmx/pmu_intel.c    | 28 ++++++++++++++++++++++++----
>  2 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fd1c69371dbf..b02688ed74f7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -527,6 +527,7 @@ struct kvm_pmu {
>  	unsigned nr_arch_fixed_counters;
>  	unsigned available_event_types;
>  	u64 fixed_ctr_ctrl;
> +	u64 fixed_ctr_ctrl_hw;
>  	u64 fixed_ctr_ctrl_mask;

Before introduce more fields, can someone please send a patch/series to rename
the _mask fields?  AFAIK, they all should be e.g. fixed_ctr_ctrl_rsvd, or something
to that effect.

Because I think we should avoid reinventing the naming wheel, and use "shadow"
instead of "hw", because KVM developers already know what "shadow" means.  But
"mask" also has very specific meaning for shadowed fields.  That, and "mask" is
a freaking awful name in the first place.

>  	u64 global_ctrl;
>  	u64 global_status;
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 713c2a7c7f07..93cfb86c1292 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -68,6 +68,25 @@ static int fixed_pmc_events[] = {
>  	[2] = PSEUDO_ARCH_REFERENCE_CYCLES,
>  };
>  
> +static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64 data)

We need to come up with shorter names, this ain't Java.  :-)  Heh, that can be
another argument for "mediated", it saves three characters.

And somewhat related, kernel style is <scope>_<blah>, i.e.

static void mediated_pmu_reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-04-11 21:54   ` Sean Christopherson
@ 2024-04-11 22:10     ` Jim Mattson
  2024-04-11 22:54       ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 22:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024 at 2:54 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> > +{
> > +     struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> > +     int i;
> > +
> > +     if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> > +             pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> > +     } else {
> > +             i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest,
> > +                                             MSR_CORE_PERF_GLOBAL_CTRL);
> > +             if (i < 0)
> > +                     return;
> > +             pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
>
> As before, NAK to using the MSR load/store lists unless there's a *really* good
> reason I'm missing.

The VM-exit control, "save IA32_PERF_GLOBAL_CTL," first appears in
Sapphire Rapids. I think that's a compelling reason.

> And we should consider adding VCPU_EXREG_GLOBAL_CTRL so that we can defer the
> VMREAD until KVM actually tries to access the guest value.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-11 21:44   ` Sean Christopherson
@ 2024-04-11 22:19     ` Jim Mattson
  2024-04-11 23:31       ` Sean Christopherson
  2024-04-13  3:03     ` Mi, Dapeng
  1 sibling, 1 reply; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 22:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 11, 2024 at 2:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >
> > Implement the save/restore of PMU state for passthrough PMU in Intel. In
> > passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> > the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> > and gains the full HW PMU ownership. On the contrary, host regains the
> > ownership of PMU HW from KVM when control flow leaves the scope of
> > passthrough PMU.
> >
> > Implement PMU context switches for Intel CPUs and opportunistically use
> > rdpmcl() instead of rdmsrl() when reading counters since the former has
> > lower latency in Intel CPUs.
> >
> > Co-developed-by: Mingwei Zhang <mizhang@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> >  arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 73 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index 0d58fe7d243e..f79bebe7093d 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >
> >  static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>
> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
>
> >  {
> > +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +     struct kvm_pmc *pmc;
> > +     u32 i;
> > +
> > +     if (pmu->version != 2) {
> > +             pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > +             return;
> > +     }
> > +
> > +     /* Global ctrl register is already saved at VM-exit. */
> > +     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> > +     /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> > +     if (pmu->global_status)
> > +             wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> > +
> > +     for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > +             pmc = &pmu->gp_counters[i];
> > +             rdpmcl(i, pmc->counter);
> > +             rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> > +             /*
> > +              * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> > +              * leakage and also avoid this guest GP counter get accidentally
> > +              * enabled during host running when host enable global ctrl.
> > +              */
> > +             if (pmc->eventsel)
> > +                     wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> > +             if (pmc->counter)
> > +                     wrmsrl(MSR_IA32_PMC0 + i, 0);
>
> This doesn't make much sense.  The kernel already has full access to the guest,
> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>
> Similarly, if perf enables a counter in PERF_GLOBAL_CTRL without first restoring
> the event selector, we gots problems.
>
> Same thing for the fixed counters below.  Can't this just be?
>
>         for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>                 rdpmcl(i, pmu->gp_counters[i].counter);
>
>         for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>                 rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
>                        pmu->fixed_counters[i].counter);
>
> > +     }
> > +
> > +     rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > +     /*
> > +      * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> > +      * also avoid these guest fixed counters get accidentially enabled
> > +      * during host running when host enable global ctrl.
> > +      */
> > +     if (pmu->fixed_ctr_ctrl)
> > +             wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> > +     for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > +             pmc = &pmu->fixed_counters[i];
> > +             rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> > +             if (pmc->counter)
> > +                     wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> > +     }
> >  }
> >
> >  static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >  {
> > +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +     struct kvm_pmc *pmc;
> > +     u64 global_status;
> > +     int i;
> > +
> > +     if (pmu->version != 2) {
> > +             pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > +             return;
> > +     }
> > +
> > +     /* Clear host global_ctrl and global_status MSR if non-zero. */
> > +     wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>
> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>
> > +     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > +     if (global_status)
> > +             wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>
> This seems especially silly, isn't the full MSR being written below?  Or am I
> misunderstanding how these things work?

LOL! You expect CPU design to follow basic logic?!?

Writing a 1 to a bit in IA32_PERF_GLOBAL_STATUS_SET sets the
corresponding bit in IA32_PERF_GLOBAL_STATUS to 1.

Writing a 0 to a bit in to IA32_PERF_GLOBAL_STATUS_SET is a nop.

To clear a bit in IA32_PERF_GLOBAL_STATUS, you need to write a 1 to
the corresponding bit in IA32_PERF_GLOBAL_STATUS_RESET (aka
IA32_PERF_GLOBAL_OVF_CTRL).
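
So restoring a saved guest GLOBAL_STATUS really is a clear-then-set dance,
which is what the restore path in the patch is doing (sketch):

	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
	/* RESET (aka OVF_CTRL): clear whatever happens to be set... */
	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
	/* ...then SET exactly the bits the guest had pending. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);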

> > +     wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> > +
> > +     for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > +             pmc = &pmu->gp_counters[i];
> > +             wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> > +             wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> > +     }
> > +
> > +     wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > +     for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > +             pmc = &pmu->fixed_counters[i];
> > +             wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> > +     }
> >  }
> >
> >  struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > --
> > 2.34.1
> >

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-04-11 21:21   ` Sean Christopherson
@ 2024-04-11 22:30     ` Jim Mattson
  2024-04-11 23:27       ` Sean Christopherson
  2024-04-13  2:10       ` Mi, Dapeng
  0 siblings, 2 replies; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 22:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024 at 2:21 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > +     if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
> > +             /*
> > +              * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
> > +              */
> > +             if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
> > +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
> > +             else {
> > +                     i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest,
> > +                                                    MSR_CORE_PERF_GLOBAL_CTRL);
> > +                     if (i < 0) {
> > +                             i = vmx->msr_autoload.guest.nr++;
> > +                             vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT,
> > +                                          vmx->msr_autoload.guest.nr);
> > +                     }
> > +                     vmx->msr_autoload.guest.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> > +                     vmx->msr_autoload.guest.val[i].value = 0;
>
> Eww, no.   Just make cpu_has_load_perf_global_ctrl() and VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL
> hard requirements for enabling passthrough mode.  And then have clear_atomic_switch_msr()
> yell if KVM tries to disable loading MSR_CORE_PERF_GLOBAL_CTRL.

Weren't you just complaining about the PMU version 4 constraint in
another patch? And here, you are saying, "Don't support anything older
than Sapphire Rapids."

Sapphire Rapids has PMU version 4, so if we require
VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, PMU version 4 is irrelevant.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-04-11 22:10     ` Jim Mattson
@ 2024-04-11 22:54       ` Sean Christopherson
  2024-04-11 23:08         ` Jim Mattson
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 22:54 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 2:54 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> > > +{
> > > +     struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> > > +     int i;
> > > +
> > > +     if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> > > +             pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> > > +     } else {
> > > +             i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest,
> > > +                                             MSR_CORE_PERF_GLOBAL_CTRL);
> > > +             if (i < 0)
> > > +                     return;
> > > +             pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
> >
> > As before, NAK to using the MSR load/store lists unless there's a *really* good
> > reason I'm missing.
> 
> The VM-exit control, "save IA32_PERF_GLOBAL_CTL," first appears in
> Sapphire Rapids. I think that's a compelling reason.

Well that's annoying.  When was PMU v4 introduced?  E.g. if it came in ICX, I'd
be sorely tempted to make VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL a hard requirement.

And has someone confirmed that the CPU saves into the MSR store list before
processing VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL?

Assuming we don't make VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL a hard requirement,
this code should be cleaned up and simplified.  It should be impossible to get
to this point with a passthrough PMU and no slot for saving guest GLOBAL_CTRL.

E.g. this could simply be:

	if (cpu_has_save_perf_global_ctrl())
		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
	else
		pmu->global_ctrl = *pmu->__global_ctrl;

where vmx_set_perf_global_ctrl() sets __global_ctrl to:

	pmu->__global_ctrl = &vmx->msr_autostore.guest.val[i].value;

KVM could store 'i', i.e. the slot, but in the end it's 4 bytes per vCPU (assuming
64-bit kernel, and an int to store the slot).

Oh, by the by, vmx_set_perf_global_ctrl() is buggy, as it neglects to *remove*
PERF_GLOBAL_CTRL from the lists if userspace sets CPUID multiple times.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-04-11 22:54       ` Sean Christopherson
@ 2024-04-11 23:08         ` Jim Mattson
  0 siblings, 0 replies; 181+ messages in thread
From: Jim Mattson @ 2024-04-11 23:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024 at 3:54 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 11, 2024, Jim Mattson wrote:
> > On Thu, Apr 11, 2024 at 2:54 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > > +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> > > > +{
> > > > +     struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> > > > +     int i;
> > > > +
> > > > +     if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> > > > +             pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> > > > +     } else {
> > > > +             i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest,
> > > > +                                             MSR_CORE_PERF_GLOBAL_CTRL);
> > > > +             if (i < 0)
> > > > +                     return;
> > > > +             pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
> > >
> > > As before, NAK to using the MSR load/store lists unless there's a *really* good
> > > reason I'm missing.
> >
> > The VM-exit control, "save IA32_PERF_GLOBAL_CTL," first appears in
> > Sapphire Rapids. I think that's a compelling reason.
>
> Well that's annoying.  When was PMU v4 introduced?  E.g. if it came in ICX, I'd
> be sorely tempted to make VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL a hard requirement.

Broadwell was v3. Skylake was v4.

> And has someone confirmed that the CPU saves into the MSR store list before
> processing VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL?

It's at the top of chapter 28 in volume 3 of the SDM.  MSRs may be
saved in the VM-exit MSR-store area before processor state is loaded
based in part on the host-state area and some VM-exit controls.
Anything else would be stupid. (Yes, I know that this is CPU design
we're talking about!)

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  2024-01-26  8:54 ` [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Xiong Zhang
@ 2024-04-11 23:12   ` Sean Christopherson
  2024-04-11 23:17     ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:12 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Implement emulated counter increment for passthrough PMU under KVM_REQ_PMU.
> Defer the counter increment to KVM_REQ_PMU handler because counter
> increment requests come from kvm_pmu_trigger_event() which can be triggered
> within the KVM_RUN inner loop or outside of the inner loop. This means the
> counter increment could happen before or after PMU context switch.
> 
> So processing the counter increment in one place keeps the implementation simple.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/pmu.c              | 52 ++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/pmu.h              |  1 +
>  arch/x86/kvm/x86.c              |  8 +++--
>  4 files changed, 60 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 869de0d81055..9080319751de 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -532,6 +532,7 @@ struct kvm_pmu {
>  	u64 fixed_ctr_ctrl_mask;
>  	u64 global_ctrl;
>  	u64 global_status;
> +	u64 synthesized_overflow;

There's no reason for this to be a per-PMU field, it's only ever used in
kvm_passthrough_pmu_handle_event().

>  	u64 counter_bitmask[2];
>  	u64 global_ctrl_mask;
>  	u64 global_status_mask;
> @@ -550,6 +551,7 @@ struct kvm_pmu {
>  		atomic64_t __reprogram_pmi;
>  	};
>  	DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
> +	DECLARE_BITMAP(incremented_pmc_idx, X86_PMC_IDX_MAX);
>  	DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
>  
>  	u64 ds_area;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 7b0bac1ac4bf..9e62e96fe48a 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -449,6 +449,26 @@ static bool kvm_passthrough_pmu_incr_counter(struct kvm_pmc *pmc)
>  	return false;
>  }
>  
> +void kvm_passthrough_pmu_handle_event(struct kvm_vcpu *vcpu)

Huh.  Why do we call the existing helper kvm_pmu_handle_event()?  It's not handling
an event, it's reprogramming counters.

Can you add a patch to clean that up?  It doesn't matter terribly with the existing
code, but once kvm_handle_guest_pmi() exists, the name becomes quite confusing,
e.g. I was expecting this to be the handler for guest PMIs.

> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	int bit;
> +
> +	for_each_set_bit(bit, pmu->incremented_pmc_idx, X86_PMC_IDX_MAX) {

I don't love the "incremented_pmc_idx" name.  It's specifically for emulated
events, that should ideally be clear in the name.

And does tracking the emulated counters actually buy anything?  Iterating
over all PMCs and checking emulated_counter doesn't seem like it'd be measurably
slow, especially not when this path is likely writing multiple MSRs.

Wait, why use that and not reprogram_pmi?


> +		struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);
> +
> +		if (kvm_passthrough_pmu_incr_counter(pmc)) {

kvm_passthrough_pmu_incr_counter() is *super* misleading.  (a) it's not an
"increment" in standard x86 and kernel terminology, which is "Increment by 1",
and (b) it's not actually bumping the count, it's simply moving an *already*
incremented count from emulated_counter to pmc->counter.

To avoid bikeshedding, and because boolean returns are no fun, just open code it.

		if (!pmc->emulated_counter)
			continue;

		pmc->counter += pmc->emulated_counter;
		pmc->emulated_counter = 0;
		pmc->counter &= pmc_bitmask(pmc);

		/* comment goes here */
		if (pmc->counter)
			continue;

		if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
			kvm_make_request(KVM_REQ_PMI, vcpu);

		pmu->global_status |= BIT_ULL(pmc->idx);

> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 6f44fe056368..0fc37a06fe48 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -277,6 +277,7 @@ static inline bool is_passthrough_pmu_enabled(struct kvm_vcpu *vcpu)
>  
>  void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
> +void kvm_passthrough_pmu_handle_event(struct kvm_vcpu *vcpu);
>  int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
>  bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx);
>  bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fe7da1a16c3b..1bbf312cbd73 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10726,8 +10726,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		}
>  		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
>  			record_steal_time(vcpu);
> -		if (kvm_check_request(KVM_REQ_PMU, vcpu))
> -			kvm_pmu_handle_event(vcpu);
> +		if (kvm_check_request(KVM_REQ_PMU, vcpu)) {
> +			if (is_passthrough_pmu_enabled(vcpu))
> +				kvm_passthrough_pmu_handle_event(vcpu);
> +			else
> +				kvm_pmu_handle_event(vcpu);

This seems like a detail that belongs in the PMU code.  E.g. if we get to a point
where the two PMU flavors can share code (and I think we can/should), then there's
no need or desire for two separate APIs.

> +		}
>  		if (kvm_check_request(KVM_REQ_PMI, vcpu))
>  			kvm_pmu_deliver_pmi(vcpu);
>  #ifdef CONFIG_KVM_SMM
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  2024-04-11 23:12   ` Sean Christopherson
@ 2024-04-11 23:17     ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:17 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +	int bit;
> > +
> > +	for_each_set_bit(bit, pmu->incremented_pmc_idx, X86_PMC_IDX_MAX) {
> 
> I don't love the "incremented_pmc_idx" name.  It's specifically for emulated
> events, that should ideally be clear in the name.
> 
> And does tracking the emulated counters actually buy anything?  Iterating
> over all PMCs and checking emulated_counter doesn't seem like it'd be measurably
> slow, especially not when this path is likely writing multiple MSRs.
> 
> Wait, why use that and not reprogram_pmi?

If the name is a sticking point, just rename it to something generic, e.g.
dirty_pmcs or something.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU
  2024-01-26  8:54 ` [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU Xiong Zhang
@ 2024-04-11 23:18   ` Sean Christopherson
  2024-04-18 21:54     ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:18 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Separate passthrough PMU logic from non-passthrough vPMU code. There are
> two places in passthrough vPMU where set/get_msr() may call into the
> existing non-passthrough vPMU code: 1) set/get counters; 2) set global_ctrl
> MSR.
> 
> In the former case, non-passthrough vPMU will call into
> pmc_{read,write}_counter() which wires to the perf API. Update these
> functions to avoid the perf API invocation.
> 
> The 2nd case is where a global_ctrl MSR write invokes reprogram_counters(),
> which in turn invokes the non-passthrough PMU logic. So use the
> pmu->passthrough flag to gate the call.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/pmu.c |  4 +++-
>  arch/x86/kvm/pmu.h | 10 +++++++++-
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 9e62e96fe48a..de653a67ba93 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -652,7 +652,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (pmu->global_ctrl != data) {
>  			diff = pmu->global_ctrl ^ data;
>  			pmu->global_ctrl = data;
> -			reprogram_counters(pmu, diff);
> +			/* Passthrough vPMU never reprogram counters. */
> +			if (!pmu->passthrough)

This should probably be handled in reprogram_counters(), otherwise we'll be
playing whack-a-mole, e.g. this misses MSR_IA32_PEBS_ENABLE, which is benign, but
only because PEBS isn't yet supported.
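
As a rough illustration (not the final form), the guard could live at the top
of reprogram_counters() so that every caller is covered:

	static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
	{
		/* The mediated/passthrough vPMU never creates perf events to reprogram. */
		if (pmu->passthrough)
			return;

		/* ... existing reprogramming logic, unchanged ... */
	}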

> +				reprogram_counters(pmu, diff);
>  		}
>  		break;
>  	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 0fc37a06fe48..ab8d4a8e58a8 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -70,6 +70,9 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  	u64 counter, enabled, running;
>  
>  	counter = pmc->counter;
> +	if (pmc_to_pmu(pmc)->passthrough)
> +		return counter & pmc_bitmask(pmc);

Won't perf_event always be NULL for mediated counters?  I.e. this can be dropped,
I think.
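
If so, the hunk above collapses back to the existing flow, since the
perf_event branch is simply never taken for mediated counters, i.e. roughly:

	counter = pmc->counter;
	if (pmc->perf_event && !pmc->is_paused)
		counter += perf_event_read_value(pmc->perf_event, &enabled, &running);
	return counter & pmc_bitmask(pmc);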

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 41/41] KVM: nVMX: Add nested virtualization support for passthrough PMU
  2024-01-26  8:54 ` [RFC PATCH 41/41] KVM: nVMX: Add nested virtualization support for passthrough PMU Xiong Zhang
@ 2024-04-11 23:21   ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:21 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> Add nested virtualization support for passthrough PMU by combining the MSR
> interception bitmaps of vmcs01 and vmcs12. Readers may argue that even without
> this patch, nested virtualization works for the passthrough PMU because L1 will
> see Perfmon v2 and will have to use the legacy vPMU implementation if it is
> Linux. However, any assumption made about L1 may be invalid, e.g., L1 may not
> even be Linux.
> 
> If both L0 and L1 pass through PMU MSRs, the correct behavior is to allow
> MSR accesses from L2 to directly touch the HW MSRs, since both L0 and L1
> pass through the access.
> 
> However, in the current implementation, without adding anything for nested,
> KVM always sets the MSR interception bits in vmcs02. As a result, L0 will
> emulate all MSR reads/writes for L2, leading to errors, since the
> current passthrough vPMU never implements set_msr() and get_msr() for any
> counter access except counter accesses from the VMM side.
> 
> So fix the issue by setting up the correct MSR interception for PMU MSRs.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 52 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..95e1c78152da 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -561,6 +561,55 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
>  						   msr_bitmap_l0, msr);
>  }
>  
> +/* Pass PMU MSRs to nested VM if L0 and L1 are set to passthrough. */
> +static void nested_vmx_set_passthru_pmu_intercept_for_msr(struct kvm_vcpu *vcpu,

Heh, 50 instances of passthrough, and then someone decides to shave a few characters
with passthru :-)  Long live mediated PMU!!!

> +							  unsigned long *msr_bitmap_l1,
> +							  unsigned long *msr_bitmap_l0)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int i;
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_ARCH_PERFMON_EVENTSEL0 + i,
> +						 MSR_TYPE_RW);
> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_IA32_PERFCTR0 + i,
> +						 MSR_TYPE_RW);
> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_IA32_PMC0 + i,
> +						 MSR_TYPE_RW);
> +	}
> +
> +	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++) {

Curly braces aren't needed, and this can use "pmu" instead of "vcpu_to_pmu".

> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_CORE_PERF_FIXED_CTR0 + i,
> +						 MSR_TYPE_RW);
> +	}
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_FIXED_CTR_CTRL,
> +					 MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_GLOBAL_STATUS,
> +					 MSR_TYPE_RW);
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_GLOBAL_CTRL,
> +					 MSR_TYPE_RW);
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> +					 MSR_TYPE_RW);
> +}
> +
>  /*
>   * Merge L0's and L1's MSR bitmap, return false to indicate that
>   * we do not use the hardware.
> @@ -660,6 +709,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>  					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>  
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		nested_vmx_set_passthru_pmu_intercept_for_msr(vcpu, msr_bitmap_l1, msr_bitmap_l0);

More code that's probably cleaner if the helper handles the PMU type.

	nested_vmx_set_pmu_msr_intercepts(vcpu, msr_bitmap_l1, msr_bitmap_l0);

and then

	if (!enable_mediated_pmu || !pmu->version)
		return;

> +
>  	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>  
>  	vmx->nested.force_msr_bitmap_recalc = false;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
                   ` (41 preceding siblings ...)
  2024-04-11 17:03 ` [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Sean Christopherson
@ 2024-04-11 23:25 ` Sean Christopherson
  2024-04-11 23:56   ` Mingwei Zhang
  42 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:25 UTC (permalink / raw)
  To: Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> Dapeng Mi (4):
>   x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
>   KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
>   KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
>   KVM: x86/pmu: Clear PERF_METRICS MSR for guest
> 
> Kan Liang (2):
>   perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
>   perf: Support guest enter/exit interfaces
> 
> Mingwei Zhang (22):
>   perf: core/x86: Forbid PMI handler when guest own PMU
>   perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
>     x86_pmu_cap
>   KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and
>     propage to KVM instance
>   KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
>   KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
>   KVM: x86/pmu: Allow RDPMC pass through
>   KVM: x86/pmu: Create a function prototype to disable MSR interception
>   KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR
>     interception
>   KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with
>     perf capabilities
>   KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
>   KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
>     context
>   KVM: x86/pmu: Introduce function prototype for Intel CPU to
>     save/restore PMU context
>   KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid
>     information leakage
>   KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU
>     capability
>   KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
>   KVM: x86/pmu: Make check_pmu_event_filter() an exported function
>   KVM: x86/pmu: Allow writing to event selector for GP counters if event
>     is allowed
>   KVM: x86/pmu: Allow writing to fixed counter selector if counter is
>     exposed
>   KVM: x86/pmu: Introduce PMU helper to increment counter
>   KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
>   KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from
>     non-passthrough vPMU
>   KVM: nVMX: Add nested virtualization support for passthrough PMU
> 
> Xiong Zhang (13):
>   perf: Set exclude_guest onto nmi_watchdog
>   perf: core/x86: Add support to register a new vector for PMI handling
>   KVM: x86/pmu: Register PMI handler for passthrough PMU
>   perf: x86: Add function to switch PMI handler
>   perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
>   KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
>   KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
>   KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
>   KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
>   KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
>   KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
>   KVM: x86/pmu: Intercept EVENT_SELECT MSR
>   KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR

All done with this pass.  Looks quite good, nothing on the KVM side scares me.  Nice!

I haven't spent much time thinking about whether or not the overall implementation
is correct/optimal, i.e. I mostly just reviewed the mechanics.  I'll make sure to
spend a bit more time on that for the next RFC.

Please be sure to rebase to kvm-x86/next for the next RFC, there are a few patches
that will change quite a bit.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-04-11 22:30     ` Jim Mattson
@ 2024-04-11 23:27       ` Sean Christopherson
  2024-04-13  2:10       ` Mi, Dapeng
  1 sibling, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:27 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 2:21 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > +     if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
> > > +             /*
> > > +              * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
> > > +              */
> > > +             if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
> > > +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
> > > +             else {
> > > +                     i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest,
> > > +                                                    MSR_CORE_PERF_GLOBAL_CTRL);
> > > +                     if (i < 0) {
> > > +                             i = vmx->msr_autoload.guest.nr++;
> > > +                             vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT,
> > > +                                          vmx->msr_autoload.guest.nr);
> > > +                     }
> > > +                     vmx->msr_autoload.guest.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> > > +                     vmx->msr_autoload.guest.val[i].value = 0;
> >
> > Eww, no.   Just make cpu_has_load_perf_global_ctrl() and VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL
> > hard requirements for enabling passthrough mode.  And then have clear_atomic_switch_msr()
> > yell if KVM tries to disable loading MSR_CORE_PERF_GLOBAL_CTRL.
> 
> Weren't you just complaining about the PMU version 4 constraint in
> another patch? And here, you are saying, "Don't support anything older
> than Sapphire Rapids."

Heh, I didn't realize VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL was SPR+ when I wrote
this, I thought it existed alongside the "load" controls.

> Sapphire Rapids has PMU version 4, so if we require
> VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, PMU version 4 is irrelevant.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-11 22:19     ` Jim Mattson
@ 2024-04-11 23:31       ` Sean Christopherson
  2024-04-13  3:19         ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-11 23:31 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 11, 2024, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 2:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > +     /* Clear host global_ctrl and global_status MSR if non-zero. */
> > > +     wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >
> > Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >
> > > +     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > > +     if (global_status)
> > > +             wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >
> > This seems especially silly, isn't the full MSR being written below?  Or am I
> > misunderstanding how these things work?
> 
> LOL! You expect CPU design to follow basic logic?!?
> 
> Writing a 1 to a bit in IA32_PERF_GLOBAL_STATUS_SET sets the
> corresponding bit in IA32_PERF_GLOBAL_STATUS to 1.
> 
> Writing a 0 to a bit in to IA32_PERF_GLOBAL_STATUS_SET is a nop.
> 
> To clear a bit in IA32_PERF_GLOBAL_STATUS, you need to write a 1 to
> the corresponding bit in IA32_PERF_GLOBAL_STATUS_RESET (aka
> IA32_PERF_GLOBAL_OVF_CTRL).

If only C had a way to annotate what the code is doing. :-)

> > > +     wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);

IIUC, that means this should be:

	if (pmu->global_status)
		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);

or even better:

	toggle = pmu->global_status ^ global_status;
	if (global_status & toggle)
		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
	if (pmu->global_status & toggle)
		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-11 23:25 ` Sean Christopherson
@ 2024-04-11 23:56   ` Mingwei Zhang
  0 siblings, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-11 23:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

Hi Sean,

On Thu, Apr 11, 2024 at 4:26 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > Dapeng Mi (4):
> >   x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
> >   KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
> >   KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
> >   KVM: x86/pmu: Clear PERF_METRICS MSR for guest
> >
> > Kan Liang (2):
> >   perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
> >   perf: Support guest enter/exit interfaces
> >
> > Mingwei Zhang (22):
> >   perf: core/x86: Forbid PMI handler when guest own PMU
> >   perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
> >     x86_pmu_cap
> >   KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and
> >     propage to KVM instance
> >   KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
> >   KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
> >   KVM: x86/pmu: Allow RDPMC pass through
> >   KVM: x86/pmu: Create a function prototype to disable MSR interception
> >   KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR
> >     interception
> >   KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with
> >     perf capabilities
> >   KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
> >   KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
> >     context
> >   KVM: x86/pmu: Introduce function prototype for Intel CPU to
> >     save/restore PMU context
> >   KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid
> >     information leakage
> >   KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU
> >     capability
> >   KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
> >   KVM: x86/pmu: Make check_pmu_event_filter() an exported function
> >   KVM: x86/pmu: Allow writing to event selector for GP counters if event
> >     is allowed
> >   KVM: x86/pmu: Allow writing to fixed counter selector if counter is
> >     exposed
> >   KVM: x86/pmu: Introduce PMU helper to increment counter
> >   KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
> >   KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from
> >     non-passthrough vPMU
> >   KVM: nVMX: Add nested virtualization support for passthrough PMU
> >
> > Xiong Zhang (13):
> >   perf: Set exclude_guest onto nmi_watchdog
> >   perf: core/x86: Add support to register a new vector for PMI handling
> >   KVM: x86/pmu: Register PMI handler for passthrough PMU
> >   perf: x86: Add function to switch PMI handler
> >   perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
> >   KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
> >   KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
> >   KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
> >   KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
> >   KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
> >   KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
> >   KVM: x86/pmu: Intercept EVENT_SELECT MSR
> >   KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR
>
> All done with this pass.  Looks quite good, nothing on the KVM side scares me.  Nice!

yay! Thank you Sean for the review!

>
> I haven't spent much time thinking about whether or not the overall implementation
> correct/optimal, i.e. I mostly just reviewed the mechanics.  I'll make sure to
> spend a bit more time on that for the next RFC.

Yes, I am expecting the debate/discussion in PUCK after v2 is sent
out. There should be room for optimization as well.

>
> Please be sure to rebase to kvm-x86/next for the next RFC, there are a few patches
> that will change quite a bit.

Will do the rebase, and all of the feedback will be taken into account in the
updates in v2. In v2, we will incorporate passthrough vPMU with AMD
support. Will do our best to get it in high quality.

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-11 17:03 ` [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Sean Christopherson
@ 2024-04-12  2:19   ` Zhang, Xiong Y
  2024-04-12 18:32     ` Sean Christopherson
  2024-04-18 20:46   ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-12  2:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 4/12/2024 1:03 AM, Sean Christopherson wrote:
> <bikeshed>
> 
> I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> emulates the control plane (controls and event selectors), while the data is
> fully passed through (counters).
> 
> </bikeshed>
> 
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> 
>> 1. host system wide / QEMU events handling during VM running
>>    At VM-entry, all the host perf events which use host x86 PMU will be
>>    stopped. These events with attr.exclude_guest = 1 will be stopped here
>>    and re-started after vm-exit. These events without attr.exclude_guest=1
>>    will be in an error state, and they cannot recover to the active state even
>>    if the guest stops running. This impacts host perf a lot and requires
>>    host system wide perf events to have attr.exclude_guest=1.
>>
>>    This requires the QEMU process's perf events to have attr.exclude_guest=1 also.
>>
>>    During VM running, perf event creation for system wide and QEMU
>>    process events without attr.exclude_guest=1 fails with -EBUSY.
>>
>> 2. NMI watchdog
>>    The perf event for the NMI watchdog is a system wide, cpu pinned event; it
>>    will also be stopped during VM running, but it doesn't have
>>    attr.exclude_guest=1, so we add that in this RFC. This still means the NMI
>>    watchdog loses its function during VM running.
>>
>>    Two candidates exist for replacing the perf event of the NMI watchdog:
>>    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
>>    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
> 
> I think the simplest solution is to allow mediated PMU usage if and only if
> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> problem to solve.
Makes sense. KVM should not affect the host's high priority work.
The NMI watchdog is a client of perf and is a system wide perf event, and perf can't distinguish whether a system wide perf event is the NMI watchdog or something else, so how about we extend this suggestion to all system wide perf events?
The mediated PMU would only be allowed when all system wide perf events are disabled or nonexistent at VM creation.
But the NMI watchdog is usually enabled, so this will limit mediated PMU usage.
> 
>> 3. Dedicated kvm_pmi_vector
>>    In the emulated vPMU, the host PMI handler notifies KVM to inject a virtual
>>    PMI into the guest when a physical PMI belongs to a guest counter. If the
>>    same mechanism is used in the passthrough vPMU and PMI skid exists,
>>    which causes a physical PMI belonging to the guest to arrive after VM-exit,
>>    then the host PMI handler couldn't identify whether this PMI belongs to the
>>    host or the guest.
>>    So this RFC uses a dedicated kvm_pmi_vector: only PMIs belonging to the guest
>>    use this vector. PMIs belonging to the host still use the NMI
>>    vector.
>>
>>    Without considering PMI skid, especially on AMD, the host NMI vector
>>    could be used for guest PMIs also; this method is simpler and doesn't
> 
> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> the complexity is a wash, just in different locations, and I highly doubt it's
> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> LVTPC.
When kvm_intel.pt_mode=PT_MODE_HOST_GUEST, the guest PT PMI is an NMI multiplexed between guest and host; we could extend the guest PT PMI framework to the mediated PMU, so I think this is simpler.
> 
> E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> 
>>    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
>>    didn't meet the skid PMI issue on modern Intel processors.
>>
>> 4. per-VM passthrough mode configuration
>>    The current RFC uses a KVM module enable_passthrough_pmu RO parameter,
>>    which decides whether the vPMU is in passthrough mode or emulated mode at
>>    KVM module load time.
>>    Do we need the capability of per-VM passthrough mode configuration?
>>    Then an admin could launch some non-passthrough VMs and profile those
>>    non-passthrough VMs from the host, but the admin still could not profile
>>    all the VMs once a passthrough VM exists. This means passthrough vPMU
>>    and emulated vPMU would mix on one platform, which is challenging to implement.
>>    As the commit message of commit 0011 explains, the main challenge is that
>>    passthrough vPMU and emulated vPMU have different vPMU features; this
>>    ends up with two different values for kvm_cap.supported_perf_cap, which
>>    is initialized at module load time. Supporting it needs more refactoring.
> 
> I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> existing vPMU support entirely, but that's probably not be realistic, at least not
> in the near future.
> 
>> Remain Works
>> ===
>> 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> 
> Before this gets out of its "RFC" phase, I would at least like line of sight to
> a more optimized switch.  I 100% agree that starting with a conservative
> implementation is the way to go, and the kernel absolutely needs to be able to
> profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> 
> But I also don't want to get into a situation where can't figure out a clean,
> robust way to do the optimized context switch without needing (another) massive
> rewrite.
> 
The current PMU context switch happens at each VM-entry/exit, which impacts guest performance even if the guest doesn't use the PMU. As our first optimization, we will switch the PMU context only when the guest really uses the PMU.

thanks

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling
  2024-04-11 17:10   ` Sean Christopherson
  2024-04-11 19:05     ` Sean Christopherson
@ 2024-04-12  3:56     ` Zhang, Xiong Y
  2024-04-13  1:17       ` Mi, Dapeng
  1 sibling, 1 reply; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-12  3:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang



On 4/12/2024 1:10 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@intel.com>
>>
>> Create a new vector in the host IDT for PMI handling within a passthrough
>> vPMU implementation. In addition, add a function to allow the registration
>> of the handler and a function to switch the PMI handler.
>>
>> This is the preparation work to support KVM passthrough vPMU to handle its
>> own PMIs without interference from PMI handler of the host PMU.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/include/asm/hardirq.h           |  1 +
>>  arch/x86/include/asm/idtentry.h          |  1 +
>>  arch/x86/include/asm/irq.h               |  1 +
>>  arch/x86/include/asm/irq_vectors.h       |  2 +-
>>  arch/x86/kernel/idt.c                    |  1 +
>>  arch/x86/kernel/irq.c                    | 29 ++++++++++++++++++++++++
>>  tools/arch/x86/include/asm/irq_vectors.h |  1 +
>>  7 files changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
>> index 66837b8c67f1..c1e2c1a480bf 100644
>> --- a/arch/x86/include/asm/hardirq.h
>> +++ b/arch/x86/include/asm/hardirq.h
>> @@ -19,6 +19,7 @@ typedef struct {
>>  	unsigned int kvm_posted_intr_ipis;
>>  	unsigned int kvm_posted_intr_wakeup_ipis;
>>  	unsigned int kvm_posted_intr_nested_ipis;
>> +	unsigned int kvm_vpmu_pmis;
> 
> Somewhat off topic, does anyone actually ever use these particular stats?  If the
> desire is to track _all_ IRQs, why not have an array and bump the counts in common
> code?
It is used in arch_show_interrupts() for /proc/interrupts.
> 
>>  #endif
>>  	unsigned int x86_platform_ipis;	/* arch dependent */
>>  	unsigned int apic_perf_irqs;
>> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
>> index 05fd175cec7d..d1b58366bc21 100644
>> --- a/arch/x86/include/asm/idtentry.h
>> +++ b/arch/x86/include/asm/idtentry.h
>> @@ -675,6 +675,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
>>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
>>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
>> +DECLARE_IDTENTRY_SYSVEC(KVM_VPMU_VECTOR,	        sysvec_kvm_vpmu_handler);
> 
> I vote for KVM_VIRTUAL_PMI_VECTOR.  I don't see any reason to abbreviate "virtual",
> and the vector is for a Performance Monitoring Interrupt.
yes, KVM_GUEST_PMI_VECTOR in your next reply is better.
> 
>>  #endif
>>  
>>  #if IS_ENABLED(CONFIG_HYPERV)
>> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
>> index 836c170d3087..ee268f42d04a 100644
>> --- a/arch/x86/include/asm/irq.h
>> +++ b/arch/x86/include/asm/irq.h
>> @@ -31,6 +31,7 @@ extern void fixup_irqs(void);
>>  
>>  #ifdef CONFIG_HAVE_KVM
>>  extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
>> +extern void kvm_set_vpmu_handler(void (*handler)(void));
> 
> virtual_pmi_handler()
> 
>>  #endif
>>  
>>  extern void (*x86_platform_ipi_callback)(void);
>> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
>> index 3a19904c2db6..120403572307 100644
>> --- a/arch/x86/include/asm/irq_vectors.h
>> +++ b/arch/x86/include/asm/irq_vectors.h
>> @@ -77,7 +77,7 @@
>>   */
>>  #define IRQ_WORK_VECTOR			0xf6
>>  
>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>> +#define KVM_VPMU_VECTOR			0xf5
> 
> This should be inside
> 
> 	#ifdef CONFIG_HAVE_KVM
> 
> no?
yes, it should have #if IS_ENABLED(CONFIG_KVM)
> 
>>  #define DEFERRED_ERROR_VECTOR		0xf4
>>  
>>  /* Vector on which hypervisor callbacks will be delivered */
>> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
>> index 8857abc706e4..6944eec251f4 100644
>> --- a/arch/x86/kernel/idt.c
>> +++ b/arch/x86/kernel/idt.c
>> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
>>  	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>>  	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>>  	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
>> +	INTG(KVM_VPMU_VECTOR,		        asm_sysvec_kvm_vpmu_handler),
> 
> kvm_virtual_pmi_handler
> 
>> @@ -332,6 +351,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>>  	apic_eoi();
>>  	inc_irq_stat(kvm_posted_intr_nested_ipis);
>>  }
>> +
>> +/*
>> + * Handler for KVM_PT_PMU_VECTOR.
> 
> Heh, not sure where the PT part came from...
I will change it to KVM_GUEST_PMI_VECTOR
> 
>> + */
>> +DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_vpmu_handler)
>> +{
>> +	apic_eoi();
>> +	inc_irq_stat(kvm_vpmu_pmis);
>> +	kvm_vpmu_handler();
>> +}
>>  #endif
>>  
>>  
>> diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
>> index 3a19904c2db6..3773e60f1af8 100644
>> --- a/tools/arch/x86/include/asm/irq_vectors.h
>> +++ b/tools/arch/x86/include/asm/irq_vectors.h
>> @@ -85,6 +85,7 @@
>>  
>>  /* Vector for KVM to deliver posted interrupt IPI */
>>  #ifdef CONFIG_HAVE_KVM
>> +#define KVM_VPMU_VECTOR			0xf5
> 
> Heh, and your copy+paste is out of date.
Got it. 0xf5 isn't aligned with 0xf2, and the above comment should be moved before POSTED_INTR_VECTOR.

thanks
> 
>>  #define POSTED_INTR_VECTOR		0xf2
>>  #define POSTED_INTR_WAKEUP_VECTOR	0xf1
>>  #define POSTED_INTR_NESTED_VECTOR	0xf0
>> -- 
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU
  2024-04-11 19:07   ` Sean Christopherson
@ 2024-04-12  5:44     ` Zhang, Xiong Y
  0 siblings, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-12  5:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang



On 4/12/2024 3:07 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@intel.com>
>>
>> Add function to register/unregister PMI handler at KVM module
>> initialization and destroy time. This allows the host PMU with passthough
>> capability enabled switch PMI handler at PMU context switch time.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/x86.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 2c924075f6f1..4432e736129f 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10611,6 +10611,18 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
>>  }
>>  EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
>>  
>> +void kvm_passthrough_pmu_handler(void)
> 
> s/pmu/pmi, and this needs a verb.  Maybe kvm_handle_guest_pmi()?  Definitely
> open to other names.
kvm_handle_guest_pmi() is ok. 
> 
>> +{
>> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>> +
>> +	if (!vcpu) {
>> +		pr_warn_once("%s: no running vcpu found!\n", __func__);
> 
> Unless I misunderstand the code, this can/should be a full WARN_ON_ONCE.  If a
> PMI skids all the way past vcpu_put(), we've got big problems.
Yes, it is a big problem and the user should be notified.
>  
>> +		return;
>> +	}
>> +
>> +	kvm_make_request(KVM_REQ_PMI, vcpu);
>> +}
>> +
>>  /*
>>   * Called within kvm->srcu read side.
>>   * Returns 1 to let vcpu_run() continue the guest execution loop without
>> @@ -13815,6 +13827,7 @@ static int __init kvm_x86_init(void)
>>  {
>>  	kvm_mmu_x86_module_init();
>>  	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
>> +	kvm_set_vpmu_handler(kvm_passthrough_pmu_handler);
> 
> Hmm, a few patches late, but the "kvm" scope is weird.  This calls a core x86
> function, not a KVM function.
> 
> And to reduce exports and copy+paste, what about something like this?
> 
> void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
> {
> 	if (!handler)
> 		handler = dummy_handler;
> 
> 	if (vector == POSTED_INTR_WAKEUP_VECTOR)
> 		kvm_posted_intr_wakeup_handler = handler;
> 	else if (vector == KVM_GUEST_PMI_VECTOR)
> 		kvm_guest_pmi_handler = handler;
> 	else
> 		WARN_ON_ONCE(1);
> 
> 	if (handler == dummy_handler)
> 		synchronize_rcu();
> }
> EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
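
For reference, a sketch of how KVM could then register/unregister the handler
(names follow the earlier suggestions in this thread and may change):

	/* at KVM module init */
	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, kvm_handle_guest_pmi);

	/* at KVM module exit */
	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, NULL);
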
Good suggestion. I will follow it in the next version.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler
  2024-04-11 19:17   ` Sean Christopherson
  2024-04-11 19:34     ` Sean Christopherson
@ 2024-04-12  5:57     ` Zhang, Xiong Y
  1 sibling, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-12  5:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang



On 4/12/2024 3:17 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@intel.com>
>>
>> Add function to switch PMI handler since passthrough PMU and host PMU will
>> use different interrupt vectors.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/events/core.c            | 15 +++++++++++++++
>>  arch/x86/include/asm/perf_event.h |  3 +++
>>  2 files changed, 18 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 40ad1425ffa2..3f87894d8c8e 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -701,6 +701,21 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>>  
>> +void perf_guest_switch_to_host_pmi_vector(void)
>> +{
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +}
>> +EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
>> +
>> +void perf_guest_switch_to_kvm_pmi_vector(void)
>> +{
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
>> +}
>> +EXPORT_SYMBOL_GPL(perf_guest_switch_to_kvm_pmi_vector);
> 
> Why slice and dice the context switch if it's all in perf?  Just do this in
> perf_guest_enter().  
> 
As perf_guest_enter() is in the perf core, which manages all PMUs, while switch_pmi_vector is for the x86 core PMU only, switch_pmi_vector is put in the x86 PMU driver. A PMU driver can call perf core functions directly, and the perf core manages a PMU through pmu->ops and pmu->flags. If switch_pmi_vector were called in perf_guest_enter(), extra interfaces would have to be added to pmu->ops, which would impact the other PMU drivers.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler
  2024-04-11 19:34     ` Sean Christopherson
@ 2024-04-12  6:03       ` Zhang, Xiong Y
  0 siblings, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-12  6:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang



On 4/12/2024 3:34 AM, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Sean Christopherson wrote:
>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>> From: Xiong Zhang <xiong.y.zhang@intel.com>
>>>
>>> Add function to switch PMI handler since passthrough PMU and host PMU will
>>> use different interrupt vectors.
>>>
>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>>  arch/x86/events/core.c            | 15 +++++++++++++++
>>>  arch/x86/include/asm/perf_event.h |  3 +++
>>>  2 files changed, 18 insertions(+)
>>>
>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>> index 40ad1425ffa2..3f87894d8c8e 100644
>>> --- a/arch/x86/events/core.c
>>> +++ b/arch/x86/events/core.c
>>> @@ -701,6 +701,21 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
>>>  }
>>>  EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>>>  
>>> +void perf_guest_switch_to_host_pmi_vector(void)
>>> +{
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	apic_write(APIC_LVTPC, APIC_DM_NMI);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
>>> +
>>> +void perf_guest_switch_to_kvm_pmi_vector(void)
>>> +{
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_switch_to_kvm_pmi_vector);
>>
>> Why slice and dice the context switch if it's all in perf?  Just do this in
>> perf_guest_enter().  
> 
> Ah, because perf_guest_enter() isn't x86-specific.
> 
> That can be solved by having the exported APIs be arch specific, e.g.
> x86_perf_guest_enter(), and making perf_guest_enter() a perf-internal API.
> 
> That has the advantage of making it impossible to call perf_guest_enter() on an
> unsupported architecture (modulo perf bugs).
> 
Makes sense. I will try it.

thanks

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
  2024-04-11 19:21   ` Sean Christopherson
@ 2024-04-12  6:17     ` Zhang, Xiong Y
  0 siblings, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-12  6:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang


On 4/12/2024 3:21 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@intel.com>
>>
>> When guest clear LVTPC_MASK bit in guest PMI handler at PMU passthrough
>> mode, this bit should be reflected onto HW, otherwise HW couldn't generate
>> PMI again during VM running until it is cleared.
> 
> This fixes a bug in the previous patch, i.e. this should not be a standalone
> patch.
> 
>>
>> This commit set HW LVTPC_MASK bit at PMU vecctor switching to KVM PMI
>> vector.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/events/core.c            | 9 +++++++--
>>  arch/x86/include/asm/perf_event.h | 2 +-
>>  arch/x86/kvm/lapic.h              | 1 -
>>  3 files changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 3f87894d8c8e..ece042cfb470 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -709,13 +709,18 @@ void perf_guest_switch_to_host_pmi_vector(void)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_guest_switch_to_host_pmi_vector);
>>  
>> -void perf_guest_switch_to_kvm_pmi_vector(void)
>> +void perf_guest_switch_to_kvm_pmi_vector(bool mask)
>>  {
>>  	lockdep_assert_irqs_disabled();
>>  
>> -	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
>> +	if (mask)
>> +		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR |
>> +			   APIC_LVT_MASKED);
>> +	else
>> +		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR);
>>  }
> 
> Or more simply:
> 
> void perf_guest_enter(u32 guest_lvtpc)
> {
> 	...
> 
> 	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_VPMU_VECTOR |
> 			       (guest_lvtpc & APIC_LVT_MASKED));
> }
> 
> and then on the KVM side:
> 
> 	perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> 
> because an in-kernel APIC should be a hard requirement for the mediated PMU.
This is simpler and we will follow this.

thanks

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities
  2024-04-11 21:50     ` Jim Mattson
@ 2024-04-12 16:01       ` Sean Christopherson
  0 siblings, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-12 16:01 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao,
	Xiong Zhang

On Thu, Apr 11, 2024, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > From: Mingwei Zhang <mizhang@google.com>
> > >
> > > Intercept full-width GP counter MSRs in passthrough PMU if guest does not
> > > have the capability to write in full-width. In addition, opportunistically
> > > add a warning if non-full-width counter MSRs are also intercepted, in which
> > > case it is a clear mistake.
> > >
> > > Co-developed-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > > Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/vmx/pmu_intel.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > > index 7f6cabb2c378..49df154fbb5b 100644
> > > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > > @@ -429,6 +429,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >       default:
> > >               if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
> > >                   (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
> > > +                     if (is_passthrough_pmu_enabled(vcpu) &&
> > > +                         !(msr & MSR_PMC_FULL_WIDTH_BIT) &&
> > > +                         !msr_info->host_initiated) {
> > > +                             pr_warn_once("passthrough PMU never intercepts non-full-width PMU counters\n");
> > > +                             return 1;
> >
> > This is broken, KVM must be prepared to handle WRMSR (and RDMSR and RDPMC) that
> > come in through the emulator.
> 
> Don't tell me that we are still supporting CPUs that don't have
> "unrestricted guest"! Sigh.

Heh, KVM still supports CPUs without VMX virtual NMIs :-)

Practically speaking, if we want to eliminate things like emulated WRMSR/RDMSR,
a Kconfig to build a reduced emulator would be the way to go.  But while a reduced
emulator would be nice for host security, I don't think it would buy us much from
a code perspective, since KVM still needs to handle host userspace MSR accesses.

E.g. KVM could have conditional sanity checks for MSRs that are supposed to be
passed through, but unless a reduced emulator is a hard requirement for passthrough
PMUs, we'd still need the code to handle the emulated accesses.  And even if a
reduced emulator were a hard requirement, I'd still push for a WARN-and-continue
approach, not a "inject a bogus #GP because KVM screwed up" approach.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-12  2:19   ` Zhang, Xiong Y
@ 2024-04-12 18:32     ` Sean Christopherson
  2024-04-15  1:06       ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-12 18:32 UTC (permalink / raw)
  To: Xiong Y Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
> >> 2. NMI watchdog
> >>    the perf event for NMI watchdog is a system wide cpu pinned event, it
> >>    will be stopped also during vm running, but it doesn't have
> >>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> >>    watchdog loses function during VM running.
> >>
> >>    Two candidates exist for replacing perf event of NMI watchdog:
> >>    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
> >>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
> > 
> > I think the simplest solution is to allow mediated PMU usage if and only if
> > the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> > problem to solve.
> Make sense. KVM should not affect host high priority work.
> NMI watchdog is a client of perf and is a system wide perf event, perf can't
> distinguish a system wide perf event is NMI watchdog or others, so how about
> we extend this suggestion to all the system wide perf events ?  mediated PMU
> is only allowed when all system wide perf events are disabled or non-exist at
> vm creation.

What other kernel-driven system wide perf events are there?

> but NMI watchdog is usually enabled, this will limit mediated PMU usage.

I don't think it is at all unreasonable to require users that want optimal PMU
virtualization to adjust their environment.  And we can and should document the
tradeoffs and alternatives, e.g. so that users that want better PMU results don't
need to re-discover all the "gotchas" on their own.

This would even be one of the rare times where I would be ok with a dmesg log.
E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
perf events, pr_warn() to explain the conflict and direct the user at documentation
explaining how to make their system compatible with mediate PMU usage.

> >> 3. Dedicated kvm_pmi_vector
> >>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> >>    PMI into guest when physical PMI belongs to guest counter. If the
> >>    same mechanism is used in passthrough vPMU and PMI skid exists
> >>    which cause physical PMI belonging to guest happens after VM-exit,
> >>    then the host PMI handler couldn't identify this PMI belongs to
> >>    host or guest.
> >>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> >>    has this vector only. The PMI belonging to host still has an NMI
> >>    vector.
> >>
> >>    Without considering PMI skid especially for AMD, the host NMI vector
> >>    could be used for guest PMI also, this method is simpler and doesn't
> > 
> > I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> > the complexity is a wash, just in different locations, and I highly doubt it's
> > a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> > LVTPC.
> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
> NMI between guest and host, we could extend guest PT's PMI framework to
> mediated PMU. so I think this is simpler.

Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
PT framework of multiplexing NMI?

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-11 19:53     ` Liang, Kan
@ 2024-04-12 19:17       ` Sean Christopherson
  2024-04-12 20:56         ` Liang, Kan
  2024-04-26  4:09       ` Zhang, Xiong Y
  1 sibling, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-12 19:17 UTC (permalink / raw)
  To: Kan Liang
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Thu, Apr 11, 2024, Kan Liang wrote:
> >> +/*
> >> + * When a guest enters, force all active events of the PMU, which supports
> >> + * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
> >> + * PMUs, such as uncore PMU, should not be impacted. The guest can
> >> + * temporarily own all counters of the PMU.
> >> + * During the period, all the creation of the new event of the PMU with
> >> + * !exclude_guest are error out.
> >> + */
> >> +void perf_guest_enter(void)
> >> +{
> >> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> >> +
> >> +	lockdep_assert_irqs_disabled();
> >> +
> >> +	if (__this_cpu_read(__perf_force_exclude_guest))
> > 
> > This should be a WARN_ON_ONCE, no?
> 
> To debug the improper behavior of KVM?

Not so much "debug" as ensure that the platform owner notices that KVM is buggy.

> >> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
> >> +						 int cpu, struct task_struct *task)
> >> +{
> >> +	bool *force_exclude_guest = NULL;
> >> +
> >> +	if (!has_vpmu_passthrough_cap(event->pmu))
> >> +		return 0;
> >> +
> >> +	if (event->attr.exclude_guest)
> >> +		return 0;
> >> +
> >> +	if (cpu != -1) {
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
> >> +	} else if (task && (task->flags & PF_VCPU)) {
> >> +		/*
> >> +		 * Just need to check the running CPU in the event creation. If the
> >> +		 * task is moved to another CPU which supports the force_exclude_guest.
> >> +		 * The event will filtered out and be moved to the error stage. See
> >> +		 * merge_sched_in().
> >> +		 */
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
> >> +	}
> > 
> > These checks are extremely racy, I don't see how this can possibly do the
> > right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
> > to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
> > include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
> > time slices).
> >
> 
> This is to reject an !exclude_guest event creation for a running
> "passthrough" guest from host perf tool.
> Could you please suggest a way to detect it via the struct task_struct?
> 
> > Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
> > perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
> > per-CPU context mutex.
> 
> Do you mean that the perf_guest_enter() check could be happened right
> after the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
> 
> The perf_force_exclude_guest_check() is to make sure most of the cases
> can be rejected at the creation place. For the corner cases, they will
> be rejected in the schedule stage.

Ah, the "rejected in the schedule stage" is what I'm missing.  But that creates
a gross ABI, because IIUC, event creation will "randomly" succeed based on whether
or not a CPU happens to be running in a KVM guest.  I.e. it's not just the kernel
code that has races, the entire event creation is one big race.

What if perf had a global knob to enable/disable mediated PMU support?  Then when
KVM is loaded with enable_mediated_pmu=true, call into perf to (a) check that there
are no existing !exclude_guest events (this part could be optional), and (b) set
the global knob to reject all new !exclude_guest events (for the core PMU?).

Hmm, or probably better, do it at VM creation.  That has the advantage of playing
nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
want to run VMs.

E.g. (very roughly)

int x86_perf_get_mediated_pmu(void)
{
	if (refcount_inc_not_zero(...))
		return 0;

	if (<system wide events>)
		return -EBUSY;

	<slow path with locking>
}

void x86_perf_put_mediated_pmu(void)
{
	if (!refcount_dec_and_test(...))
		return;

	<slow path with locking>
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bbf312cbd73..f2994377ef44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12467,6 +12467,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
        if (type)
                return -EINVAL;
 
+       if (enable_mediated_pmu) {
+               ret = x86_perf_get_mediated_pmu();
+               if (ret)
+                       return ret;
+       }
+
        ret = kvm_page_track_init(kvm);
        if (ret)
                goto out;
@@ -12518,6 +12524,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
        kvm_mmu_uninit_vm(kvm);
        kvm_page_track_cleanup(kvm);
 out:
+       x86_perf_put_mediated_pmu();
        return ret;
 }
 
@@ -12659,6 +12666,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
        kvm_page_track_cleanup(kvm);
        kvm_xen_destroy_vm(kvm);
        kvm_hv_destroy_vm(kvm);
+       x86_perf_put_mediated_pmu();
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
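
Purely as an illustrative sketch of the perf side (all names below are
hypothetical; the count of existing system-wide !exclude_guest events would
have to be maintained in perf's event creation/destruction paths):

	static DEFINE_MUTEX(mediated_pmu_lock);
	static int mediated_pmu_vm_count;

	int x86_perf_get_mediated_pmu(void)
	{
		int ret = 0;

		mutex_lock(&mediated_pmu_lock);
		/* nr_syswide_include_guest_events is hypothetical bookkeeping. */
		if (!mediated_pmu_vm_count && atomic_read(&nr_syswide_include_guest_events))
			ret = -EBUSY;
		else
			mediated_pmu_vm_count++;
		mutex_unlock(&mediated_pmu_lock);

		return ret;
	}

	void x86_perf_put_mediated_pmu(void)
	{
		mutex_lock(&mediated_pmu_lock);
		if (!WARN_ON_ONCE(!mediated_pmu_vm_count))
			mediated_pmu_vm_count--;
		mutex_unlock(&mediated_pmu_lock);
	}

While mediated_pmu_vm_count is non-zero, perf would reject creation of new
!exclude_guest events for the core PMU with -EBUSY.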


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-12 19:17       ` Sean Christopherson
@ 2024-04-12 20:56         ` Liang, Kan
  2024-04-15 16:03           ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-12 20:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-12 3:17 p.m., Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Kan Liang wrote:
>>>> +/*
>>>> + * When a guest enters, force all active events of the PMU, which supports
>>>> + * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
>>>> + * PMUs, such as uncore PMU, should not be impacted. The guest can
>>>> + * temporarily own all counters of the PMU.
>>>> + * During the period, all the creation of the new event of the PMU with
>>>> + * !exclude_guest are error out.
>>>> + */
>>>> +void perf_guest_enter(void)
>>>> +{
>>>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>> +
>>>> +	lockdep_assert_irqs_disabled();
>>>> +
>>>> +	if (__this_cpu_read(__perf_force_exclude_guest))
>>>
>>> This should be a WARN_ON_ONCE, no?
>>
>> To debug the improper behavior of KVM?
> 
> Not so much "debug" as ensure that the platform owner notices that KVM is buggy.
> 
>>>> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
>>>> +						 int cpu, struct task_struct *task)
>>>> +{
>>>> +	bool *force_exclude_guest = NULL;
>>>> +
>>>> +	if (!has_vpmu_passthrough_cap(event->pmu))
>>>> +		return 0;
>>>> +
>>>> +	if (event->attr.exclude_guest)
>>>> +		return 0;
>>>> +
>>>> +	if (cpu != -1) {
>>>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
>>>> +	} else if (task && (task->flags & PF_VCPU)) {
>>>> +		/*
>>>> +		 * Only the running CPU needs to be checked at event creation. If the
>>>> +		 * task is later moved to another CPU which enforces force_exclude_guest,
>>>> +		 * the event will be filtered out and moved to the error state. See
>>>> +		 * merge_sched_in().
>>>> +		 */
>>>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
>>>> +	}
>>>
>>> These checks are extremely racy, I don't see how this can possibly do the
>>> right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
>>> to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
>>> include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
>>> time slices).
>>>
>>
>> This is to reject the creation of an !exclude_guest event from the host
>> perf tool while a "passthrough" guest is running.
>> Could you please suggest a way to detect it via the struct task_struct?
>>
>>> Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
>>> perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
>>> per-CPU context mutex.
>>
>> Do you mean that the perf_guest_enter() check could happen right
>> after the perf_force_exclude_guest_check()?
>> It's possible. For this case, the event can still be created. It will be
>> treated as an existing event and handled in merge_sched_in(). It will
>> never be scheduled when a guest is running.
>>
>> The perf_force_exclude_guest_check() is to make sure most cases are
>> rejected at creation time. The remaining corner cases will be rejected
>> at the scheduling stage.
> 
> Ah, the "rejected in the schedule stage" is what I'm missing.  But that creates
> a gross ABI, because IIUC, event creation will "randomly" succeed based on whether
> or not a CPU happens to be running in a KVM guest.  I.e. it's not just the kernel
> code that has races, the entire event creation is one big race.
> 
> What if perf had a global knob to enable/disable mediated PMU support?  Then when
> KVM is loaded with enable_mediated_pmu=true, call into perf to (a) check that there
> are no existing !exclude_guest events (this part could be optional), and (b) set
> the global knob to reject all new !exclude_guest events (for the core PMU?).
> 
> Hmm, or probably better, do it at VM creation.  That has the advantage of playing
> nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
> KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
> want to run VMs.

I think it should be doable, and it may simplify the perf implementation.
(The check at the scheduling stage should not be necessary anymore.)

With this, something like the NMI watchdog would cause VM creation to fail.
The user should either disable the NMI watchdog or use a replacement.

Thanks,
Kan
> 
> E.g. (very roughly)
> 
> int x86_perf_get_mediated_pmu(void)
> {
> 	if (refcount_inc_not_zero(...))
> 		return 0;
> 
> 	if (<system wide events>)
> 		return -EBUSY;
> 
> 	<slow path with locking>
> }
> 
> void x86_perf_put_mediated_pmu(void)
> {
> 	if (!refcount_dec_and_test(...))
> 		return;
> 
> 	<slow path with locking>
> }
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1bbf312cbd73..f2994377ef44 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12467,6 +12467,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>         if (type)
>                 return -EINVAL;
>  
> +       if (enable_mediated_pmu) {
> +               ret = x86_perf_get_mediated_pmu();
> +               if (ret)
> +                       return ret;
> +       }
> +
>         ret = kvm_page_track_init(kvm);
>         if (ret)
>                 goto out;
> @@ -12518,6 +12524,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>         kvm_mmu_uninit_vm(kvm);
>         kvm_page_track_cleanup(kvm);
>  out:
> +       x86_perf_put_mediated_pmu();
>         return ret;
>  }
>  
> @@ -12659,6 +12666,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>         kvm_page_track_cleanup(kvm);
>         kvm_xen_destroy_vm(kvm);
>         kvm_hv_destroy_vm(kvm);
> +       x86_perf_put_mediated_pmu();
>  }
>  
>  static void memslot_rmap_free(struct kvm_memory_slot *slot)
> 
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling
  2024-04-12  3:56     ` Zhang, Xiong Y
@ 2024-04-13  1:17       ` Mi, Dapeng
  0 siblings, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  1:17 UTC (permalink / raw)
  To: Zhang, Xiong Y, Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, jmattson, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, Xiong Zhang


On 4/12/2024 11:56 AM, Zhang, Xiong Y wrote:
>
> On 4/12/2024 1:10 AM, Sean Christopherson wrote:
>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>> From: Xiong Zhang <xiong.y.zhang@intel.com>
>>>
>>> Create a new vector in the host IDT for PMI handling within a passthrough
>>> vPMU implementation. In addition, add a function to allow the registration
>>> of the handler and a function to switch the PMI handler.
>>>
>>> This is the preparation work to support KVM passthrough vPMU to handle its
>>> own PMIs without interference from PMI handler of the host PMU.
>>>
>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>>   arch/x86/include/asm/hardirq.h           |  1 +
>>>   arch/x86/include/asm/idtentry.h          |  1 +
>>>   arch/x86/include/asm/irq.h               |  1 +
>>>   arch/x86/include/asm/irq_vectors.h       |  2 +-
>>>   arch/x86/kernel/idt.c                    |  1 +
>>>   arch/x86/kernel/irq.c                    | 29 ++++++++++++++++++++++++
>>>   tools/arch/x86/include/asm/irq_vectors.h |  1 +
>>>   7 files changed, 35 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
>>> index 66837b8c67f1..c1e2c1a480bf 100644
>>> --- a/arch/x86/include/asm/hardirq.h
>>> +++ b/arch/x86/include/asm/hardirq.h
>>> @@ -19,6 +19,7 @@ typedef struct {
>>>   	unsigned int kvm_posted_intr_ipis;
>>>   	unsigned int kvm_posted_intr_wakeup_ipis;
>>>   	unsigned int kvm_posted_intr_nested_ipis;
>>> +	unsigned int kvm_vpmu_pmis;
>> Somewhat off topic, does anyone actually ever use these particular stats?  If the
>> desire is to track _all_ IRQs, why not have an array and bump the counts in common
>> code?
> it is used in arch_show_interrupts() for /proc/interrupts.

Yes, these interrupt stats are useful. For example, when we analyze VM-exit
performance overhead and the VM-exits are caused by external interrupts,
we usually need to look at these interrupt stats to check exactly which
interrupt causes the VM-exits.

>>>   #endif
>>>   	unsigned int x86_platform_ipis;	/* arch dependent */
>>>   	unsigned int apic_perf_irqs;
>>> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
>>> index 05fd175cec7d..d1b58366bc21 100644
>>> --- a/arch/x86/include/asm/idtentry.h
>>> +++ b/arch/x86/include/asm/idtentry.h
>>> @@ -675,6 +675,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
>>>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>>>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
>>>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
>>> +DECLARE_IDTENTRY_SYSVEC(KVM_VPMU_VECTOR,	        sysvec_kvm_vpmu_handler);
>> I vote for KVM_VIRTUAL_PMI_VECTOR.  I don't see any reason to abbreviate "virtual",
>> and the vector is for a Performance Monitoring Interrupt.
> yes, KVM_GUEST_PMI_VECTOR in your next reply is better.
>>>   #endif
>>>   
>>>   #if IS_ENABLED(CONFIG_HYPERV)
>>> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
>>> index 836c170d3087..ee268f42d04a 100644
>>> --- a/arch/x86/include/asm/irq.h
>>> +++ b/arch/x86/include/asm/irq.h
>>> @@ -31,6 +31,7 @@ extern void fixup_irqs(void);
>>>   
>>>   #ifdef CONFIG_HAVE_KVM
>>>   extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
>>> +extern void kvm_set_vpmu_handler(void (*handler)(void));
>> virtual_pmi_handler()
>>
>>>   #endif
>>>   
>>>   extern void (*x86_platform_ipi_callback)(void);
>>> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
>>> index 3a19904c2db6..120403572307 100644
>>> --- a/arch/x86/include/asm/irq_vectors.h
>>> +++ b/arch/x86/include/asm/irq_vectors.h
>>> @@ -77,7 +77,7 @@
>>>    */
>>>   #define IRQ_WORK_VECTOR			0xf6
>>>   
>>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>>> +#define KVM_VPMU_VECTOR			0xf5
>> This should be inside
>>
>> 	#ifdef CONFIG_HAVE_KVM
>>
>> no?
> yes, it should be guarded by #if IS_ENABLED(CONFIG_KVM)
>>>   #define DEFERRED_ERROR_VECTOR		0xf4
>>>   
>>>   /* Vector on which hypervisor callbacks will be delivered */
>>> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
>>> index 8857abc706e4..6944eec251f4 100644
>>> --- a/arch/x86/kernel/idt.c
>>> +++ b/arch/x86/kernel/idt.c
>>> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
>>>   	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>>>   	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>>>   	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
>>> +	INTG(KVM_VPMU_VECTOR,		        asm_sysvec_kvm_vpmu_handler),
>> kvm_virtual_pmi_handler
>>
>>> @@ -332,6 +351,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>>>   	apic_eoi();
>>>   	inc_irq_stat(kvm_posted_intr_nested_ipis);
>>>   }
>>> +
>>> +/*
>>> + * Handler for KVM_PT_PMU_VECTOR.
>> Heh, not sure where the PT part came from...
> I will change it to KVM_GUEST_PMI_VECTOR
>>> + */
>>> +DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_vpmu_handler)
>>> +{
>>> +	apic_eoi();
>>> +	inc_irq_stat(kvm_vpmu_pmis);
>>> +	kvm_vpmu_handler();
>>> +}
>>>   #endif
>>>   
>>>   
>>> diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
>>> index 3a19904c2db6..3773e60f1af8 100644
>>> --- a/tools/arch/x86/include/asm/irq_vectors.h
>>> +++ b/tools/arch/x86/include/asm/irq_vectors.h
>>> @@ -85,6 +85,7 @@
>>>   
>>>   /* Vector for KVM to deliver posted interrupt IPI */
>>>   #ifdef CONFIG_HAVE_KVM
>>> +#define KVM_VPMU_VECTOR			0xf5
>> Heh, and your copy+paste is out of date.
> Got it. 0xf5 isn't aligned with 0xf2, and the above comment should be moved prior to POSTED_INTR_VECTOR.
>
> thanks
>>>   #define POSTED_INTR_VECTOR		0xf2
>>>   #define POSTED_INTR_WAKEUP_VECTOR	0xf1
>>>   #define POSTED_INTR_NESTED_VECTOR	0xf0
>>> -- 
>>> 2.34.1
>>>
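
FWIW, the registration helper itself (not quoted above) presumably follows the
existing kvm_set_posted_intr_wakeup_handler() pattern in arch/x86/kernel/irq.c.
A rough sketch using the KVM_GUEST_PMI naming suggested in this thread; the
identifiers are illustrative, not the patch's actual code:

static void dummy_guest_pmi_handler(void)
{
}

static void (*kvm_guest_pmi_handler)(void) = dummy_guest_pmi_handler;

void kvm_set_guest_pmi_handler(void (*handler)(void))
{
	if (handler) {
		kvm_guest_pmi_handler = handler;
	} else {
		kvm_guest_pmi_handler = dummy_guest_pmi_handler;
		/* Make sure in-flight PMIs are done with the old handler. */
		synchronize_rcu();
	}
}
EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);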

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-04-11 22:30     ` Jim Mattson
  2024-04-11 23:27       ` Sean Christopherson
@ 2024-04-13  2:10       ` Mi, Dapeng
  1 sibling, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  2:10 UTC (permalink / raw)
  To: Jim Mattson, Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao, Xiong Zhang


On 4/12/2024 6:30 AM, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 2:21 PM Sean Christopherson <seanjc@google.com> wrote:
>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>> +     if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
>>> +             /*
>>> +              * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
>>> +              */
>>> +             if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
>>> +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
>>> +             else {
>>> +                     i = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest,
>>> +                                                    MSR_CORE_PERF_GLOBAL_CTRL);
>>> +                     if (i < 0) {
>>> +                             i = vmx->msr_autoload.guest.nr++;
>>> +                             vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT,
>>> +                                          vmx->msr_autoload.guest.nr);
>>> +                     }
>>> +                     vmx->msr_autoload.guest.val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
>>> +                     vmx->msr_autoload.guest.val[i].value = 0;
>> Eww, no.   Just make cpu_has_load_perf_global_ctrl() and VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL
>> hard requirements for enabling passthrough mode.  And then have clear_atomic_switch_msr()
>> yell if KVM tries to disable loading MSR_CORE_PERF_GLOBAL_CTRL.
> Weren't you just complaining about the PMU version 4 constraint in
> another patch? And here, you are saying, "Don't support anything older
> than Sapphire Rapids."
>
> Sapphire Rapids has PMU version 4, so if we require
> VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, PMU version 4 is irrelevant.

Just to clarify, Sapphire Rapids has PMU version 5 :).

[    2.687826] Performance Events: XSAVE Architectural LBR, PEBS 
fmt4+-baseline,  AnyThread deprecated, *Sapphire Rapids events*, 32-deep 
LBR, full-width counters, Intel PMU driver.
[    2.687925] ... version:                   5
[    2.687928] ... bit width:                 48
[    2.687929] ... generic counters:          8

>
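
For reference, the hard requirement Sean asks for could be enforced once at
hardware setup rather than per-vCPU, roughly like this (illustrative only;
"enable_passthrough_pmu" stands in for whatever the module parameter ends up
being called, and the exit-save control name follows Sean's comment above):

	if (enable_passthrough_pmu &&
	    (!cpu_has_load_perf_global_ctrl() ||
	     !(vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL))) {
		pr_warn_once("Passthrough PMU disabled: PERF_GLOBAL_CTRL VM-entry load / VM-exit save controls not supported\n");
		enable_passthrough_pmu = false;
	}

With that in place, the msr_autoload fallback above goes away entirely and
clear_atomic_switch_msr() can simply WARN if it ever sees
MSR_CORE_PERF_GLOBAL_CTRL, as Sean suggests.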

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-11 21:26   ` Sean Christopherson
@ 2024-04-13  2:29     ` Mi, Dapeng
  0 siblings, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  2:29 UTC (permalink / raw)
  To: Sean Christopherson, Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, jmattson, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao


On 4/12/2024 5:26 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>   static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>>   {
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct kvm_pmc *pmc;
>> +	u32 i;
>> +
>> +	if (pmu->version != 2) {
>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
>> +		return;
>> +	}
>> +
>> +	/* Global ctrl register is already saved at VM-exit. */
>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>> +	if (pmu->global_status)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>> +
>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> +		pmc = &pmu->gp_counters[i];
>> +		rdpmcl(i, pmc->counter);
>> +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>> +		/*
>> +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>> +		 * leakage and also avoid this guest GP counter get accidentally
>> +		 * enabled during host running when host enable global ctrl.
>> +		 */
>> +		if (pmc->eventsel)
>> +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>> +		if (pmc->counter)
>> +			wrmsrl(MSR_IA32_PMC0 + i, 0);
>> +	}
>> +
>> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>> +	/*
>> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>> +	 * also avoid these guest fixed counters get accidentially enabled
>> +	 * during host running when host enable global ctrl.
>> +	 */
>> +	if (pmu->fixed_ctr_ctrl)
>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> +		pmc = &pmu->fixed_counters[i];
>> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>> +		if (pmc->counter)
>> +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>> +	}
> For the next RFC, please make sure that it includes AMD support.  Mostly because I'm
> pretty sure all of this code can be in common x86.  The fixed counters are ugly,
> but pmu->nr_arch_fixed_counters is guaranteed to be '0' on AMD, so it's _just_ ugly,
> i.e. not functionally problematic.

Sure. I believe Mingwei will integrate the AMD support patches in the next
version. Yeah, I agree part of this code can be moved into common x86 PMU
code, but there is still some vendor-specific code, so we will keep a
vendor-specific callback.
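
For what it's worth, a rough sketch of what the common-x86 part could look
like. The function name and the vendor hook are made up for illustration;
only the rdpmcl() loops come from the patch:

void kvm_pmu_save_guest_context(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	int i;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
		rdpmcl(i, pmu->gp_counters[i].counter);

	/* nr_arch_fixed_counters is 0 on AMD, so this loop is Intel-only in practice. */
	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
		       pmu->fixed_counters[i].counter);

	/*
	 * Anything truly vendor-specific (e.g. clearing PERF_METRICS on
	 * Intel) stays behind a kvm_pmu_ops callback; the hook below is
	 * hypothetical.
	 */
	kvm_pmu_vendor_save_context(vcpu);
}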



^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-11 21:44   ` Sean Christopherson
  2024-04-11 22:19     ` Jim Mattson
@ 2024-04-13  3:03     ` Mi, Dapeng
  2024-04-13  3:34       ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  3:03 UTC (permalink / raw)
  To: Sean Christopherson, Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, jmattson, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao


On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
>> and gains the full HW PMU ownership. On the contrary, host regains the
>> ownership of PMU HW from KVM when control flow leaves the scope of
>> passthrough PMU.
>>
>> Implement PMU context switches for Intel CPUs and opptunistically use
>> rdpmcl() instead of rdmsrl() when reading counters since the former has
>> lower latency in Intel CPUs.
>>
>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>   arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 73 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 0d58fe7d243e..f79bebe7093d 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>>   
>>   static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
Yeah. It looks clearer.
>
>>   {
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct kvm_pmc *pmc;
>> +	u32 i;
>> +
>> +	if (pmu->version != 2) {
>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
>> +		return;
>> +	}
>> +
>> +	/* Global ctrl register is already saved at VM-exit. */
>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>> +	if (pmu->global_status)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>> +
>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> +		pmc = &pmu->gp_counters[i];
>> +		rdpmcl(i, pmc->counter);
>> +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>> +		/*
>> +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>> +		 * leakage and also avoid this guest GP counter get accidentally
>> +		 * enabled during host running when host enable global ctrl.
>> +		 */
>> +		if (pmc->eventsel)
>> +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>> +		if (pmc->counter)
>> +			wrmsrl(MSR_IA32_PMC0 + i, 0);
> This doesn't make much sense.  The kernel already has full access to the guest,
> I don't see what is gained by zeroing out the MSRs just to hide them from perf.

It's necessary to clear the EVENTSELx MSRs for both GP and fixed
counters. Consider this case: the guest uses GP counter 2, but the host
doesn't. If the EVENTSEL2 MSR is not cleared here, GP counter 2 would be
enabled unexpectedly on the host later, since host perf always enables
all valid bits in the PERF_GLOBAL_CTRL MSR. That would cause issues.

Yeah, the clearing of the PMCx MSRs should be unnecessary.


>
> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> the event selector, we gots problems.
>
> Same thing for the fixed counters below.  Can't this just be?
>
> 	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> 		rdpmcl(i, pmu->gp_counters[i].counter);
>
> 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> 		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> 		       pmu->fixed_counters[i].counter);
>
>> +	}
>> +
>> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>> +	/*
>> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>> +	 * also avoid these guest fixed counters get accidentially enabled
>> +	 * during host running when host enable global ctrl.
>> +	 */
>> +	if (pmu->fixed_ctr_ctrl)
>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> +		pmc = &pmu->fixed_counters[i];
>> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>> +		if (pmc->counter)
>> +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>> +	}
>>   }
>>   
>>   static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>>   {
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct kvm_pmc *pmc;
>> +	u64 global_status;
>> +	int i;
>> +
>> +	if (pmu->version != 2) {
>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
>> +		return;
>> +	}
>> +
>> +	/* Clear host global_ctrl and global_status MSR if non-zero. */
>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?

As the previous comments say, host perf enables all counters in
PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here
is to ensure all counters are in the disabled state, so that the later
counter manipulation (MSR writes) won't cause any race condition or
unexpected behavior on the HW.


>
>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>> +	if (global_status)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> This seems especially silly, isn't the full MSR being written below?  Or am I
> misunderstanding how these things work?

I think Jim's comment has already explained why we need to do this.

>
>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>> +
>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> +		pmc = &pmu->gp_counters[i];
>> +		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
>> +		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>> +	}
>> +
>> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> +		pmc = &pmu->fixed_counters[i];
>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>> +	}
>>   }
>>   
>>   struct kvm_pmu_ops intel_pmu_ops __initdata = {
>> -- 
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-11 23:31       ` Sean Christopherson
@ 2024-04-13  3:19         ` Mi, Dapeng
  0 siblings, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  3:19 UTC (permalink / raw)
  To: Sean Christopherson, Jim Mattson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao


On 4/12/2024 7:31 AM, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Jim Mattson wrote:
>> On Thu, Apr 11, 2024 at 2:44 PM Sean Christopherson <seanjc@google.com> wrote:
>>>> +     /* Clear host global_ctrl and global_status MSR if non-zero. */
>>>> +     wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>>>
>>>> +     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>>>> +     if (global_status)
>>>> +             wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>>> This seems especially silly, isn't the full MSR being written below?  Or am I
>>> misunderstanding how these things work?
>> LOL! You expect CPU design to follow basic logic?!?
>>
>> Writing a 1 to a bit in IA32_PERF_GLOBAL_STATUS_SET sets the
>> corresponding bit in IA32_PERF_GLOBAL_STATUS to 1.
>>
>> Writing a 0 to a bit in to IA32_PERF_GLOBAL_STATUS_SET is a nop.
>>
>> To clear a bit in IA32_PERF_GLOBAL_STATUS, you need to write a 1 to
>> the corresponding bit in IA32_PERF_GLOBAL_STATUS_RESET (aka
>> IA32_PERF_GLOBAL_OVF_CTRL).
> If only C had a way to annotate what the code is doing. :-)
>
>>>> +     wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> IIUC, that means this should be:
>
> 	if (pmu->global_status)
> 		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>
> or even better:
>
> 	toggle = pmu->global_status ^ global_status;
> 	if (global_status & toggle)
> 		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
> 	if (pmu->global_status & toggle)
> 		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
Thanks, that looks better. Since PMU v4, the MSR
CORE_PERF_GLOBAL_OVF_CTRL has been renamed to CORE_PERF_GLOBAL_STATUS_RESET
and supports more bits. CORE_PERF_GLOBAL_STATUS_RESET is easier to
understand from its name alone than CORE_PERF_GLOBAL_OVF_CTRL, so I would
prefer to use that name in the next version.
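
For the archives, here is Sean's toggle-based restore with the MSR semantics
Jim described spelled out as comments (global_status is the value just read
from the hardware MSR_CORE_PERF_GLOBAL_STATUS; OVF_CTRL is the pre-v4 name of
GLOBAL_STATUS_RESET):

	u64 toggle = pmu->global_status ^ global_status;

	/*
	 * Bits set in hardware but clear in the guest's snapshot must be
	 * cleared: writing 1s to GLOBAL_OVF_CTRL/GLOBAL_STATUS_RESET clears
	 * the corresponding GLOBAL_STATUS bits.
	 */
	if (global_status & toggle)
		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);

	/*
	 * Bits set in the guest's snapshot but clear in hardware must be
	 * set: writing 1s to GLOBAL_STATUS_SET sets the corresponding
	 * GLOBAL_STATUS bits, and writing 0s is a nop.
	 */
	if (pmu->global_status & toggle)
		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);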

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest
  2024-04-11 21:50   ` Sean Christopherson
@ 2024-04-13  3:29     ` Mi, Dapeng
  0 siblings, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  3:29 UTC (permalink / raw)
  To: Sean Christopherson, Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, jmattson, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao


On 4/12/2024 5:50 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Since perf topdown metrics feature is not supported yet, clear
>> PERF_METRICS MSR for guest.
> Please rewrite with --verbose, I have no idea what MSR_PERF_METRICS is, and thus no
> clue why it needs to be zeroed when loading guest context, e.g. it's not passed
> through, so why does it matter?

Sure. MSR_PERF_METRICS reports the topdown metrics values. I will add
more details to the commit message.


>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>   arch/x86/kvm/vmx/pmu_intel.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 4b4da7f17895..ad0434646a29 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -916,6 +916,10 @@ static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>>   	 */
>>   	for (i = pmu->nr_arch_fixed_counters; i < kvm_pmu_cap.num_counters_fixed; i++)
>>   		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>> +
>> +	/* Clear PERF_METRICS MSR since guest topdown metrics is not supported yet. */
>> +	if (kvm_caps.host_perf_cap & PMU_CAP_PERF_METRICS)
>> +		wrmsrl(MSR_PERF_METRICS, 0);
>>   }
>>   
>>   struct kvm_pmu_ops intel_pmu_ops __initdata = {
>> -- 
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-13  3:03     ` Mi, Dapeng
@ 2024-04-13  3:34       ` Mingwei Zhang
  2024-04-13  4:25         ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-13  3:34 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Sat, Apr 13, 2024, Mi, Dapeng wrote:
> 
> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > > 
> > > Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> > > passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> > > the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> > > and gains the full HW PMU ownership. On the contrary, host regains the
> > > ownership of PMU HW from KVM when control flow leaves the scope of
> > > passthrough PMU.
> > > 
> > > Implement PMU context switches for Intel CPUs and opptunistically use
> > > rdpmcl() instead of rdmsrl() when reading counters since the former has
> > > lower latency in Intel CPUs.
> > > 
> > > Co-developed-by: Mingwei Zhang <mizhang@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > > ---
> > >   arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 73 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > > index 0d58fe7d243e..f79bebe7093d 100644
> > > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > > @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> > >   static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> > I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> Yeah. It looks clearer.
> > 
> > >   {
> > > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > +	struct kvm_pmc *pmc;
> > > +	u32 i;
> > > +
> > > +	if (pmu->version != 2) {
> > > +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > > +		return;
> > > +	}
> > > +
> > > +	/* Global ctrl register is already saved at VM-exit. */
> > > +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> > > +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> > > +	if (pmu->global_status)
> > > +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> > > +
> > > +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > > +		pmc = &pmu->gp_counters[i];
> > > +		rdpmcl(i, pmc->counter);
> > > +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> > > +		/*
> > > +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> > > +		 * leakage and also avoid this guest GP counter get accidentally
> > > +		 * enabled during host running when host enable global ctrl.
> > > +		 */
> > > +		if (pmc->eventsel)
> > > +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> > > +		if (pmc->counter)
> > > +			wrmsrl(MSR_IA32_PMC0 + i, 0);
> > This doesn't make much sense.  The kernel already has full access to the guest,
> > I don't see what is gained by zeroing out the MSRs just to hide them from perf.
> 
> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> unexpectedly on host later since Host perf always enable all validate bits
> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> 
> Yeah,  the clearing for PMCx MSR should be unnecessary .
> 

Why is clearing the PMCx MSRs unnecessary? Do we want to leak counter
values to the host? NO. Not in cloud usage.

Please make changes to this patch with **extreme** caution.

According to our past experience, if there is a bug somewhere,
there is usually a catch here.

Thanks.
-Mingwei
> 
> > 
> > Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> > the event selector, we gots problems.
> > 
> > Same thing for the fixed counters below.  Can't this just be?
> > 
> > 	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> > 		rdpmcl(i, pmu->gp_counters[i].counter);
> > 
> > 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> > 		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> > 		       pmu->fixed_counters[i].counter);
> > 
> > > +	}
> > > +
> > > +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > > +	/*
> > > +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> > > +	 * also avoid these guest fixed counters get accidentially enabled
> > > +	 * during host running when host enable global ctrl.
> > > +	 */
> > > +	if (pmu->fixed_ctr_ctrl)
> > > +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> > > +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > > +		pmc = &pmu->fixed_counters[i];
> > > +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> > > +		if (pmc->counter)
> > > +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> > > +	}
> > >   }
> > >   static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> > >   {
> > > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > +	struct kvm_pmc *pmc;
> > > +	u64 global_status;
> > > +	int i;
> > > +
> > > +	if (pmu->version != 2) {
> > > +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > > +		return;
> > > +	}
> > > +
> > > +	/* Clear host global_ctrl and global_status MSR if non-zero. */
> > > +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> > Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> 
> As previous comments say, host perf always enable all counters in
> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> ensure all counters in disabled state and the later counter manipulation
> (writing MSR) won't cause any race condition or unexpected behavior on HW.
> 
> 
> > 
> > > +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > > +	if (global_status)
> > > +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> > This seems especially silly, isn't the full MSR being written below?  Or am I
> > misunderstanding how these things work?
> 
> I think Jim's comment has already explain why we need to do this.
> 
> > 
> > > +	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> > > +
> > > +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > > +		pmc = &pmu->gp_counters[i];
> > > +		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> > > +		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> > > +	}
> > > +
> > > +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > > +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > > +		pmc = &pmu->fixed_counters[i];
> > > +		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> > > +	}
> > >   }
> > >   struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > > -- 
> > > 2.34.1
> > > 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed
  2024-04-11 22:03   ` Sean Christopherson
@ 2024-04-13  4:12     ` Mi, Dapeng
  0 siblings, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  4:12 UTC (permalink / raw)
  To: Sean Christopherson, Xiong Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, jmattson, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao


On 4/12/2024 6:03 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Mingwei Zhang <mizhang@google.com>
>>
>> Allow writing to fixed counter selector if counter is exposed. If this
>> fixed counter is filtered out, this counter won't be enabled on HW.
>>
>> Passthrough PMU implements the context switch at VM Enter/Exit boundary the
>> guest value cannot be directly written to HW since the HW PMU is owned by
>> the host. Introduce a new field fixed_ctr_ctrl_hw in kvm_pmu to cache the
>> guest value.  which will be assigne to HW at PMU context restore.
>>
>> Since passthrough PMU intercept writes to fixed counter selector, there is
>> no need to read the value at pmu context save, but still clear the fix
>> counter ctrl MSR and counters when switching out to host PMU.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |  1 +
>>   arch/x86/kvm/vmx/pmu_intel.c    | 28 ++++++++++++++++++++++++----
>>   2 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index fd1c69371dbf..b02688ed74f7 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -527,6 +527,7 @@ struct kvm_pmu {
>>   	unsigned nr_arch_fixed_counters;
>>   	unsigned available_event_types;
>>   	u64 fixed_ctr_ctrl;
>> +	u64 fixed_ctr_ctrl_hw;
>>   	u64 fixed_ctr_ctrl_mask;
> Before introduce more fields, can someone please send a patch/series to rename
> the _mask fields?  AFAIK, they all should be e.g. fixed_ctr_ctrl_rsvd, or something
> to that effect.

Yeah, I remember saying I would cook up a patch to rename all these _mask
fields. I will do it now.


>
> Because I think we should avoid reinventing the naming wheel, and use "shadow"
> instead of "hw", because KVM developers already know what "shadow" means.  But
> "mask" also has very specific meaning for shadowed fields.  That, and "mask" is
> a freaking awful name in the first place.
>
>>   	u64 global_ctrl;
>>   	u64 global_status;
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 713c2a7c7f07..93cfb86c1292 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -68,6 +68,25 @@ static int fixed_pmc_events[] = {
>>   	[2] = PSEUDO_ARCH_REFERENCE_CYCLES,
>>   };
>>   
>> +static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64 data)
> We need to come up with shorter names, this ain't Java.  :-)  Heh, that can be
> another argument for "mediated", it saves three characters.
>
> And somewhat related, kernel style is <scope>_<blah>, i.e.
>
> static void mediated_pmu_reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
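
As a strawman for the next version, the flow the commit message describes
might look roughly like this (the "exposed" helper is hypothetical; only the
fixed_ctr_ctrl/fixed_ctr_ctrl_hw split comes from the patch):

static void mediated_pmu_reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
{
	u64 hw_ctrl = 0;
	int i;

	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
		u64 bits = data & (0xfull << (i * 4));

		/* Filtered-out fixed counters never reach hardware. */
		if (fixed_counter_is_exposed(pmu, i))	/* hypothetical helper */
			hw_ctrl |= bits;
	}

	pmu->fixed_ctr_ctrl = data;		/* what the guest reads back */
	pmu->fixed_ctr_ctrl_hw = hw_ctrl;	/* loaded into HW at context restore */
}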

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-13  3:34       ` Mingwei Zhang
@ 2024-04-13  4:25         ` Mi, Dapeng
  2024-04-15  6:06           ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-13  4:25 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> On Sat, Apr 13, 2024, Mi, Dapeng wrote:
>> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>
>>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
>>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
>>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
>>>> and gains the full HW PMU ownership. On the contrary, host regains the
>>>> ownership of PMU HW from KVM when control flow leaves the scope of
>>>> passthrough PMU.
>>>>
>>>> Implement PMU context switches for Intel CPUs and opptunistically use
>>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
>>>> lower latency in Intel CPUs.
>>>>
>>>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> ---
>>>>    arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index 0d58fe7d243e..f79bebe7093d 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>>>>    static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
>> Yeah. It looks clearer.
>>>>    {
>>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>> +	struct kvm_pmc *pmc;
>>>> +	u32 i;
>>>> +
>>>> +	if (pmu->version != 2) {
>>>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/* Global ctrl register is already saved at VM-exit. */
>>>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>>>> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>>>> +	if (pmu->global_status)
>>>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>>>> +
>>>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>> +		pmc = &pmu->gp_counters[i];
>>>> +		rdpmcl(i, pmc->counter);
>>>> +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>>>> +		/*
>>>> +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>>>> +		 * leakage and also avoid this guest GP counter get accidentally
>>>> +		 * enabled during host running when host enable global ctrl.
>>>> +		 */
>>>> +		if (pmc->eventsel)
>>>> +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>>>> +		if (pmc->counter)
>>>> +			wrmsrl(MSR_IA32_PMC0 + i, 0);
>>> This doesn't make much sense.  The kernel already has full access to the guest,
>>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
>> unexpectedly on host later since Host perf always enable all validate bits
>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>>
>> Yeah,  the clearing for PMCx MSR should be unnecessary .
>>
> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> values to the host? NO. Not in cloud usage.

No, this code clears the guest counter values, not the host counter
values. The host always has a way to see guest values for a normal VM if
it wants to. I don't see the necessity; it's overkill and introduces
extra overhead from writing the MSRs.


>
> Please make changes to this patch with **extreme** caution.
>
> According to our past experience, if there is a bug somewhere,
> there is a catch here (normally).
>
> Thanks.
> -Mingwei
>>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
>>> the event selector, we gots problems.
>>>
>>> Same thing for the fixed counters below.  Can't this just be?
>>>
>>> 	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>>> 		rdpmcl(i, pmu->gp_counters[i].counter);
>>>
>>> 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>> 		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
>>> 		       pmu->fixed_counters[i].counter);
>>>
>>>> +	}
>>>> +
>>>> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>> +	/*
>>>> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>>>> +	 * also avoid these guest fixed counters get accidentially enabled
>>>> +	 * during host running when host enable global ctrl.
>>>> +	 */
>>>> +	if (pmu->fixed_ctr_ctrl)
>>>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> +		pmc = &pmu->fixed_counters[i];
>>>> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>>>> +		if (pmc->counter)
>>>> +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>>>> +	}
>>>>    }
>>>>    static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>>>>    {
>>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>> +	struct kvm_pmc *pmc;
>>>> +	u64 global_status;
>>>> +	int i;
>>>> +
>>>> +	if (pmu->version != 2) {
>>>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/* Clear host global_ctrl and global_status MSR if non-zero. */
>>>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>> As previous comments say, host perf always enable all counters in
>> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
>> ensure all counters in disabled state and the later counter manipulation
>> (writing MSR) won't cause any race condition or unexpected behavior on HW.
>>
>>
>>>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>>>> +	if (global_status)
>>>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>>> This seems especially silly, isn't the full MSR being written below?  Or am I
>>> misunderstanding how these things work?
>> I think Jim's comment has already explain why we need to do this.
>>
>>>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>>>> +
>>>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>> +		pmc = &pmu->gp_counters[i];
>>>> +		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
>>>> +		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>>>> +	}
>>>> +
>>>> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> +		pmc = &pmu->fixed_counters[i];
>>>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>>>> +	}
>>>>    }
>>>>    struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>>> -- 
>>>> 2.34.1
>>>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-12 18:32     ` Sean Christopherson
@ 2024-04-15  1:06       ` Zhang, Xiong Y
  2024-04-15 15:05         ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-15  1:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 4/13/2024 2:32 AM, Sean Christopherson wrote:
> On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
>>>> 2. NMI watchdog
>>>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
>>>>    will be stopped also during vm running, but it doesn't have
>>>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
>>>>    watchdog loses function during VM running.
>>>>
>>>>    Two candidates exist for replacing perf event of NMI watchdog:
>>>>    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
>>>>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
>>>
>>> I think the simplest solution is to allow mediated PMU usage if and only if
>>> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
>>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
>>> problem to solve.
>> Makes sense. KVM should not affect high-priority host work.
>> The NMI watchdog is a client of perf and is a system-wide perf event, and
>> perf can't distinguish whether a system-wide perf event is the NMI watchdog
>> or something else, so how about we extend this suggestion to all system-wide
>> perf events?  Mediated PMU is only allowed when all system-wide perf events
>> are disabled or absent at VM creation.
> 
> What other kernel-driven system wide perf events are there?
Does "kernel-driven" mean perf events created through perf_event_create_kernel_counter(), like the nmi_watchdog and KVM perf events?
Users can also create system-wide perf events through "perf record -e {} -a"; I call those user-driven system-wide perf events.
The perf subsystem doesn't distinguish between "kernel-driven" and "user-driven" system-wide perf events.
> 
>> But the NMI watchdog is usually enabled, so this will limit mediated PMU usage.
> 
> I don't think it is at all unreasonable to require users that want optimal PMU
> virtualization to adjust their environment.  And we can and should document the
> tradeoffs and alternatives, e.g. so that users that want better PMU results don't
> need to re-discover all the "gotchas" on their own.
> 
> This would even be one of the rare times where I would be ok with a dmesg log.
> E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
> perf events, pr_warn() to explain the conflict and direct the user at documentation
> explaining how to make their system compatible with mediated PMU usage.
> 
>>>> 3. Dedicated kvm_pmi_vector
>>>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
>>>>    PMI into guest when physical PMI belongs to guest counter. If the
>>>>    same mechanism is used in passthrough vPMU and PMI skid exists
>>>>    which cause physical PMI belonging to guest happens after VM-exit,
>>>>    then the host PMI handler couldn't identify this PMI belongs to
>>>>    host or guest.
>>>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>>>>    has this vector only. The PMI belonging to host still has an NMI
>>>>    vector.
>>>>
>>>>    Without considering PMI skid especially for AMD, the host NMI vector
>>>>    could be used for guest PMI also, this method is simpler and doesn't
>>>
>>> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
>>> the complexity is a wash, just in different locations, and I highly doubt it's
>>> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
>>> LVTPC.
>> When kvm_intel.pt_mode=PT_MODE_HOST_GUEST, the guest PT PMI is an NMI
>> multiplexed between guest and host; we could extend the guest PT PMI
>> framework to the mediated PMU, so I think this is simpler.
> 
> Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
> PT framework of multiplexing NMI?
Here, "this" means "extending the PT framework of multiplexing the NMI".

thanks
> 
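
Something along these lines at VM creation would cover the dmesg suggestion
above (illustrative only; enable_mediated_pmu and x86_perf_get_mediated_pmu()
are the names from Sean's earlier sketch, not existing code):

	if (enable_mediated_pmu) {
		ret = x86_perf_get_mediated_pmu();
		if (ret) {
			/* Tell the user why the mediated vPMU can't be used. */
			pr_warn_ratelimited("Mediated vPMU unavailable: system-wide perf events (e.g. the NMI watchdog) are active; see the KVM documentation for how to disable them\n");
			return ret;
		}
	}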

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-13  4:25         ` Mi, Dapeng
@ 2024-04-15  6:06           ` Mingwei Zhang
  2024-04-15 10:04             ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-15  6:06 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> > On Sat, Apr 13, 2024, Mi, Dapeng wrote:
> >> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> >>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >>>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>
> >>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> >>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> >>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> >>>> and gains the full HW PMU ownership. On the contrary, host regains the
> >>>> ownership of PMU HW from KVM when control flow leaves the scope of
> >>>> passthrough PMU.
> >>>>
> >>>> Implement PMU context switches for Intel CPUs and opptunistically use
> >>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
> >>>> lower latency in Intel CPUs.
> >>>>
> >>>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> >>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> ---
> >>>>    arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 73 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> index 0d58fe7d243e..f79bebe7093d 100644
> >>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
> >>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >>>>    static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> >>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> >> Yeah. It looks clearer.
> >>>>    {
> >>>> +  struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>> +  struct kvm_pmc *pmc;
> >>>> +  u32 i;
> >>>> +
> >>>> +  if (pmu->version != 2) {
> >>>> +          pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>> +          return;
> >>>> +  }
> >>>> +
> >>>> +  /* Global ctrl register is already saved at VM-exit. */
> >>>> +  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> >>>> +  /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> >>>> +  if (pmu->global_status)
> >>>> +          wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> >>>> +
> >>>> +  for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>> +          pmc = &pmu->gp_counters[i];
> >>>> +          rdpmcl(i, pmc->counter);
> >>>> +          rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> >>>> +          /*
> >>>> +           * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> >>>> +           * leakage and also avoid this guest GP counter get accidentally
> >>>> +           * enabled during host running when host enable global ctrl.
> >>>> +           */
> >>>> +          if (pmc->eventsel)
> >>>> +                  wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> >>>> +          if (pmc->counter)
> >>>> +                  wrmsrl(MSR_IA32_PMC0 + i, 0);
> >>> This doesn't make much sense.  The kernel already has full access to the guest,
> >>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
> >> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> >> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> >> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> >> unexpectedly on host later since Host perf always enable all validate bits
> >> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> >>
> >> Yeah,  the clearing for PMCx MSR should be unnecessary .
> >>
> > Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > values to the host? NO. Not in cloud usage.
>
> No, this place is clearing the guest counter value instead of host
> counter value. Host always has method to see guest value in a normal VM
> if he want. I don't see its necessity, it's just a overkill and
> introduce extra overhead to write MSRs.
>

I am curious how the perf subsystem solves this problem. Does the perf
subsystem on the host only scrub the selector, but not the counter
value, when doing a context switch?

>
> >
> > Please make changes to this patch with **extreme** caution.
> >
> > According to our past experience, if there is a bug somewhere,
> > there is a catch here (normally).
> >
> > Thanks.
> > -Mingwei
> >>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> >>> the event selector, we gots problems.
> >>>
> >>> Same thing for the fixed counters below.  Can't this just be?
> >>>
> >>>     for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> >>>             rdpmcl(i, pmu->gp_counters[i].counter);
> >>>
> >>>     for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> >>>             rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> >>>                    pmu->fixed_counters[i].counter);
> >>>
> >>>> +  }
> >>>> +
> >>>> +  rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>> +  /*
> >>>> +   * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> >>>> +   * also avoid these guest fixed counters get accidentially enabled
> >>>> +   * during host running when host enable global ctrl.
> >>>> +   */
> >>>> +  if (pmu->fixed_ctr_ctrl)
> >>>> +          wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> >>>> +  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>> +          pmc = &pmu->fixed_counters[i];
> >>>> +          rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> >>>> +          if (pmc->counter)
> >>>> +                  wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> >>>> +  }
> >>>>    }
> >>>>    static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >>>>    {
> >>>> +  struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>> +  struct kvm_pmc *pmc;
> >>>> +  u64 global_status;
> >>>> +  int i;
> >>>> +
> >>>> +  if (pmu->version != 2) {
> >>>> +          pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>> +          return;
> >>>> +  }
> >>>> +
> >>>> +  /* Clear host global_ctrl and global_status MSR if non-zero. */
> >>>> +  wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >>> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >> As previous comments say, host perf always enable all counters in
> >> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> >> ensure all counters in disabled state and the later counter manipulation
> >> (writing MSR) won't cause any race condition or unexpected behavior on HW.
> >>
> >>
> >>>> +  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> >>>> +  if (global_status)
> >>>> +          wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >>> This seems especially silly, isn't the full MSR being written below?  Or am I
> >>> misunderstanding how these things work?
> >> I think Jim's comment has already explain why we need to do this.
> >>
> >>>> +  wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> >>>> +
> >>>> +  for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>> +          pmc = &pmu->gp_counters[i];
> >>>> +          wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> >>>> +          wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> >>>> +  }
> >>>> +
> >>>> +  wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>> +  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>> +          pmc = &pmu->fixed_counters[i];
> >>>> +          wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> >>>> +  }
> >>>>    }
> >>>>    struct kvm_pmu_ops intel_pmu_ops __initdata = {
> >>>> --
> >>>> 2.34.1
> >>>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15  6:06           ` Mingwei Zhang
@ 2024-04-15 10:04             ` Mi, Dapeng
  2024-04-15 16:44               ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-15 10:04 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
>>> On Sat, Apr 13, 2024, Mi, Dapeng wrote:
>>>> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
>>>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>>>>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>
>>>>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
>>>>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
>>>>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
>>>>>> and gains the full HW PMU ownership. On the contrary, host regains the
>>>>>> ownership of PMU HW from KVM when control flow leaves the scope of
>>>>>> passthrough PMU.
>>>>>>
>>>>>> Implement PMU context switches for Intel CPUs and opptunistically use
>>>>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
>>>>>> lower latency in Intel CPUs.
>>>>>>
>>>>>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
>>>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> ---
>>>>>>     arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 73 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> index 0d58fe7d243e..f79bebe7093d 100644
>>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>>>>>>     static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>>>>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
>>>> Yeah. It looks clearer.
>>>>>>     {
>>>>>> +  struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>>>> +  struct kvm_pmc *pmc;
>>>>>> +  u32 i;
>>>>>> +
>>>>>> +  if (pmu->version != 2) {
>>>>>> +          pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>>>> +          return;
>>>>>> +  }
>>>>>> +
>>>>>> +  /* Global ctrl register is already saved at VM-exit. */
>>>>>> +  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>>>>>> +  /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>>>>>> +  if (pmu->global_status)
>>>>>> +          wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>>>>>> +
>>>>>> +  for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>>>> +          pmc = &pmu->gp_counters[i];
>>>>>> +          rdpmcl(i, pmc->counter);
>>>>>> +          rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>>>>>> +          /*
>>>>>> +           * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>>>>>> +           * leakage and also avoid this guest GP counter get accidentally
>>>>>> +           * enabled during host running when host enable global ctrl.
>>>>>> +           */
>>>>>> +          if (pmc->eventsel)
>>>>>> +                  wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>>>>>> +          if (pmc->counter)
>>>>>> +                  wrmsrl(MSR_IA32_PMC0 + i, 0);
>>>>> This doesn't make much sense.  The kernel already has full access to the guest,
>>>>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
>>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
>>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
>>>> unexpectedly on host later since Host perf always enable all validate bits
>>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>>>>
>>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
>>>>
>>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
>>> values to the host? NO. Not in cloud usage.
>> No, this place is clearing the guest counter value instead of host
>> counter value. Host always has method to see guest value in a normal VM
>> if he want. I don't see its necessity, it's just a overkill and
>> introduce extra overhead to write MSRs.
>>
> I am curious how the perf subsystem solves the problem? Does perf
> subsystem in the host only scrubbing the selector but not the counter
> value when doing the context switch?

When a context switch happens, perf code schedules out the old events 
and schedules in the new events. When scheduling out, the ENABLE bit of 
the EVENTSELx MSR is cleared, and when scheduling in, the EVENTSELx 
and PMCx MSRs are overwritten with the new event's attr.config and 
sample_period respectively.  Of course, this is only for the case where 
there are new events to be programmed on the PMC. If there are no new 
events, the PMCx MSR keeps its stale value and won't be cleared.

Anyway, I don't see any reason that PMCx MSR must be cleared.
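(For reference, a minimal sketch of the schedule-out/schedule-in behavior
described above; the sketch_* helpers and the cntval_mask parameter are
illustrative assumptions, not the actual arch/x86/events code.)

#include <asm/msr.h>		/* wrmsrl() */
#include <asm/perf_event.h>	/* MSR_* and ARCH_PERFMON_* definitions */

/* Schedule out: only the ENABLE bit of EVENTSELx is cleared. */
static void sketch_pmc_sched_out(int idx, u64 eventsel)
{
	wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx,
	       eventsel & ~ARCH_PERFMON_EVENTSEL_ENABLE);
	/* The PMCx value itself is left untouched. */
}

/* Schedule in: reprogram the period and selector for the new event. */
static void sketch_pmc_sched_in(int idx, u64 config, u64 period, u64 cntval_mask)
{
	wrmsrl(MSR_IA32_PMC0 + idx, (0ULL - period) & cntval_mask);
	wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx,
	       config | ARCH_PERFMON_EVENTSEL_ENABLE);
	/* A PMC that gets no new event simply keeps its stale value. */
}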



>
>>> Please make changes to this patch with **extreme** caution.
>>>
>>> According to our past experience, if there is a bug somewhere,
>>> there is a catch here (normally).
>>>
>>> Thanks.
>>> -Mingwei
>>>>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
>>>>> the event selector, we gots problems.
>>>>>
>>>>> Same thing for the fixed counters below.  Can't this just be?
>>>>>
>>>>>      for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>>>>>              rdpmcl(i, pmu->gp_counters[i].counter);
>>>>>
>>>>>      for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>>>>              rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
>>>>>                     pmu->fixed_counters[i].counter);
>>>>>
>>>>>> +  }
>>>>>> +
>>>>>> +  rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>>>> +  /*
>>>>>> +   * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>>>>>> +   * also avoid these guest fixed counters get accidentially enabled
>>>>>> +   * during host running when host enable global ctrl.
>>>>>> +   */
>>>>>> +  if (pmu->fixed_ctr_ctrl)
>>>>>> +          wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>>>>> +  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>>>> +          pmc = &pmu->fixed_counters[i];
>>>>>> +          rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>>>>>> +          if (pmc->counter)
>>>>>> +                  wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>>>>>> +  }
>>>>>>     }
>>>>>>     static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>>>>>>     {
>>>>>> +  struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>>>> +  struct kvm_pmc *pmc;
>>>>>> +  u64 global_status;
>>>>>> +  int i;
>>>>>> +
>>>>>> +  if (pmu->version != 2) {
>>>>>> +          pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>>>> +          return;
>>>>>> +  }
>>>>>> +
>>>>>> +  /* Clear host global_ctrl and global_status MSR if non-zero. */
>>>>>> +  wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>>>> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>>>> As previous comments say, host perf always enable all counters in
>>>> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
>>>> ensure all counters in disabled state and the later counter manipulation
>>>> (writing MSR) won't cause any race condition or unexpected behavior on HW.
>>>>
>>>>
>>>>>> +  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>>>>>> +  if (global_status)
>>>>>> +          wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>>>>> This seems especially silly, isn't the full MSR being written below?  Or am I
>>>>> misunderstanding how these things work?
>>>> I think Jim's comment has already explain why we need to do this.
>>>>
>>>>>> +  wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>>>>>> +
>>>>>> +  for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>>>> +          pmc = &pmu->gp_counters[i];
>>>>>> +          wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
>>>>>> +          wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>>>>>> +  }
>>>>>> +
>>>>>> +  wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>>>> +  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>>>> +          pmc = &pmu->fixed_counters[i];
>>>>>> +          wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>>>>>> +  }
>>>>>>     }
>>>>>>     struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>>>>> --
>>>>>> 2.34.1
>>>>>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-15  1:06       ` Zhang, Xiong Y
@ 2024-04-15 15:05         ` Sean Christopherson
  2024-04-16  5:11           ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-15 15:05 UTC (permalink / raw)
  To: Xiong Y Zhang
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Mon, Apr 15, 2024, Xiong Y Zhang wrote:
> On 4/13/2024 2:32 AM, Sean Christopherson wrote:
> > On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
> >>>> 2. NMI watchdog
> >>>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
> >>>>    will be stopped also during vm running, but it doesn't have
> >>>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> >>>>    watchdog loses function during VM running.
> >>>>
> >>>>    Two candidates exist for replacing perf event of NMI watchdog:
> >>>>    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
> >>>>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
> >>>
> >>> I think the simplest solution is to allow mediated PMU usage if and only if
> >>> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> >>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> >>> problem to solve.
> >> Make sense. KVM should not affect host high priority work.
> >> NMI watchdog is a client of perf and is a system wide perf event, perf can't
> >> distinguish a system wide perf event is NMI watchdog or others, so how about
> >> we extend this suggestion to all the system wide perf events ?  mediated PMU
> >> is only allowed when all system wide perf events are disabled or non-exist at
> >> vm creation.
> > 
> > What other kernel-driven system wide perf events are there?
> does "kernel-driven" mean perf events created through
> perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ?

By kernel-driven I meant events that aren't tied to a single userspace process
or action.

E.g. KVM creates events, but those events are effectively user-driven because
they will go away if the associated VM terminates.

> User can create system wide perf event through "perf record -e {} -a" also, I
> call it as user-driven system wide perf events.  Perf subsystem doesn't
> distinguish "kernel-driven" and "user-driven" system wide perf events.

Right, but us humans can build a list, even if it's only for documentation, e.g.
to provide help for someone to run KVM guests with mediated PMUs, but can't
because there are active !exclude_guest events.

> >> but NMI watchdog is usually enabled, this will limit mediated PMU usage.
> > 
> > I don't think it is at all unreasonable to require users that want optimal PMU
> > virtualization to adjust their environment.  And we can and should document the
> > tradeoffs and alternatives, e.g. so that users that want better PMU results don't
> > need to re-discover all the "gotchas" on their own.
> > 
> > This would even be one of the rare times where I would be ok with a dmesg log.
> > E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
> > perf events, pr_warn() to explain the conflict and direct the user at documentation
> > explaining how to make their system compatible with mediate PMU usage.
> >>>> 3. Dedicated kvm_pmi_vector
> >>>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> >>>>    PMI into guest when physical PMI belongs to guest counter. If the
> >>>>    same mechanism is used in passthrough vPMU and PMI skid exists
> >>>>    which cause physical PMI belonging to guest happens after VM-exit,
> >>>>    then the host PMI handler couldn't identify this PMI belongs to
> >>>>    host or guest.
> >>>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> >>>>    has this vector only. The PMI belonging to host still has an NMI
> >>>>    vector.
> >>>>
> >>>>    Without considering PMI skid especially for AMD, the host NMI vector
> >>>>    could be used for guest PMI also, this method is simpler and doesn't
> >>>
> >>> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> >>> the complexity is a wash, just in different locations, and I highly doubt it's
> >>> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> >>> LVTPC.
> >> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
> >> NMI between guest and host, we could extend guest PT's PMI framework to
> >> mediated PMU. so I think this is simpler.
> > 
> > Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
> > PT framework of multiplexing NMI?
> here "this" means "extending the PT framework of multiplexing NMI".

The PT framework's multiplexing is just as crude as regular PMIs though.  Perf
basically just asks KVM: is this yours?  And KVM simply checks that the callback
occurred while KVM_HANDLING_NMI is set.

E.g. prior to commit 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region"),
nothing would prevent perf from misconstruing a host PMI as a guest PMI, because
KVM re-enabled host PT prior to servicing guest NMIs, i.e. host PT would be active
while KVM_HANDLING_NMI is set.

And conversely, if a guest PMI skids past VM-Exit, as things currently stand, the
NMI will always be treated as host PMI, because KVM will not be in KVM_HANDLING_NMI.
KVM's emulated PMI can (and should) eliminate false positives for host PMIs by
precisely checking exclude_guest, but that doesn't help with false negatives for
guest PMIs, nor does it help with NMIs that aren't perf related, i.e. didn't come
from the LVTPC.

Is a naive implementation simpler?  Maybe.  But IMO, multiplexing NMI and getting
all the edge cases right is more complex than using a dedicated vector for guest
PMIs, as the latter provides a "hard" boundary and allows the kernel to _know_ that
an interrupt is for a guest PMI.
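(For contrast, a rough sketch of the crude "is this yours?" heuristic
described above; the helper name is made up, and the
handling_intr_from_guest check is only an approximation of the
KVM_HANDLING_NMI test.)

/* Multiplexed NMI: "guest PMI" == "KVM was handling an interrupt/NMI from
 * the guest when the NMI arrived".  A guest PMI that skids past VM-Exit
 * fails this test and is misattributed to the host. */
static bool sketch_nmi_is_guest_pmi(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	return vcpu && READ_ONCE(vcpu->arch.handling_intr_from_guest);
}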

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-12 20:56         ` Liang, Kan
@ 2024-04-15 16:03           ` Liang, Kan
  2024-04-16  5:34             ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-15 16:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-12 4:56 p.m., Liang, Kan wrote:
>> What if perf had a global knob to enable/disable mediate PMU support?  Then when
>> KVM is loaded with enable_mediated_true, call into perf to (a) check that there
>> are no existing !exclude_guest events (this part could be optional), and (b) set
>> the global knob to reject all new !exclude_guest events (for the core PMU?).
>>
>> Hmm, or probably better, do it at VM creation.  That has the advantage of playing
>> nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
>> KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
>> want to run VMs.
> I think it should be doable, and may simplify the perf implementation.
> (The check in the schedule stage should not be necessary anymore.)
> 
> With this, something like NMI watchdog should fail the VM creation. The
> user should either disable the NMI watchdog or use a replacement.
> 
> Thanks,
> Kan
>> E.g. (very roughly)
>>
>> int x86_perf_get_mediated_pmu(void)
>> {
>> 	if (refcount_inc_not_zero(...))
>> 		return 0;
>>
>> 	if (<system wide events>)
>> 		return -EBUSY;
>>
>> 	<slow path with locking>
>> }
>>
>> void x86_perf_put_mediated_pmu(void)
>> {
>> 	if (!refcount_dec_and_test(...))
>> 		return;
>>
>> 	<slow path with locking>
>> }


I think the locking should include the refcount check and the system-wide
event check as well.
It is possible that two VMs are created very close together; without the
lock, the second creation may mistakenly return 0.

I plan to do something as below (not tested yet).

+/*
+ * Currently invoked at VM creation to
+ * - Check whether there are existing !exclude_guest system wide events
+ *   of PMU with PERF_PMU_CAP_MEDIATED_VPMU
+ * - Set nr_mediated_pmu to prevent !exclude_guest event creation on
+ *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
+ *
+ * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
+ * still owns all the PMU resources.
+ */
+int x86_perf_get_mediated_pmu(void)
+{
+	int ret = 0;
+	mutex_lock(&perf_mediated_pmu_mutex);
+	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
+		goto end;
+
+	if (atomic_read(&nr_include_guest_events)) {
+		ret = -EBUSY;
+		goto end;
+	}
+	refcount_inc(&nr_mediated_pmu_vms);
+end:
+	mutex_unlock(&perf_mediated_pmu_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_mediated_pmu);
+
+void x86_perf_put_mediated_pmu(void)
+{
+	mutex_lock(&perf_mediated_pmu_mutex);
+	refcount_dec(&nr_mediated_pmu_vms);
+	mutex_unlock(&perf_mediated_pmu_mutex);
+}
+EXPORT_SYMBOL_GPL(x86_perf_put_mediated_pmu);


Thanks,
Kan
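
(For illustration, a minimal sketch of how KVM might pair these calls over
a VM's lifetime, assuming the interfaces above; the KVM-side hook names and
the enable_mediated_pmu knob are assumptions, not part of this RFC.)

static int kvm_mediated_pmu_vm_init(struct kvm *kvm)
{
	int ret;

	if (!enable_mediated_pmu)	/* assumed module parameter */
		return 0;

	/* Fails with -EBUSY if !exclude_guest system-wide events exist. */
	ret = x86_perf_get_mediated_pmu();
	if (ret)
		pr_warn("mediated PMU unavailable: disable the NMI watchdog and other !exclude_guest system-wide events\n");
	return ret;
}

static void kvm_mediated_pmu_vm_destroy(struct kvm *kvm)
{
	if (enable_mediated_pmu)
		x86_perf_put_mediated_pmu();
}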

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15 10:04             ` Mi, Dapeng
@ 2024-04-15 16:44               ` Mingwei Zhang
  2024-04-15 17:38                 ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-15 16:44 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> >>> On Sat, Apr 13, 2024, Mi, Dapeng wrote:
> >>>> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> >>>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >>>>>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>
> >>>>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> >>>>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> >>>>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> >>>>>> and gains the full HW PMU ownership. On the contrary, host regains the
> >>>>>> ownership of PMU HW from KVM when control flow leaves the scope of
> >>>>>> passthrough PMU.
> >>>>>>
> >>>>>> Implement PMU context switches for Intel CPUs and opptunistically use
> >>>>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
> >>>>>> lower latency in Intel CPUs.
> >>>>>>
> >>>>>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> >>>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> ---
> >>>>>>     arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> >>>>>>     1 file changed, 73 insertions(+)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> >>>>>> index 0d58fe7d243e..f79bebe7093d 100644
> >>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
> >>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> >>>>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >>>>>>     static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> >>>>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> >>>> Yeah. It looks clearer.
> >>>>>>     {
> >>>>>> +  struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>>>> +  struct kvm_pmc *pmc;
> >>>>>> +  u32 i;
> >>>>>> +
> >>>>>> +  if (pmu->version != 2) {
> >>>>>> +          pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>>>> +          return;
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  /* Global ctrl register is already saved at VM-exit. */
> >>>>>> +  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> >>>>>> +  /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> >>>>>> +  if (pmu->global_status)
> >>>>>> +          wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> >>>>>> +
> >>>>>> +  for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>>>> +          pmc = &pmu->gp_counters[i];
> >>>>>> +          rdpmcl(i, pmc->counter);
> >>>>>> +          rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> >>>>>> +          /*
> >>>>>> +           * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> >>>>>> +           * leakage and also avoid this guest GP counter get accidentally
> >>>>>> +           * enabled during host running when host enable global ctrl.
> >>>>>> +           */
> >>>>>> +          if (pmc->eventsel)
> >>>>>> +                  wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> >>>>>> +          if (pmc->counter)
> >>>>>> +                  wrmsrl(MSR_IA32_PMC0 + i, 0);
> >>>>> This doesn't make much sense.  The kernel already has full access to the guest,
> >>>>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
> >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> >>>> unexpectedly on host later since Host perf always enable all validate bits
> >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> >>>>
> >>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
> >>>>
> >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> >>> values to the host? NO. Not in cloud usage.
> >> No, this place is clearing the guest counter value instead of host
> >> counter value. Host always has method to see guest value in a normal VM
> >> if he want. I don't see its necessity, it's just a overkill and
> >> introduce extra overhead to write MSRs.
> >>
> > I am curious how the perf subsystem solves the problem? Does perf
> > subsystem in the host only scrubbing the selector but not the counter
> > value when doing the context switch?
>
> When context switch happens, perf code would schedule out the old events
> and schedule in the new events. When scheduling out, the ENABLE bit of
> EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> and PMCx MSRs would be overwritten with new event's attr.config and
> sample_period separately.  Of course, these is only for the case when
> there are new events to be programmed on the PMC. If no new events, the
> PMCx MSR would keep stall value and won't be cleared.
>
> Anyway, I don't see any reason that PMCx MSR must be cleared.
>

I don't have a strong opinion on the upstream version. But since both
the mediated vPMU and perf are clients of the PMU HW, leaving PMC values
uncleared when transitioning out of the vPMU boundary is technically
leaking information.

Alternatively, doing the clearing at the vcpu loop boundary should be
sufficient if performance overhead is the concern.
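
(A minimal sketch of that alternative; the hook name and where it is wired
up are assumptions, not this RFC's code.)

/* Scrub guest PMC values once when the vCPU leaves the run loop (e.g. from
 * a vcpu_put-style hook) rather than on every VM-exit, bounding the extra
 * MSR writes while still not leaking values outside the vPMU boundary. */
static void kvm_mediated_pmu_scrub_on_put(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	int i;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
		wrmsrl(MSR_IA32_PMC0 + i, 0);
	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
}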

Thanks
-Mingwei
>
>
> >
> >>> Please make changes to this patch with **extreme** caution.
> >>>
> >>> According to our past experience, if there is a bug somewhere,
> >>> there is a catch here (normally).
> >>>
> >>> Thanks.
> >>> -Mingwei
> >>>>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> >>>>> the event selector, we gots problems.
> >>>>>
> >>>>> Same thing for the fixed counters below.  Can't this just be?
> >>>>>
> >>>>>      for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> >>>>>              rdpmcl(i, pmu->gp_counters[i].counter);
> >>>>>
> >>>>>      for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> >>>>>              rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> >>>>>                     pmu->fixed_counters[i].counter);
> >>>>>
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>>>> +  /*
> >>>>>> +   * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> >>>>>> +   * also avoid these guest fixed counters get accidentially enabled
> >>>>>> +   * during host running when host enable global ctrl.
> >>>>>> +   */
> >>>>>> +  if (pmu->fixed_ctr_ctrl)
> >>>>>> +          wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> >>>>>> +  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>>>> +          pmc = &pmu->fixed_counters[i];
> >>>>>> +          rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> >>>>>> +          if (pmc->counter)
> >>>>>> +                  wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> >>>>>> +  }
> >>>>>>     }
> >>>>>>     static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >>>>>>     {
> >>>>>> +  struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>>>> +  struct kvm_pmc *pmc;
> >>>>>> +  u64 global_status;
> >>>>>> +  int i;
> >>>>>> +
> >>>>>> +  if (pmu->version != 2) {
> >>>>>> +          pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>>>> +          return;
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  /* Clear host global_ctrl and global_status MSR if non-zero. */
> >>>>>> +  wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >>>>> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >>>> As previous comments say, host perf always enable all counters in
> >>>> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> >>>> ensure all counters in disabled state and the later counter manipulation
> >>>> (writing MSR) won't cause any race condition or unexpected behavior on HW.
> >>>>
> >>>>
> >>>>>> +  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> >>>>>> +  if (global_status)
> >>>>>> +          wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >>>>> This seems especially silly, isn't the full MSR being written below?  Or am I
> >>>>> misunderstanding how these things work?
> >>>> I think Jim's comment has already explain why we need to do this.
> >>>>
> >>>>>> +  wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> >>>>>> +
> >>>>>> +  for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>>>> +          pmc = &pmu->gp_counters[i];
> >>>>>> +          wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> >>>>>> +          wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>>>> +  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>>>> +          pmc = &pmu->fixed_counters[i];
> >>>>>> +          wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> >>>>>> +  }
> >>>>>>     }
> >>>>>>     struct kvm_pmu_ops intel_pmu_ops __initdata = {
> >>>>>> --
> >>>>>> 2.34.1
> >>>>>>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15 16:44               ` Mingwei Zhang
@ 2024-04-15 17:38                 ` Sean Christopherson
  2024-04-15 17:54                   ` Mingwei Zhang
  2024-04-18 21:21                   ` Mingwei Zhang
  0 siblings, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-15 17:38 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Dapeng Mi, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > >>>>
> > >>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
> > >>>>
> > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > >>> values to the host? NO. Not in cloud usage.
> > >> No, this place is clearing the guest counter value instead of host
> > >> counter value. Host always has method to see guest value in a normal VM
> > >> if he want. I don't see its necessity, it's just a overkill and
> > >> introduce extra overhead to write MSRs.
> > >>
> > > I am curious how the perf subsystem solves the problem? Does perf
> > > subsystem in the host only scrubbing the selector but not the counter
> > > value when doing the context switch?
> >
> > When context switch happens, perf code would schedule out the old events
> > and schedule in the new events. When scheduling out, the ENABLE bit of
> > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > and PMCx MSRs would be overwritten with new event's attr.config and
> > sample_period separately.  Of course, these is only for the case when
> > there are new events to be programmed on the PMC. If no new events, the
> > PMCx MSR would keep stall value and won't be cleared.
> >
> > Anyway, I don't see any reason that PMCx MSR must be cleared.
> >
> 
> I don't have a strong opinion on the upstream version. But since both
> the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> uncleared when transition out of the vPMU boundary is leaking info
> technically.

I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
in the guest's TCB, which is what I would consider a true leak.  I'm objecting
to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
PMCs when saving guest state without coordinating with perf in any way.

I am ok if we start with (or default to) a "safe" implementation that zeroes all
PMCs when switching to host context, but I want KVM and perf to work together to
do the context switches, e.g. so that we don't end up with code where KVM writes
to all PMC MSRs and that perf also immediately writes to all PMC MSRs.

One of my biggest complaints with the current vPMU code is that the roles and
responsibilities between KVM and perf are poorly defined, which leads to suboptimal
and hard to maintain code.

Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
state to userspace processes that have RDPMC permissions, as the PMCs might not
be dirty from perf's perspective (see perf_clear_dirty_counters()).

Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
or if KVM is just being paranoid.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15 17:38                 ` Sean Christopherson
@ 2024-04-15 17:54                   ` Mingwei Zhang
  2024-04-15 22:45                     ` Sean Christopherson
  2024-04-18 21:21                   ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-15 17:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dapeng Mi, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > > >>>>
> > > >>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
> > > >>>>
> > > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > > >>> values to the host? NO. Not in cloud usage.
> > > >> No, this place is clearing the guest counter value instead of host
> > > >> counter value. Host always has method to see guest value in a normal VM
> > > >> if he want. I don't see its necessity, it's just a overkill and
> > > >> introduce extra overhead to write MSRs.
> > > >>
> > > > I am curious how the perf subsystem solves the problem? Does perf
> > > > subsystem in the host only scrubbing the selector but not the counter
> > > > value when doing the context switch?
> > >
> > > When context switch happens, perf code would schedule out the old events
> > > and schedule in the new events. When scheduling out, the ENABLE bit of
> > > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > > and PMCx MSRs would be overwritten with new event's attr.config and
> > > sample_period separately.  Of course, these is only for the case when
> > > there are new events to be programmed on the PMC. If no new events, the
> > > PMCx MSR would keep stall value and won't be cleared.
> > >
> > > Anyway, I don't see any reason that PMCx MSR must be cleared.
> > >
> >
> > I don't have a strong opinion on the upstream version. But since both
> > the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> > uncleared when transition out of the vPMU boundary is leaking info
> > technically.
>
> I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
> in the guest's TCB, which is what I would consider a true leak.  I'm objecting
> to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
> PMCs when saving guest state without coordinating with perf in any way.
>
> I am ok if we start with (or default to) a "safe" implementation that zeroes all
> PMCs when switching to host context, but I want KVM and perf to work together to
> do the context switches, e.g. so that we don't end up with code where KVM writes
> to all PMC MSRs and that perf also immediately writes to all PMC MSRs.

I am fully aligned with you on this.

>
> One my biggest complaints with the current vPMU code is that the roles and
> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> and hard to maintain code.
>
> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> state to userspace processes that have RDPMC permissions, as the PMCs might not
> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>
> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> or if KVM is just being paranoid.

So once this rolls out, perf and the vPMU are both direct clients of the PMU HW.
Faithful cleaning (blind cleaning) has to be the baseline
implementation, until both clients agree to a "deal" between them.
Currently, there is no such deal, but I believe we could have one via
future discussion.

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15 17:54                   ` Mingwei Zhang
@ 2024-04-15 22:45                     ` Sean Christopherson
  2024-04-22  2:14                       ` maobibo
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-15 22:45 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Dapeng Mi, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <seanjc@google.com> wrote:
> > One my biggest complaints with the current vPMU code is that the roles and
> > responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> > and hard to maintain code.
> >
> > Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> > state to userspace processes that have RDPMC permissions, as the PMCs might not
> > be dirty from perf's perspective (see perf_clear_dirty_counters()).
> >
> > Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> > overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> > or if KVM is just being paranoid.
> 
> So once this rolls out, perf and vPMU are clients directly to PMU HW.

I don't think this is a statement we want to make, as it opens a discussion
that we won't win.  Nor do I think it's one we *need* to make.  KVM doesn't need
to be on equal footing with perf in terms of owning/managing PMU hardware, KVM
just needs a few APIs to allow faithfully and accurately virtualizing a guest PMU.

> Faithful cleaning (blind cleaning) has to be the baseline
> implementation, until both clients agree to a "deal" between them.
> Currently, there is no such deal, but I believe we could have one via
> future discussion.

What I am saying is that there needs to be a "deal" in place before this code
is merged.  It doesn't need to be anything fancy, e.g. perf can still pave over
PMCs it doesn't immediately load, as opposed to using cpu_hw_events.dirty to lazily
do the clearing.  But perf and KVM need to work together from the get go, i.e. I
don't want KVM doing something without regard to what perf does, and vice versa.
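
(For illustration, one minimal shape such a "deal" could take;
perf_put_guest_pmcs() is an assumed interface, while cpu_hw_events.dirty and
perf_clear_dirty_counters() are the existing perf concepts mentioned above.)

/* Instead of KVM blindly zeroing PMCs, it hands perf a mask of the counters
 * the guest dirtied and lets perf decide whether to pave over them now or
 * clear them lazily before an RDPMC-enabled task runs. */
void perf_put_guest_pmcs(unsigned long *guest_dirty_mask)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
	int idx;

	for_each_set_bit(idx, guest_dirty_mask, X86_PMC_IDX_MAX)
		__set_bit(idx, cpuc->dirty);
}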

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-15 15:05         ` Sean Christopherson
@ 2024-04-16  5:11           ` Zhang, Xiong Y
  0 siblings, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-16  5:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 4/15/2024 11:05 PM, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Xiong Y Zhang wrote:
>> On 4/13/2024 2:32 AM, Sean Christopherson wrote:
>>> On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
>>>>>> 2. NMI watchdog
>>>>>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
>>>>>>    will be stopped also during vm running, but it doesn't have
>>>>>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
>>>>>>    watchdog loses function during VM running.
>>>>>>
>>>>>>    Two candidates exist for replacing perf event of NMI watchdog:
>>>>>>    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
>>>>>>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
>>>>>
>>>>> I think the simplest solution is to allow mediated PMU usage if and only if
>>>>> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
>>>>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
>>>>> problem to solve.
>>>> Make sense. KVM should not affect host high priority work.
>>>> NMI watchdog is a client of perf and is a system wide perf event, perf can't
>>>> distinguish a system wide perf event is NMI watchdog or others, so how about
>>>> we extend this suggestion to all the system wide perf events ?  mediated PMU
>>>> is only allowed when all system wide perf events are disabled or non-exist at
>>>> vm creation.
>>>
>>> What other kernel-driven system wide perf events are there?
>> does "kernel-driven" mean perf events created through
>> perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ?
> 
> By kernel-driven I meant events that aren't tied to a single userspace process
> or action.
> 
> E.g. KVM creates events, but those events are effectively user-driven because
> they will go away if the associated VM terminates.
> 
>> User can create system wide perf event through "perf record -e {} -a" also, I
>> call it as user-driven system wide perf events.  Perf subsystem doesn't
>> distinguish "kernel-driven" and "user-driven" system wide perf events.
> 
> Right, but us humans can build a list, even if it's only for documentation, e.g.
> to provide help for someone to run KVM guests with mediated PMUs, but can't
> because there are active !exclude_guest events.
> 
>>>> but NMI watchdog is usually enabled, this will limit mediated PMU usage.
>>>
>>> I don't think it is at all unreasonable to require users that want optimal PMU
>>> virtualization to adjust their environment.  And we can and should document the
>>> tradeoffs and alternatives, e.g. so that users that want better PMU results don't
>>> need to re-discover all the "gotchas" on their own.
>>>
>>> This would even be one of the rare times where I would be ok with a dmesg log.
>>> E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
>>> perf events, pr_warn() to explain the conflict and direct the user at documentation
>>> explaining how to make their system compatible with mediate PMU usage.
>>>>>> 3. Dedicated kvm_pmi_vector
>>>>>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
>>>>>>    PMI into guest when physical PMI belongs to guest counter. If the
>>>>>>    same mechanism is used in passthrough vPMU and PMI skid exists
>>>>>>    which cause physical PMI belonging to guest happens after VM-exit,
>>>>>>    then the host PMI handler couldn't identify this PMI belongs to
>>>>>>    host or guest.
>>>>>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>>>>>>    has this vector only. The PMI belonging to host still has an NMI
>>>>>>    vector.
>>>>>>
>>>>>>    Without considering PMI skid especially for AMD, the host NMI vector
>>>>>>    could be used for guest PMI also, this method is simpler and doesn't
>>>>>
>>>>> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
>>>>> the complexity is a wash, just in different locations, and I highly doubt it's
>>>>> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
>>>>> LVTPC.
>>>> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
>>>> NMI between guest and host, we could extend guest PT's PMI framework to
>>>> mediated PMU. so I think this is simpler.
>>>
>>> Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
>>> PT framework of multiplexing NMI?
>> here "this" means "extending the PT framework of multiplexing NMI".
> 
> The PT framework's multiplexing is just as crude as regular PMIs though.  Perf
> basically just asks KVM: is this yours?  And KVM simply checks that the callback
> occurred while KVM_HANDLING_NMI is set.
> 
> E.g. prior to commit 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region"),
> nothing would prevent perf from miscontruing a host PMI as a guest PMI, because
> KVM re-enabled host PT prior to servicing guest NMIs, i.e. host PT would be active
> while KVM_HANDLING_NMI is set.
> 
> And conversely, if a guest PMI skids past VM-Exit, as things currently stand, the
> NMI will always be treated as host PMI, because KVM will not be in KVM_HANDLING_NMI.
> KVM's emulated PMI can (and should) eliminate false positives for host PMIs by
> precisely checking exclude_guest, but that doesn't help with false negatives for
> guest PMIs, nor does it help with NMIs that aren't perf related, i.e. didn't come
> from the LVTPC.
> 
> Is a naive implementation simpler?  Maybe.  But IMO, multiplexing NMI and getting
> all the edge cases right is more complex than using a dedicated vector for guest
> PMIs, as the latter provides a "hard" boundary and allows the kernel to _know_ that
> an interrupt is for a guest PMI.
Totally agree about the complexity of fixing the multiplexing-NMI corner cases. Thanks for the explanation.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-15 16:03           ` Liang, Kan
@ 2024-04-16  5:34             ` Zhang, Xiong Y
  2024-04-16 12:48               ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-16  5:34 UTC (permalink / raw)
  To: Liang, Kan, Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 4/16/2024 12:03 AM, Liang, Kan wrote:
> 
> 
> On 2024-04-12 4:56 p.m., Liang, Kan wrote:
>>> What if perf had a global knob to enable/disable mediate PMU support?  Then when
>>> KVM is loaded with enable_mediated_true, call into perf to (a) check that there
>>> are no existing !exclude_guest events (this part could be optional), and (b) set
>>> the global knob to reject all new !exclude_guest events (for the core PMU?).
>>>
>>> Hmm, or probably better, do it at VM creation.  That has the advantage of playing
>>> nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
>>> KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
>>> want to run VMs.
>> I think it should be doable, and may simplify the perf implementation.
>> (The check in the schedule stage should not be necessary anymore.)
>>
>> With this, something like NMI watchdog should fail the VM creation. The
>> user should either disable the NMI watchdog or use a replacement.
>>
>> Thanks,
>> Kan
>>> E.g. (very roughly)
>>>
>>> int x86_perf_get_mediated_pmu(void)
>>> {
>>> 	if (refcount_inc_not_zero(...))
>>> 		return 0;
>>>
>>> 	if (<system wide events>)
>>> 		return -EBUSY;
>>>
>>> 	<slow path with locking>
>>> }
>>>
>>> void x86_perf_put_mediated_pmu(void)
>>> {
>>> 	if (!refcount_dec_and_test(...))
>>> 		return;
>>>
>>> 	<slow path with locking>
>>> }
> 
> 
> I think the locking should include the refcount check and system wide
> event check as well.
> It should be possible that two VMs are created very close.
> The second creation may mistakenly return 0 if there is no lock.
> 
> I plan to do something as below (not test yet).
> 
> +/*
> + * Currently invoked at VM creation to
> + * - Check whether there are existing !exclude_guest system wide events
> + *   of PMU with PERF_PMU_CAP_MEDIATED_VPMU
> + * - Set nr_mediated_pmu to prevent !exclude_guest event creation on
> + *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
> + *
> + * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
> + * still owns all the PMU resources.
> + */
> +int x86_perf_get_mediated_pmu(void)
> +{
> +	int ret = 0;
> +	mutex_lock(&perf_mediated_pmu_mutex);
> +	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
> +		goto end;
> +
> +	if (atomic_read(&nr_include_guest_events)) {
> +		ret = -EBUSY;
> +		goto end;
> +	}
> +	refcount_inc(&nr_mediated_pmu_vms);
> +end:
> +	mutex_unlock(&perf_mediated_pmu_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(x86_perf_get_mediated_pmu);
> +
> +void x86_perf_put_mediated_pmu(void)
> +{
> +	mutex_lock(&perf_mediated_pmu_mutex);
> +	refcount_dec(&nr_mediated_pmu_vms);
> +	mutex_unlock(&perf_mediated_pmu_mutex);
> +}
> +EXPORT_SYMBOL_GPL(x86_perf_put_mediated_pmu);
> 
> 
> Thanks,
> Kan
x86_perf_get_mediated_pmu() is called at vm_create() and x86_perf_put_mediated_pmu() at vm_destroy(), so system-wide perf events without exclude_guest=1 cannot be created during the whole VM life cycle (where nr_mediated_pmu_vms > 0 always). Do I understand and use the interface correctly?

thanks

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-16  5:34             ` Zhang, Xiong Y
@ 2024-04-16 12:48               ` Liang, Kan
  2024-04-17  9:42                 ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-16 12:48 UTC (permalink / raw)
  To: Zhang, Xiong Y, Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024-04-16 1:34 a.m., Zhang, Xiong Y wrote:
> 
> 
> On 4/16/2024 12:03 AM, Liang, Kan wrote:
>>
>>
>> On 2024-04-12 4:56 p.m., Liang, Kan wrote:
>>>> What if perf had a global knob to enable/disable mediate PMU support?  Then when
>>>> KVM is loaded with enable_mediated_true, call into perf to (a) check that there
>>>> are no existing !exclude_guest events (this part could be optional), and (b) set
>>>> the global knob to reject all new !exclude_guest events (for the core PMU?).
>>>>
>>>> Hmm, or probably better, do it at VM creation.  That has the advantage of playing
>>>> nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
>>>> KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
>>>> want to run VMs.
>>> I think it should be doable, and may simplify the perf implementation.
>>> (The check in the schedule stage should not be necessary anymore.)
>>>
>>> With this, something like NMI watchdog should fail the VM creation. The
>>> user should either disable the NMI watchdog or use a replacement.
>>>
>>> Thanks,
>>> Kan
>>>> E.g. (very roughly)
>>>>
>>>> int x86_perf_get_mediated_pmu(void)
>>>> {
>>>> 	if (refcount_inc_not_zero(...))
>>>> 		return 0;
>>>>
>>>> 	if (<system wide events>)
>>>> 		return -EBUSY;
>>>>
>>>> 	<slow path with locking>
>>>> }
>>>>
>>>> void x86_perf_put_mediated_pmu(void)
>>>> {
>>>> 	if (!refcount_dec_and_test(...))
>>>> 		return;
>>>>
>>>> 	<slow path with locking>
>>>> }
>>
>>
>> I think the locking should include the refcount check and system wide
>> event check as well.
>> It should be possible that two VMs are created very close.
>> The second creation may mistakenly return 0 if there is no lock.
>>
>> I plan to do something as below (not test yet).
>>
>> +/*
>> + * Currently invoked at VM creation to
>> + * - Check whether there are existing !exclude_guest system wide events
>> + *   of PMU with PERF_PMU_CAP_MEDIATED_VPMU
>> + * - Set nr_mediated_pmu to prevent !exclude_guest event creation on
>> + *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
>> + *
>> + * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
>> + * still owns all the PMU resources.
>> + */
>> +int x86_perf_get_mediated_pmu(void)
>> +{
>> +	int ret = 0;
>> +	mutex_lock(&perf_mediated_pmu_mutex);
>> +	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
>> +		goto end;
>> +
>> +	if (atomic_read(&nr_include_guest_events)) {
>> +		ret = -EBUSY;
>> +		goto end;
>> +	}
>> +	refcount_inc(&nr_mediated_pmu_vms);
>> +end:
>> +	mutex_unlock(&perf_mediated_pmu_mutex);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(x86_perf_get_mediated_pmu);
>> +
>> +void x86_perf_put_mediated_pmu(void)
>> +{
>> +	mutex_lock(&perf_mediated_pmu_mutex);
>> +	refcount_dec(&nr_mediated_pmu_vms);
>> +	mutex_unlock(&perf_mediated_pmu_mutex);
>> +}
>> +EXPORT_SYMBOL_GPL(x86_perf_put_mediated_pmu);
>>
>>
>> Thanks,
>> Kan
> x86_perf_get_mediated_pmu() is called at vm_create(), x86_perf_put_mediated_pmu() is called at vm_destroy(), then system wide perf events without exclude_guest=1 can not be created during the whole vm life cycle (where nr_mediated_pmu_vms > 0 always), do I understand and use the interface correctly ?

Right, but it only impacts events of PMUs with
PERF_PMU_CAP_MEDIATED_VPMU.
For other PMUs, events can still be created without the exclude_guest
restriction.
KVM should not touch the counters of a PMU without
PERF_PMU_CAP_MEDIATED_VPMU.

BTW: I will also remove the prefix x86, since the functions are in the
generic code.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-16 12:48               ` Liang, Kan
@ 2024-04-17  9:42                 ` Zhang, Xiong Y
  2024-04-18 16:11                   ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-17  9:42 UTC (permalink / raw)
  To: Liang, Kan, Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 4/16/2024 8:48 PM, Liang, Kan wrote:
> 
> 
> On 2024-04-16 1:34 a.m., Zhang, Xiong Y wrote:
>>
>>
>> On 4/16/2024 12:03 AM, Liang, Kan wrote:
>>>
>>>
>>> On 2024-04-12 4:56 p.m., Liang, Kan wrote:
>>>>> What if perf had a global knob to enable/disable mediate PMU support?  Then when
>>>>> KVM is loaded with enable_mediated_true, call into perf to (a) check that there
>>>>> are no existing !exclude_guest events (this part could be optional), and (b) set
>>>>> the global knob to reject all new !exclude_guest events (for the core PMU?).
>>>>>
>>>>> Hmm, or probably better, do it at VM creation.  That has the advantage of playing
>>>>> nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
>>>>> KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
>>>>> want to run VMs.
>>>> I think it should be doable, and may simplify the perf implementation.
>>>> (The check in the schedule stage should not be necessary anymore.)
>>>>
>>>> With this, something like NMI watchdog should fail the VM creation. The
>>>> user should either disable the NMI watchdog or use a replacement.
>>>>
>>>> Thanks,
>>>> Kan
>>>>> E.g. (very roughly)
>>>>>
>>>>> int x86_perf_get_mediated_pmu(void)
>>>>> {
>>>>> 	if (refcount_inc_not_zero(...))
>>>>> 		return 0;
>>>>>
>>>>> 	if (<system wide events>)
>>>>> 		return -EBUSY;
>>>>>
>>>>> 	<slow path with locking>
>>>>> }
>>>>>
>>>>> void x86_perf_put_mediated_pmu(void)
>>>>> {
>>>>> 	if (!refcount_dec_and_test(...))
>>>>> 		return;
>>>>>
>>>>> 	<slow path with locking>
>>>>> }
>>>
>>>
>>> I think the locking should include the refcount check and system wide
>>> event check as well.
>>> It should be possible that two VMs are created very close.
>>> The second creation may mistakenly return 0 if there is no lock.
>>>
>>> I plan to do something as below (not test yet).
>>>
>>> +/*
>>> + * Currently invoked at VM creation to
>>> + * - Check whether there are existing !exclude_guest system wide events
>>> + *   of PMU with PERF_PMU_CAP_MEDIATED_VPMU
>>> + * - Set nr_mediated_pmu to prevent !exclude_guest event creation on
>>> + *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
>>> + *
>>> + * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
>>> + * still owns all the PMU resources.
>>> + */
>>> +int x86_perf_get_mediated_pmu(void)
>>> +{
>>> +	int ret = 0;
>>> +	mutex_lock(&perf_mediated_pmu_mutex);
>>> +	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
>>> +		goto end;
>>> +
>>> +	if (atomic_read(&nr_include_guest_events)) {
>>> +		ret = -EBUSY;
>>> +		goto end;
>>> +	}
>>> +	refcount_inc(&nr_mediated_pmu_vms);
>>> +end:
>>> +	mutex_unlock(&perf_mediated_pmu_mutex);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(x86_perf_get_mediated_pmu);
>>> +
>>> +void x86_perf_put_mediated_pmu(void)
>>> +{
>>> +	mutex_lock(&perf_mediated_pmu_mutex);
>>> +	refcount_dec(&nr_mediated_pmu_vms);
>>> +	mutex_unlock(&perf_mediated_pmu_mutex);
>>> +}
>>> +EXPORT_SYMBOL_GPL(x86_perf_put_mediated_pmu);
>>>
>>>
>>> Thanks,
>>> Kan
>> x86_perf_get_mediated_pmu() is called at vm_create(), x86_perf_put_mediated_pmu() is called at vm_destroy(), then system wide perf events without exclude_guest=1 can not be created during the whole vm life cycle (where nr_mediated_pmu_vms > 0 always), do I understand and use the interface correctly ?
> 
> Right, but it only impacts the events of PMU with the
> PERF_PMU_CAP_MEDIATED_VPMU.
> For other PMUs, the event with exclude_guest=1 can still be created.
> KVM should not touch the counters of the PMU without
> PERF_PMU_CAP_MEDIATED_VPMU.
> 
> BTW: I will also remove the prefix x86, since the functions are in the
> generic code.
> 
> Thanks,
> Kan
After the userspace VMM calls the vCPU SET_CPUID() ioctl, KVM knows whether the vPMU is enabled or not. If perf_get_mediated_pmu() is called at VM creation, that is too early.
It would be better to let perf_get_mediated_pmu() track per-vCPU PMU state, so that KVM can call it after vcpu_cpuid_set(). Note that the userspace VMM may call SET_CPUID() on one vCPU multiple times, so a plain refcount may not be suitable here. What would be a better solution?
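
(Just a rough idea, not something in the RFC: the get could be made idempotent
per VM, so repeated SET_CPUID() calls take the mediated-PMU reference at most
once. The kvm->arch.mediated_pmu_acquired flag below is hypothetical:)

/* Hypothetical sketch: take the mediated-PMU reference at most once per VM,
 * no matter how many times userspace calls SET_CPUID() on its vCPUs. */
static int kvm_acquire_mediated_pmu(struct kvm *kvm)
{
	int ret = 0;

	mutex_lock(&kvm->lock);
	if (!kvm->arch.mediated_pmu_acquired) {
		ret = perf_get_mediated_pmu();
		if (!ret)
			kvm->arch.mediated_pmu_acquired = true;
	}
	mutex_unlock(&kvm->lock);

	return ret;
}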

thanks

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-17  9:42                 ` Zhang, Xiong Y
@ 2024-04-18 16:11                   ` Sean Christopherson
  2024-04-19  1:37                     ` Zhang, Xiong Y
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-18 16:11 UTC (permalink / raw)
  To: Xiong Y Zhang
  Cc: Kan Liang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Wed, Apr 17, 2024, Xiong Y Zhang wrote:
> On 4/16/2024 8:48 PM, Liang, Kan wrote:
> >> x86_perf_get_mediated_pmu() is called at vm_create(),
> >> x86_perf_put_mediated_pmu() is called at vm_destroy(), then system wide
> >> perf events without exclude_guest=1 can not be created during the whole vm
> >> life cycle (where nr_mediated_pmu_vms > 0 always), do I understand and use
> >> the interface correctly ?
> > 
> > Right, but it only impacts the events of PMU with the
> > PERF_PMU_CAP_MEDIATED_VPMU.  For other PMUs, the event with exclude_guest=1
> > can still be created.  KVM should not touch the counters of the PMU without
> > PERF_PMU_CAP_MEDIATED_VPMU.
> > 
> > BTW: I will also remove the prefix x86, since the functions are in the
> > generic code.
> > 
> > Thanks,
> > Kan
> After userspace VMM call VCPU SET_CPUID() ioctl, KVM knows whether vPMU is
> enabled or not. If perf_get_mediated_pmu() is called at vm create, it is too
> early.

Eh, if someone wants to create _only_ VMs without vPMUs, then they should load
KVM with enable_pmu=false.  I can see people complaining about not being able to
create VMs if they don't want to have *any* vPMU usage, but I doubt anyone
has a use case where they want a mix of PMU-enabled and PMU-disabled VMs, *and*
they are ok with VM creation failing for some VMs but not others.

> it is better to let perf_get_mediated_pmu() track per cpu PMU state,
> so perf_get_mediated_pmu() can be called by kvm after vcpu_cpuid_set(). Note
> user space vmm may call SET_CPUID() on one vcpu multi times, then here
> refcount maybe isn't suitable. 

Yeah, waiting until KVM_SET_CPUID2 would be unpleasant for both KVM and userspace.
E.g. failing KVM_SET_CPUID2 because KVM can't enable mediated PMU support would
be rather confusing for userspace.

> what's a better solution ?

If doing the checks at VM creation is a sticking point for folks, then the best
approach is probably to use KVM_CAP_PMU_CAPABILITY, i.e. require userspace to
explicitly opt-in to enabling mediated PMU usage.  Ha!  We can even do that
without additional uAPI, because KVM interprets cap->args[0]==0 as "enable vPMU".

The big problem with this is that enabling mediated PMU support by default would
break userspace.  Hmm, but that's arguably the case no matter what, as a setup
that worked before would suddenly start failing if the host was configured to use
the PMU-based NMI watchdog.

E.g. this, if we're ok committing to never enabling mediated PMU by default.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..01d9ee2114c8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6664,9 +6664,21 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
                        break;
 
                mutex_lock(&kvm->lock);
-               if (!kvm->created_vcpus) {
-                       kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
-                       r = 0;
+               /*
+                * To keep PMU configuration "simple", setting vPMU support is
+                * disallowed if vCPUs are created, or if mediated PMU support
+                * was already enabled for the VM.
+                */
+               if (!kvm->created_vcpus &&
+                   (!enable_mediated_pmu || !kvm->arch.enable_pmu)) {
+                       if (enable_mediated_pmu &&
+                           !(cap->args[0] & KVM_PMU_CAP_DISABLE))
+                               r = x86_perf_get_mediated_pmu();
+                       else
+                               r = 0;
+
+                       if (!r)
+                               kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
                }
                mutex_unlock(&kvm->lock);
                break;
@@ -12563,7 +12575,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
        kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
        kvm->arch.guest_can_read_msr_platform_info = true;
-       kvm->arch.enable_pmu = enable_pmu;
+
+       /* PMU virtualization is opt-in when mediated PMU support is enabled. */
+       kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;
 
 #if IS_ENABLED(CONFIG_HYPERV)
        spin_lock_init(&kvm->arch.hv_root_tdp_lock);


^ permalink raw reply related	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-11 17:03 ` [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Sean Christopherson
  2024-04-12  2:19   ` Zhang, Xiong Y
@ 2024-04-18 20:46   ` Mingwei Zhang
  2024-04-18 21:52     ` Mingwei Zhang
  2024-04-19 19:14     ` Sean Christopherson
  1 sibling, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-18 20:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 11, 2024, Sean Christopherson wrote:
> <bikeshed>
> 
> I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> emulates the control plane (controls and event selectors), while the data is
> fully passed through (counters).
> 
> </bikeshed>
Sean,

I feel "mediated PMU" seems to be a little bit off the ..., no? In
KVM, almost all features are mediated. In our specific case, the
legacy PMU is mediated by KVM and the perf subsystem on the host. In the new
design, it is mediated by KVM only.

We intercept the control plane in the current design, but the only thing
we do is event filtering. There is no fancy code change to emulate the control
registers. So, it is still passthrough logic.

In some (rare) business cases, I think maybe we could fully pass through
the control plane as well. For instance, a sole-tenant machine, or a
full-machine VM + full offload. If there is a CPU erratum, KVM
can force VM-exits and dynamically intercept the selectors on all vCPUs
with filters checked. It is not supported in the current RFC, but maybe
doable in later versions.

With the above, I wonder if we can still use "passthrough PMU" for
simplicity? But I have no strong opinion if you really want to keep this name.
I would have to take some time to convince myself.

Thanks.
-Mingwei
> 
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> 
> > 1. host system wide / QEMU events handling during VM running
> >    At VM-entry, all the host perf events which use host x86 PMU will be
> >    stopped. These events with attr.exclude_guest = 1 will be stopped here
> >    and re-started after vm-exit. These events without attr.exclude_guest=1
> >    will be in error state, and they cannot recovery into active state even
> >    if the guest stops running. This impacts host perf a lot and request
> >    host system wide perf events have attr.exclude_guest=1.
> > 
> >    This requests QEMU Process's perf event with attr.exclude_guest=1 also.
> > 
> >    During VM running, perf event creation for system wide and QEMU
> >    process without attr.exclude_guest=1 fail with -EBUSY. 
> > 
> > 2. NMI watchdog
> >    the perf event for NMI watchdog is a system wide cpu pinned event, it
> >    will be stopped also during vm running, but it doesn't have
> >    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> >    watchdog loses function during VM running.
> > 
> >    Two candidates exist for replacing perf event of NMI watchdog:
> >    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
> >    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
> 
> I think the simplest solution is to allow mediated PMU usage if and only if
> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> problem to solve.
> 
> > 3. Dedicated kvm_pmi_vector
> >    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> >    PMI into guest when physical PMI belongs to guest counter. If the
> >    same mechanism is used in passthrough vPMU and PMI skid exists
> >    which cause physical PMI belonging to guest happens after VM-exit,
> >    then the host PMI handler couldn't identify this PMI belongs to
> >    host or guest.
> >    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> >    has this vector only. The PMI belonging to host still has an NMI
> >    vector.
> > 
> >    Without considering PMI skid especially for AMD, the host NMI vector
> >    could be used for guest PMI also, this method is simpler and doesn't
> 
> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> the complexity is a wash, just in different locations, and I highly doubt it's
> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> LVTPC.
> 
> E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> 
> >    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
> >    didn't meet the skid PMI issue on modern Intel processors.
> > 
> > 4. per-VM passthrough mode configuration
> >    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
> >    it decides vPMU is passthrough mode or emulated mode at kvm module
> >    load time.
> >    Do we need the capability of per-VM passthrough mode configuration?
> >    So an admin can launch some non-passthrough VM and profile these
> >    non-passthrough VMs in host, but admin still cannot profile all
> >    the VMs once passthrough VM existence. This means passthrough vPMU
> >    and emulated vPMU mix on one platform, it has challenges to implement.
> >    As the commit message in commit 0011, the main challenge is 
> >    passthrough vPMU and emulated vPMU have different vPMU features, this
> >    ends up with two different values for kvm_cap.supported_perf_cap, which
> >    is initialized at module load time. To support it, more refactor is
> >    needed.
> 
> I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> existing vPMU support entirely, but that's probably not be realistic, at least not
> in the near future.
> 
> > Remain Works
> > ===
> > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> 
> Before this gets out of its "RFC" phase, I would at least like line of sight to
> a more optimized switch.  I 100% agree that starting with a conservative
> implementation is the way to go, and the kernel absolutely needs to be able to
> profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> 
> But I also don't want to get into a situation where can't figure out a clean,
> robust way to do the optimized context switch without needing (another) massive
> rewrite.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15 17:38                 ` Sean Christopherson
  2024-04-15 17:54                   ` Mingwei Zhang
@ 2024-04-18 21:21                   ` Mingwei Zhang
  2024-04-18 21:41                     ` Mingwei Zhang
  2024-04-19  1:02                     ` Mi, Dapeng
  1 sibling, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-18 21:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dapeng Mi, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Mon, Apr 15, 2024, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > > >>>>
> > > >>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
> > > >>>>
> > > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > > >>> values to the host? NO. Not in cloud usage.
> > > >> No, this place is clearing the guest counter value instead of host
> > > >> counter value. Host always has method to see guest value in a normal VM
> > > >> if he want. I don't see its necessity, it's just a overkill and
> > > >> introduce extra overhead to write MSRs.
> > > >>
> > > > I am curious how the perf subsystem solves the problem? Does perf
> > > > subsystem in the host only scrubbing the selector but not the counter
> > > > value when doing the context switch?
> > >
> > > When context switch happens, perf code would schedule out the old events
> > > and schedule in the new events. When scheduling out, the ENABLE bit of
> > > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > > and PMCx MSRs would be overwritten with new event's attr.config and
> > > sample_period separately.  Of course, this is only for the case when
> > > there are new events to be programmed on the PMC. If there are no new events, the
> > > PMCx MSR would keep its stale value and won't be cleared.
> > >
> > > Anyway, I don't see any reason that PMCx MSR must be cleared.
> > >
> > 
> > I don't have a strong opinion on the upstream version. But since both
> > the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> > uncleared when transition out of the vPMU boundary is leaking info
> > technically.
> 
> I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
> in the guest's TCB, which is what I would consider a true leak.  I'm objecting
> to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
> PMCs when saving guest state without coordinating with perf in any way.

Agree. Blindly clearing PMCs is the basic implementation. I am also thinking
about what coordination between perf and KVM would look like.

> 
> I am ok if we start with (or default to) a "safe" implementation that zeroes all
> PMCs when switching to host context, but I want KVM and perf to work together to
> do the context switches, e.g. so that we don't end up with code where KVM writes
> to all PMC MSRs and that perf also immediately writes to all PMC MSRs.

Sure. Point taken.
> 
> One my biggest complaints with the current vPMU code is that the roles and
> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> and hard to maintain code.

Right.
> 
> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> state to userspace processes that have RDPMC permissions, as the PMCs might not
> be dirty from perf's perspective (see perf_clear_dirty_counters()).
> 

ah. This is a good point.

		switch_mm_irqs_off() =>
		cr4_update_pce_mm() =>
		/*
		 * Clear the existing dirty counters to
		 * prevent the leak for an RDPMC task.
		 */
		perf_clear_dirty_counters()

So perf does clear dirty counter values on process context switch. This
is nice to know.

perf_clear_dirty_counters() clears the counter values according to
cpuc->dirty, except for the currently assigned counters.
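
(For reference, roughly what perf_clear_dirty_counters() does, paraphrased
and simplified -- not the exact upstream code:)

static void clear_dirty_counters_paraphrase(struct cpu_hw_events *cpuc)
{
	int i;

	/* Counters still assigned to live events are not touched. */
	for (i = 0; i < cpuc->n_events; i++)
		__clear_bit(cpuc->assign[i], cpuc->dirty);

	/* Zero whatever is left dirty (fixed counters handled similarly). */
	for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX)
		wrmsrl(x86_pmu_event_addr(i), 0);

	bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
}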

> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> or if KVM is just being paranoid.

There is a difference between KVM and perf subsystem on PMU context
switch. The latter has the notion of "perf_events", while the former
currently does not. It is quite hard for KVM to know which counters are
really "in use".

Another point I want to raise with you is that the KVM PMU context switch
and the perf PMU context switch happen at different times:

 - The former is a context switch between guest/host state of the same
   process, happening at VM-enter/exit boundary.
 - The latter is a context switch between two host-level processes.
 - The former happens before the latter.
 - Current design has no PMC partitioning between host/guest due to
   arch limitation.

From the above, I feel that it might be impossible to combine them or to
add coordination? Unless we do the KVM PMU context switch at vcpu loop
boundary...

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-18 21:21                   ` Mingwei Zhang
@ 2024-04-18 21:41                     ` Mingwei Zhang
  2024-04-19  1:02                     ` Mi, Dapeng
  1 sibling, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-18 21:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dapeng Mi, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024, Sean Christopherson wrote:
> > On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > > On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > > > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > > > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > > > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > > > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > > > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > > > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > > > >>>>
> > > > >>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
> > > > >>>>
> > > > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > > > >>> values to the host? NO. Not in cloud usage.
> > > > >> No, this place is clearing the guest counter value instead of host
> > > > >> counter value. Host always has method to see guest value in a normal VM
> > > > >> if he want. I don't see its necessity, it's just a overkill and
> > > > >> introduce extra overhead to write MSRs.
> > > > >>
> > > > > I am curious how the perf subsystem solves the problem? Does perf
> > > > > subsystem in the host only scrubbing the selector but not the counter
> > > > > value when doing the context switch?
> > > >
> > > > When context switch happens, perf code would schedule out the old events
> > > > and schedule in the new events. When scheduling out, the ENABLE bit of
> > > > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > > > and PMCx MSRs would be overwritten with new event's attr.config and
> > > > sample_period separately.  Of course, this is only for the case when
> > > > there are new events to be programmed on the PMC. If there are no new events, the
> > > > PMCx MSR would keep its stale value and won't be cleared.
> > > >
> > > > Anyway, I don't see any reason that PMCx MSR must be cleared.
> > > >
> > > 
> > > I don't have a strong opinion on the upstream version. But since both
> > > the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> > > uncleared when transition out of the vPMU boundary is leaking info
> > > technically.
> > 
> > I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
> > in the guest's TCB, which is what I would consider a true leak.  I'm objecting
> > to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
> > PMCs when saving guest state without coordinating with perf in any way.
> 
> Agree. blindly clearing PMCs is the basic implementation. I am thinking
> about what coordination between perf and KVM as well.
> 
> > 
> > I am ok if we start with (or default to) a "safe" implementation that zeroes all
> > PMCs when switching to host context, but I want KVM and perf to work together to
> > do the context switches, e.g. so that we don't end up with code where KVM writes
> > to all PMC MSRs and that perf also immediately writes to all PMC MSRs.
> 
> Sure. Point taken.
> > 
> > One my biggest complaints with the current vPMU code is that the roles and
> > responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> > and hard to maintain code.
> 
> Right.
> > 
> > Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> > state to userspace processes that have RDPMC permissions, as the PMCs might not
> > be dirty from perf's perspective (see perf_clear_dirty_counters()).
> > 
> 
> ah. This is a good point.
> 
> 		switch_mm_irqs_off() =>
> 		cr4_update_pce_mm() =>
> 		/*
> 		 * Clear the existing dirty counters to
> 		 * prevent the leak for an RDPMC task.
> 		 */

FYI, to elaborate on "an RDPMC task":

When you run "echo 2 > /sys/devices/cpu/rdpmc", the kernel sets CR4.PCE to 1.

Once that is done, the rdpmc instruction is no longer a privileged
instruction; all tasks are allowed to execute it in userspace.
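
(As a small illustration -- a user-space sketch; counter index 0 is arbitrary,
and the read faults with SIGSEGV if CR4.PCE is not set for the task:)

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdpmc(uint32_t idx)
{
	uint32_t lo, hi;

	/* ECX selects the counter; the value comes back in EDX:EAX. */
	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	printf("PMC0 = %llu\n", (unsigned long long)rdpmc(0));
	return 0;
}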

Thanks.
-Mingwei
> 		perf_clear_dirty_counters()
> 
> So perf does clear dirty counter values on process context switch. This
> is nice to know.
> 
> perf_clear_dirty_counters() clear the counter values according to
> cpuc->dirty except for those assigned counters.
> 
> > Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> > overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> > or if KVM is just being paranoid.
> 
> There is a difference between KVM and perf subsystem on PMU context
> switch. The latter has the notion of "perf_events", while the former
> currently does not. It is quite hard for KVM to know which counters are
> really "in use".
> 
> Another point I want to raise up to you is that, KVM PMU context switch
> and Perf PMU context switch happens at different timing:
> 
>  - The former is a context switch between guest/host state of the same
>    process, happening at VM-enter/exit boundary.
>  - The latter is a context switch between two host-level processes.
>  - The former happens before the latter.
>  - Current design has no PMC partitioning between host/guest due to
>    arch limitation.
> 
> From the above, I feel that it might be impossible to combine them or to
> add coordination? Unless we do the KVM PMU context switch at vcpu loop
> boundary...
> 
> Thanks.
> -Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-18 20:46   ` Mingwei Zhang
@ 2024-04-18 21:52     ` Mingwei Zhang
  2024-04-19 19:14     ` Sean Christopherson
  1 sibling, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-18 21:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > <bikeshed>
> > 
> > I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> > emulates the control plane (controls and event selectors), while the data is
> > fully passed through (counters).
> > 
> > </bikeshed>
> Sean,
> 
> I feel "mediated PMU" seems to be a little bit off the ..., no? In
> KVM, almost all of features are mediated. In our specific case, the
> legacy PMU is mediated by KVM and perf subsystem on the host. In new
> design, it is mediated by KVM only.
> 
> We intercept the control plan in current design, but the only thing
> we do is the event filtering. No fancy code change to emulate the control
> registers. So, it is still a passthrough logic.
> 
> In some (rare) business cases, I think maybe we could fully passthrough
> the control plan as well. For instance, sole-tenant machine, or
> full-machine VM + full offload. In case if there is a cpu errata, KVM
> can force vmexit and dynamically intercept the selectors on all vcpus
> with filters checked. It is not supported in current RFC, but maybe
> doable in later versions.
> 
> With the above, I wonder if we can still use passthrough PMU for
> simplicity? But no strong opinion if you really want to keep this name.
> I would have to take some time to convince myself.
> 

One proposal: maybe "direct vPMU"? I think there are many words that
focus on the "passthrough" side but not on the "interception/mediation"
side?

> Thanks.
> -Mingwei
> > 
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > 
> > > 1. host system wide / QEMU events handling during VM running
> > >    At VM-entry, all the host perf events which use host x86 PMU will be
> > >    stopped. These events with attr.exclude_guest = 1 will be stopped here
> > >    and re-started after vm-exit. These events without attr.exclude_guest=1
> > >    will be in error state, and they cannot recovery into active state even
> > >    if the guest stops running. This impacts host perf a lot and request
> > >    host system wide perf events have attr.exclude_guest=1.
> > > 
> > >    This requests QEMU Process's perf event with attr.exclude_guest=1 also.
> > > 
> > >    During VM running, perf event creation for system wide and QEMU
> > >    process without attr.exclude_guest=1 fail with -EBUSY. 
> > > 
> > > 2. NMI watchdog
> > >    the perf event for NMI watchdog is a system wide cpu pinned event, it
> > >    will be stopped also during vm running, but it doesn't have
> > >    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> > >    watchdog loses function during VM running.
> > > 
> > >    Two candidates exist for replacing perf event of NMI watchdog:
> > >    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
> > >    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
> > 
> > I think the simplest solution is to allow mediated PMU usage if and only if
> > the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> > problem to solve.
> > 
> > > 3. Dedicated kvm_pmi_vector
> > >    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> > >    PMI into guest when physical PMI belongs to guest counter. If the
> > >    same mechanism is used in passthrough vPMU and PMI skid exists
> > >    which cause physical PMI belonging to guest happens after VM-exit,
> > >    then the host PMI handler couldn't identify this PMI belongs to
> > >    host or guest.
> > >    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> > >    has this vector only. The PMI belonging to host still has an NMI
> > >    vector.
> > > 
> > >    Without considering PMI skid especially for AMD, the host NMI vector
> > >    could be used for guest PMI also, this method is simpler and doesn't
> > 
> > I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> > the complexity is a wash, just in different locations, and I highly doubt it's
> > a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> > LVTPC.
> > 
> > E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> > SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> > 
> > >    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
> > >    didn't meet the skid PMI issue on modern Intel processors.
> > > 
> > > 4. per-VM passthrough mode configuration
> > >    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
> > >    it decides vPMU is passthrough mode or emulated mode at kvm module
> > >    load time.
> > >    Do we need the capability of per-VM passthrough mode configuration?
> > >    So an admin can launch some non-passthrough VM and profile these
> > >    non-passthrough VMs in host, but admin still cannot profile all
> > >    the VMs once passthrough VM existence. This means passthrough vPMU
> > >    and emulated vPMU mix on one platform, it has challenges to implement.
> > >    As the commit message in commit 0011, the main challenge is 
> > >    passthrough vPMU and emulated vPMU have different vPMU features, this
> > >    ends up with two different values for kvm_cap.supported_perf_cap, which
> > >    is initialized at module load time. To support it, more refactor is
> > >    needed.
> > 
> > I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> > existing vPMU support entirely, but that's probably not be realistic, at least not
> > in the near future.
> > 
> > > Remain Works
> > > ===
> > > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> > 
> > Before this gets out of its "RFC" phase, I would at least like line of sight to
> > a more optimized switch.  I 100% agree that starting with a conservative
> > implementation is the way to go, and the kernel absolutely needs to be able to
> > profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> > guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> > 
> > But I also don't want to get into a situation where can't figure out a clean,
> > robust way to do the optimized context switch without needing (another) massive
> > rewrite.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU
  2024-04-11 23:18   ` Sean Christopherson
@ 2024-04-18 21:54     ` Mingwei Zhang
  0 siblings, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-18 21:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 11, 2024, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > From: Mingwei Zhang <mizhang@google.com>
> > 
> > Separate passthrough PMU logic from non-passthrough vPMU code. There are
> > two places in passthrough vPMU when set/get_msr() may call into the
> > existing non-passthrough vPMU code: 1) set/get counters; 2) set global_ctrl
> > MSR.
> > 
> > In the former case, non-passthrough vPMU will call into
> > pmc_{read,write}_counter() which wires to the perf API. Update these
> > functions to avoid the perf API invocation.
> > 
> > The 2nd case is where global_ctrl MSR writes invokes reprogram_counters()
> > which will invokes the non-passthrough PMU logic. So use pmu->passthrough
> > flag to wrap out the call.
> > 
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/pmu.c |  4 +++-
> >  arch/x86/kvm/pmu.h | 10 +++++++++-
> >  2 files changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 9e62e96fe48a..de653a67ba93 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -652,7 +652,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  		if (pmu->global_ctrl != data) {
> >  			diff = pmu->global_ctrl ^ data;
> >  			pmu->global_ctrl = data;
> > -			reprogram_counters(pmu, diff);
> > +			/* Passthrough vPMU never reprogram counters. */
> > +			if (!pmu->passthrough)
> 
> This should probably be handled in reprogram_counters(), otherwise we'll be
> playing whack-a-mole, e.g. this misses MSR_IA32_PEBS_ENABLE, which benign, but
> only because PEBS isn't yet supported.
> 
> > +				reprogram_counters(pmu, diff);
> >  		}
> >  		break;
> >  	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > index 0fc37a06fe48..ab8d4a8e58a8 100644
> > --- a/arch/x86/kvm/pmu.h
> > +++ b/arch/x86/kvm/pmu.h
> > @@ -70,6 +70,9 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> >  	u64 counter, enabled, running;
> >  
> >  	counter = pmc->counter;
> > +	if (pmc_to_pmu(pmc)->passthrough)
> > +		return counter & pmc_bitmask(pmc);
> 
> Won't perf_event always be NULL for mediated counters?  I.e. this can be dropped,
> I think.

Yeah. I double checked, and it seems that when perf_event == NULL, the logic is
correct. If so, we can drop that.
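
(For reference, a rough sketch of the helper with the passthrough check
dropped, based on the existing pmc_read_counter(); if pmc->perf_event is
always NULL for mediated counters, the perf branch is simply never taken:)

static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
{
	u64 counter, enabled, running;

	counter = pmc->counter;
	/* Never taken for mediated counters, which have no perf_event. */
	if (pmc->perf_event && !pmc->is_paused)
		counter += perf_event_read_value(pmc->perf_event,
						 &enabled, &running);

	return counter & pmc_bitmask(pmc);
}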

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-18 21:21                   ` Mingwei Zhang
  2024-04-18 21:41                     ` Mingwei Zhang
@ 2024-04-19  1:02                     ` Mi, Dapeng
  1 sibling, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-19  1:02 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, jmattson, kvm,
	linux-perf-users, linux-kernel, zhiyuan.lv, eranian, irogers,
	samantha.alt, like.xu.linux, chao.gao


On 4/19/2024 5:21 AM, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024, Sean Christopherson wrote:
>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>> On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
>>>>> On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
>>>>>>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
>>>>>>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
>>>>>>>> unexpectedly on host later since Host perf always enable all validate bits
>>>>>>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>>>>>>>>
>>>>>>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
>>>>>>>>
>>>>>>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
>>>>>>> values to the host? NO. Not in cloud usage.
>>>>>> No, this place is clearing the guest counter value instead of host
>>>>>> counter value. Host always has method to see guest value in a normal VM
>>>>>> if he want. I don't see its necessity, it's just a overkill and
>>>>>> introduce extra overhead to write MSRs.
>>>>>>
>>>>> I am curious how the perf subsystem solves the problem? Does perf
>>>>> subsystem in the host only scrubbing the selector but not the counter
>>>>> value when doing the context switch?
>>>> When context switch happens, perf code would schedule out the old events
>>>> and schedule in the new events. When scheduling out, the ENABLE bit of
>>>> EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
>>>> and PMCx MSRs would be overwritten with new event's attr.config and
>>>> sample_period separately.  Of course, this is only for the case when
>>>> there are new events to be programmed on the PMC. If there are no new events, the
>>>> PMCx MSR would keep its stale value and won't be cleared.
>>>>
>>>> Anyway, I don't see any reason that PMCx MSR must be cleared.
>>>>
>>> I don't have a strong opinion on the upstream version. But since both
>>> the mediated vPMU and perf are clients of PMU HW, leaving PMC values
>>> uncleared when transition out of the vPMU boundary is leaking info
>>> technically.
>> I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
>> in the guest's TCB, which is what I would consider a true leak.  I'm objecting
>> to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
>> PMCs when saving guest state without coordinating with perf in any way.
> Agree. blindly clearing PMCs is the basic implementation. I am thinking
> about what coordination between perf and KVM as well.
>
>> I am ok if we start with (or default to) a "safe" implementation that zeroes all
>> PMCs when switching to host context, but I want KVM and perf to work together to
>> do the context switches, e.g. so that we don't end up with code where KVM writes
>> to all PMC MSRs and that perf also immediately writes to all PMC MSRs.
> Sure. Point taken.
>> One my biggest complaints with the current vPMU code is that the roles and
>> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
>> and hard to maintain code.
> Right.
>> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
>> state to userspace processes that have RDPMC permissions, as the PMCs might not
>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>
> ah. This is a good point.
>
> 		switch_mm_irqs_off() =>
> 		cr4_update_pce_mm() =>
> 		/*
> 		 * Clear the existing dirty counters to
> 		 * prevent the leak for an RDPMC task.
> 		 */
> 		perf_clear_dirty_counters()
>
> So perf does clear dirty counter values on process context switch. This
> is nice to know.
>
> perf_clear_dirty_counters() clear the counter values according to
> cpuc->dirty except for those assigned counters.
>
>> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
>> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
>> or if KVM is just being paranoid.
> There is a difference between KVM and perf subsystem on PMU context
> switch. The latter has the notion of "perf_events", while the former
> currently does not. It is quite hard for KVM to know which counters are
> really "in use".
>
> Another point I want to raise up to you is that, KVM PMU context switch
> and Perf PMU context switch happens at different timing:
>
>   - The former is a context switch between guest/host state of the same
>     process, happening at VM-enter/exit boundary.
>   - The latter is a context switch between two host-level processes.
>   - The former happens before the latter.
>   - Current design has no PMC partitioning between host/guest due to
>     arch limitation.
>
>  From the above, I feel that it might be impossible to combine them or to
> add coordination? Unless we do the KVM PMU context switch at vcpu loop
> boundary...

It seems there are two ways to clear the PMCx MSRs.

a) KVM clears the PMCx MSRs of the counters used only by the guest at
VM-exit, when saving the guest MSRs. This can be seen as a portion of the
next-step optimization in which only guest-used MSRs are saved and restored.
It would avoid saving/restoring unnecessary MSRs and reduce the performance
impact.

b) The perf subsystem clears these guest-used MSRs, but perf doesn't know
which MSRs were touched by the guest, so KVM would have to provide an API to
pass that information to perf. Another issue is that perf does the clearing
at task context switch, which may be too late: a user can read the guest
counter value via the rdpmc instruction as long as the vCPU process is
allowed to use rdpmc from user space.

We had a rough internal talk on this. Option a) seems to be the better one;
it looks simpler and has a clearer boundary. We would implement it together
with the optimization mentioned in a).
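
A very rough sketch of what option a) could look like on the guest->host PMU
switch (the guest_used_pmcs bitmap is an assumed field for illustration, not
existing KVM state; fixed counters would be handled the same way):

static void save_and_clear_guest_gp_counters(struct kvm_pmu *pmu)
{
	int i;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		if (!test_bit(i, pmu->guest_used_pmcs))	/* assumed bitmap */
			continue;
		/* Save the guest value, then zero the counter so it can't
		 * leak to a host RDPMC user. */
		rdmsrl(MSR_IA32_PMC0 + i, pmu->gp_counters[i].counter);
		wrmsrl(MSR_IA32_PMC0 + i, 0);
	}
}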


>
> Thanks.
> -Mingwei
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-18 16:11                   ` Sean Christopherson
@ 2024-04-19  1:37                     ` Zhang, Xiong Y
  0 siblings, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-19  1:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kan Liang, pbonzini, peterz, mizhang, kan.liang, zhenyuw,
	dapeng1.mi, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 4/19/2024 12:11 AM, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Xiong Y Zhang wrote:
>> On 4/16/2024 8:48 PM, Liang, Kan wrote:
>>>> x86_perf_get_mediated_pmu() is called at vm_create(),
>>>> x86_perf_put_mediated_pmu() is called at vm_destroy(), then system wide
>>>> perf events without exclude_guest=1 can not be created during the whole vm
>>>> life cycle (where nr_mediated_pmu_vms > 0 always), do I understand and use
>>>> the interface correctly ?
>>>
>>> Right, but it only impacts the events of PMU with the
>>> PERF_PMU_CAP_MEDIATED_VPMU.  For other PMUs, the event with exclude_guest=1
>>> can still be created.  KVM should not touch the counters of the PMU without
>>> PERF_PMU_CAP_MEDIATED_VPMU.
>>>
>>> BTW: I will also remove the prefix x86, since the functions are in the
>>> generic code.
>>>
>>> Thanks,
>>> Kan
>> After userspace VMM call VCPU SET_CPUID() ioctl, KVM knows whether vPMU is
>> enabled or not. If perf_get_mediated_pmu() is called at vm create, it is too
>> early.
> 
> Eh, if someone wants to create _only_ VMs without vPMUs, then they should load
> KVM with enable_pmu=false.  I can see people complaining about not being able to
> create VMs if they don't want to use have *any* vPMU usage, but I doubt anyone
> has a use cases where they want a mix of PMU-enabled and PMU- disabled VMs, *and*
> they are ok with VM creation failing for some VMs but not others.
enable_mediated_pmu and the PMU-based NMI watchdog are enabled by default on my Ubuntu system, and some Ubuntu services create VMs during bootup. Those services fail after I add perf_get_mediated_pmu() to kvm_arch_init_vm(), so doing this check at VM creation may break some bootup services.
  
> 
>> it is better to let perf_get_mediated_pmu() track per cpu PMU state,
>> so perf_get_mediated_pmu() can be called by kvm after vcpu_cpuid_set(). Note
>> user space vmm may call SET_CPUID() on one vcpu multi times, then here
>> refcount maybe isn't suitable. 
> 
> Yeah, waiting until KVM_SET_CPUID2 would be unpleasant for both KVM and userspace.
> E.g. failing KVM_SET_CPUID2 because KVM can't enable mediated PMU support would
> be rather confusing for userspace.
> 
>> what's a better solution ?
> 
> If doing the checks at VM creation is a stick point for folks, then the best
> approach is probably to use KVM_CAP_PMU_CAPABILITY, i.e. require userspace to
> explicitly opt-in to enabling mediated PMU usage.  Ha!  We can even do that
> without additional uAPI, because KVM interprets cap->args[0]==0 as "enable vPMU".
> 
QEMU doesn't use KVM_CAP_PMU_CAPABILITY to enable/disable the PMU; enable_cap(KVM_CAP_PMU_CAPABILITY) would have to be added to QEMU for the mediated PMU.
With an old QEMU, the guest will always use the emulated vPMU and the mediated PMU won't be enabled; if the emulated vPMU is replaced later, guests running under the old QEMU will be broken.
> The big problem with this is that enabling mediated PMU support by default would
> break userspace.  Hmm, but that's arguably the case no matter what, as a setup
> that worked before would suddenly start failing if the host was configured to use
> the PMU-based NMI watchdog.
Based on the perf_get_mediated_pmu() interface, the admin needs to disable all system-wide perf events before VM creation, no matter whether perf_get_mediated_pmu() is called in vm_create() or in the enable_cap(KVM_CAP_PMU_CAPABILITY) ioctl.
> 
> E.g. this, if we're ok committing to never enabling mediated PMU by default.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..01d9ee2114c8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6664,9 +6664,21 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>                         break;
>  
>                 mutex_lock(&kvm->lock);
> -               if (!kvm->created_vcpus) {
> -                       kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
> -                       r = 0;
> +               /*
> +                * To keep PMU configuration "simple", setting vPMU support is
> +                * disallowed if vCPUs are created, or if mediated PMU support
> +                * was already enabled for the VM.
> +                */
> +               if (!kvm->created_vcpus &&
> +                   (!enable_mediated_pmu || !kvm->arch.enable_pmu)) {
> +                       if (enable_mediated_pmu &&
> +                           !(cap->args[0] & KVM_PMU_CAP_DISABLE))
> +                               r = x86_perf_get_mediated_pmu();
> +                       else
> +                               r = 0;
> +
> +                       if (!r)
> +                               kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
>                 }
>                 mutex_unlock(&kvm->lock);
>                 break;
> @@ -12563,7 +12575,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>         kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
>         kvm->arch.guest_can_read_msr_platform_info = true;
> -       kvm->arch.enable_pmu = enable_pmu;
> +
> +       /* PMU virtualization is opt-in when mediated PMU support is enabled. */
> +       kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
>         spin_lock_init(&kvm->arch.hv_root_tdp_lock);
> 
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-18 20:46   ` Mingwei Zhang
  2024-04-18 21:52     ` Mingwei Zhang
@ 2024-04-19 19:14     ` Sean Christopherson
  2024-04-19 22:02       ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-19 19:14 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > <bikeshed>
> > 
> > I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> > emulates the control plane (controls and event selectors), while the data is
> > fully passed through (counters).
> > 
> > </bikeshed>
> Sean,
> 
> I feel "mediated PMU" seems to be a little bit off the ..., no? In
> KVM, almost all of features are mediated. In our specific case, the
> legacy PMU is mediated by KVM and perf subsystem on the host. In new
> design, it is mediated by KVM only.

Currently, at a feature level, I mentally bin things into two rough categories
in KVM:

 1. Virtualized - Guest state is loaded into hardware, or hardware supports
                  running with both host and guest state (e.g. TSC scaling), and
                  the guest has full read/write access to its state while running.

 2. Emulated    - Guest state is never loaded into hardware, and instead the 
                  feature/state is emulated in software.

There is no "Passthrough" because that's (mostly) covered by my Virtualized
definition.   And because I also think of passthrough as being about *assets*,
not about the features themselves.
 
They are far from perfect definitions, e.g. individual assets can be passed through,
virtualized by hardware, or emulated in software.  But for the most part, I think
classifying features as virtualized vs. emulated works well, as it helps reason
about the expected behavior and performance of a feature.

E.g. for some virtualized features, certain assets may need to be explicitly passed
through, e.g. access to x2APIC MSRs for APICv.  But APICv itself still falls
into the virtualized category, e.g. the "real" APIC state isn't passed through
to the guest.

If KVM didn't already have a PMU implementation to deal with, this wouldn't be
an issue, e.g. we'd just add "enable_pmu" and I'd mentally bin it into the
virtualized category.  But we need to distinguish between the two PMU models,
and using "enable_virtualized_pmu" would be comically confusing for users. :-)

And because this is user visible, I would like to come up with a name that (some)
KVM users will already be familiar with, i.e. will have some chance of intuitively
understanding it without having to go read docs.

Which is why I proposed "mediated"; what we are proposing for the PMU is similar
to the "mediated device" concepts in VFIO.  And I also think "mediated" is a good
fit in general, e.g. this becomes my third classification:

 3. Mediated    - Guest is context switched at VM-Enter/VM-Exit, i.e. is loaded
                  into hardware, but the guest does NOT have full read/write access
                  to the feature.

But my main motivation for using "mediated" really is that I hope that it will
help KVM users grok the basic gist of the design without having to read and
understand KVM documentation, because there is already existing terminology in
the broader KVM space.

> We intercept the control plan in current design, but the only thing
> we do is the event filtering. No fancy code change to emulate the control
> registers. So, it is still a passthrough logic.

It's not though.  Passthrough very specifically means the guest has unfettered
access to some asset, and/or KVM does no filtering/adjustments whatsoever.

"Direct" is similar, e.g. KVM's uses "direct" in MMU context to refer to addresses
that don't require KVM to intervene and translate.  E.g. entire MMUs can be direct,
but individual shadow pages can also be direct (no corresponding guest PTE to
translate).

For this flavor of PMU, it's not full passthrough or direct.  Some assets are
passed through, e.g. PMCs, but others are not.  

> In some (rare) business cases, I think maybe we could fully passthrough
> the control plan as well. For instance, sole-tenant machine, or
> full-machine VM + full offload. In case if there is a cpu errata, KVM
> can force vmexit and dynamically intercept the selectors on all vcpus
> with filters checked. It is not supported in current RFC, but maybe
> doable in later versions.

Heh, that's an argument for using something other than "passthrough", because if
we ever do support such a use case, we'd end up with enable_fully_passthrough_pmu,
or in the spirit of KVM shortlogs, really_passthrough_pmu :-)

Though I think even then I would vote for "enable_dedicated_pmu", or something
along those lines, purely to avoid overloading "passthrough", i.e. to try to use
passhtrough strictly when talking about assets, not features.  And because unless
we can also passthrough LVTPC, it still wouldn't be a complete passthrough of the
PMU as KVM would be emulating PMIs.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM
  2024-04-19 19:14     ` Sean Christopherson
@ 2024-04-19 22:02       ` Mingwei Zhang
  0 siblings, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-19 22:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao

> Currently, at a feature level, I mentally bin things into two rough categories
> in KVM:
>
>  1. Virtualized - Guest state is loaded into hardware, or hardware supports
>                   running with both host and guest state (e.g. TSC scaling), and
>                   the guest has full read/write access to its state while running.
>
>  2. Emulated    - Guest state is never loaded into hardware, and instead the
>                   feature/state is emulated in software.
>
> There is no "Passthrough" because that's (mostly) covered by my Virtualized
> definition.   And because I also think of passthrough as being about *assets*,
> not about the features themselves.

Sure. In fact, "virtualized" works for me as well. My mind is aligned with this.

>
> They are far from perfect definitions, e.g. individual assets can be passed through,
> virtualized by hardware, or emulated in software.  But for the most part, I think
> classifying features as virtualized vs. emulated works well, as it helps reason
> about the expected behavior and performance of a feature.
>
> E.g. for some virtualized features, certain assets may need to be explicitly passed
> through, e.g. access to x2APIC MSRs for APICv.  But APICv itself still falls
> into the virtualized category, e.g. the "real" APIC state isn't passed through
> to the guest.
>
> If KVM didn't already have a PMU implementation to deal with, this wouldn't be
> an issue, e.g. we'd just add "enable_pmu" and I'd mentally bin it into the
> virtualized category.  But we need to distinguish between the two PMU models,
> and using "enable_virtualized_pmu" would be comically confusing for users. :-)
>
> And because this is user visible, I would like to come up with a name that (some)
> KVM users will already be familiar with, i.e. will have some chance of intuitively
> understand without having to go read docs.
>
> Which is why I proposed "mediated"; what we are proposing for the PMU is similar
> to the "mediated device" concepts in VFIO.  And I also think "mediated" is a good
> fit in general, e.g. this becomes my third classification:
>
>  3. Mediated    - Guest is context switched at VM-Enter/VM-Exit, i.e. is loaded
>                   into hardware, but the guest does NOT have full read/write access
>                   to the feature.
>
> But my main motiviation for using "mediated" really is that I hope that it will
> help KVM users grok the basic gist of the design without having to read and
> understand KVM documentation, because there is already existing terminology in
> the broader KVM space.

Understood on this part. "Mediated" captures the fact that KVM sits in between,
but I feel we can find a better name :)
>
> > We intercept the control plan in current design, but the only thing
> > we do is the event filtering. No fancy code change to emulate the control
> > registers. So, it is still a passthrough logic.
>
> It's not though.  Passthrough very specifically means the guest has unfettered
> access to some asset, and/or KVM does no filtering/adjustments whatsoever.
>
> "Direct" is similar, e.g. KVM's uses "direct" in MMU context to refer to addresses
> that don't require KVM to intervene and translate.  E.g. entire MMUs can be direct,
> but individual shadow pages can also be direct (no corresponding guest PTE to
> translate).

Oh, isn't "direct" a perfect word for this? Look, our new design does
not require KVM to translate the encodings into events and into
encoding again (in "perf subsystem") before entering HW. It is really
"direct" in this sense, no?

Neither does KVM do any translation of the event encodings across
micro-architectures. So, it is really _direct_ from this perspective
as well.

On the other hand, "direct" suggests a straightforward path and hints at
passthrough, but not always; KVM still retains the power of control.

>
> For this flavor of PMU, it's not full passthrough or direct.  Some assets are
> passed through, e.g. PMCs, but others are not.
>
> > In some (rare) business cases, I think maybe we could fully passthrough
> > the control plane as well. For instance, sole-tenant machine, or
> > full-machine VM + full offload. In case if there is a cpu errata, KVM
> > can force vmexit and dynamically intercept the selectors on all vcpus
> > with filters checked. It is not supported in current RFC, but maybe
> > doable in later versions.
>
> Heh, that's an argument for using something other than "passthrough", because if
> we ever do support such a use case, we'd end up with enable_fully_passthrough_pmu,
> or in the spirit of KVM shortlogs, really_passthrough_pmu :-)

Full passthrough is still possible, and names such as "really_passthrough"
could all live under the "direct PMU" umbrella.

>
> Though I think even then I would vote for "enable_dedicated_pmu", or something
> along those lines, purely to avoid overloading "passthrough", i.e. to try to use
> passthrough strictly when talking about assets, not features.  And because unless
> we can also passthrough LVTPC, it still wouldn't be a complete passthrough of the
> PMU as KVM would be emulating PMIs.

I agree we should avoid "passthrough". "Dedicated" is also a fine word; it
indicates the PMU is dedicated to serving the KVM guests, though its scope
might be a little narrow. This is just my opinion; maybe it is because my
mind has been stuck on "direct" :)

Thanks.

-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-15 22:45                     ` Sean Christopherson
@ 2024-04-22  2:14                       ` maobibo
  2024-04-22 17:01                         ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-22  2:14 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Dapeng Mi, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024/4/16 6:45 AM, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <seanjc@google.com> wrote:
>>> One my biggest complaints with the current vPMU code is that the roles and
>>> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
>>> and hard to maintain code.
>>>
>>> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
>>> state to userspace processes that have RDPMC permissions, as the PMCs might not
>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>
>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
>>> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
>>> or if KVM is just being paranoid.
>>
>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
> 
> I don't think this is a statement we want to make, as it opens a discussion
> that we won't win.  Nor do I think it's one we *need* to make.  KVM doesn't need
> to be on equal footing with perf in terms of owning/managing PMU hardware, KVM
> just needs a few APIs to allow faithfully and accurately virtualizing a guest PMU.
> 
>> Faithful cleaning (blind cleaning) has to be the baseline
>> implementation, until both clients agree to a "deal" between them.
>> Currently, there is no such deal, but I believe we could have one via
>> future discussion.
> 
> What I am saying is that there needs to be a "deal" in place before this code
> is merged.  It doesn't need to be anything fancy, e.g. perf can still pave over
> PMCs it doesn't immediately load, as opposed to using cpu_hw_events.dirty to lazily
> do the clearing.  But perf and KVM need to work together from the get go, ie. I
> don't want KVM doing something without regard to what perf does, and vice versa.
> 
There is a similar issue with the LoongArch vPMU, where the VM can directly
access PMU hardware and the PMU HW is shared between guest and host. Besides
the context switch, there are other places where the perf core will access
the PMU HW, such as the tick timer/hrtimer/IPI function-call interrupts, and
KVM can only intercept the context switch.

Can we add a callback handler to the structure kvm_guest_cbs, just like this:
@@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
         .state                  = kvm_guest_state,
         .get_ip                 = kvm_guest_get_ip,
         .handle_intel_pt_intr   = NULL,
+       .lose_pmu               = kvm_guest_lose_pmu,
  };

By the way, I do not know whether the callback handler should be triggered
in the perf core or in the specific PMU HW driver. Judging from the ARM PMU
HW driver, it is triggered in the PMU HW driver, e.g. in the function
kvm_vcpu_pmu_resync_el0, but I think it would be better if it were done in
the perf core.
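
To make this concrete, here is a very rough sketch of what the KVM side of
such a callback could look like. Everything below is illustrative only: the
.lose_pmu hook, kvm_guest_lose_pmu() and the KVM_REQ_PMU_HANDOVER request
are placeholders, not existing API; it assumes <linux/kvm_host.h>.

/*
 * Hypothetical handler for the proposed .lose_pmu callback: perf (or the
 * PMU HW driver) would call it when the host needs the PMU back while a
 * vCPU currently owns it.
 */
static void kvm_guest_lose_pmu(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (!vcpu)
		return;

	/* Ask the vCPU to hand the PMU back to host perf at the next exit. */
	kvm_make_request(KVM_REQ_PMU_HANDOVER, vcpu);
	kvm_vcpu_kick(vcpu);
}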

Regards
Bibo Mao


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-22  2:14                       ` maobibo
@ 2024-04-22 17:01                         ` Sean Christopherson
  2024-04-23  1:01                           ` maobibo
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-22 17:01 UTC (permalink / raw)
  To: maobibo
  Cc: Mingwei Zhang, Dapeng Mi, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Mon, Apr 22, 2024, maobibo wrote:
> On 2024/4/16 上午6:45, Sean Christopherson wrote:
> > On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > > On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > One my biggest complaints with the current vPMU code is that the roles and
> > > > responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> > > > and hard to maintain code.
> > > > 
> > > > Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> > > > state to userspace processes that have RDPMC permissions, as the PMCs might not
> > > > be dirty from perf's perspective (see perf_clear_dirty_counters()).
> > > > 
> > > > Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> > > > overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> > > > or if KVM is just being paranoid.
> > > 
> > > So once this rolls out, perf and vPMU are clients directly to PMU HW.
> > 
> > I don't think this is a statement we want to make, as it opens a discussion
> > that we won't win.  Nor do I think it's one we *need* to make.  KVM doesn't need
> > to be on equal footing with perf in terms of owning/managing PMU hardware, KVM
> > just needs a few APIs to allow faithfully and accurately virtualizing a guest PMU.
> > 
> > > Faithful cleaning (blind cleaning) has to be the baseline
> > > implementation, until both clients agree to a "deal" between them.
> > > Currently, there is no such deal, but I believe we could have one via
> > > future discussion.
> > 
> > What I am saying is that there needs to be a "deal" in place before this code
> > is merged.  It doesn't need to be anything fancy, e.g. perf can still pave over
> > PMCs it doesn't immediately load, as opposed to using cpu_hw_events.dirty to lazily
> > do the clearing.  But perf and KVM need to work together from the get go, ie. I
> > don't want KVM doing something without regard to what perf does, and vice versa.
> > 
> There is similar issue on LoongArch vPMU where vm can directly pmu hardware
> and pmu hw is shard with guest and host. Besides context switch there are
> other places where perf core will access pmu hw, such as tick
> timer/hrtimer/ipi function call, and KVM can only intercept context switch.

Two questions:

 1) Can KVM prevent the guest from accessing the PMU?

 2) If so, can KVM grant partial access to the PMU, or is it all or nothing?

If the answer to both questions is "yes", then it sounds like LoongArch *requires*
mediated/passthrough support in order to virtualize its PMU.

> Can we add callback handler in structure kvm_guest_cbs?  just like this:
> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs
> = {
>         .state                  = kvm_guest_state,
>         .get_ip                 = kvm_guest_get_ip,
>         .handle_intel_pt_intr   = NULL,
> +       .lose_pmu               = kvm_guest_lose_pmu,
>  };
> 
> By the way, I do not know should the callback handler be triggered in perf
> core or detailed pmu hw driver. From ARM pmu hw driver, it is triggered in
> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
> but I think it will be better if it is done in perf core.

I don't think we want to take the approach of perf and KVM guests "fighting" over
the PMU.  That's effectively what we have today, and it's a mess for KVM because
it's impossible to provide consistent, deterministic behavior for the guest.  And
it's just as messy for perf, which ends up having weird, cumbersome flows that
exist purely to try to play nice with KVM.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-22 17:01                         ` Sean Christopherson
@ 2024-04-23  1:01                           ` maobibo
  2024-04-23  2:44                             ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-23  1:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Dapeng Mi, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024/4/23 1:01 AM, Sean Christopherson wrote:
> On Mon, Apr 22, 2024, maobibo wrote:
>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <seanjc@google.com> wrote:
>>>>> One my biggest complaints with the current vPMU code is that the roles and
>>>>> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
>>>>> and hard to maintain code.
>>>>>
>>>>> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
>>>>> state to userspace processes that have RDPMC permissions, as the PMCs might not
>>>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>>>
>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
>>>>> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
>>>>> or if KVM is just being paranoid.
>>>>
>>>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>>>
>>> I don't think this is a statement we want to make, as it opens a discussion
>>> that we won't win.  Nor do I think it's one we *need* to make.  KVM doesn't need
>>> to be on equal footing with perf in terms of owning/managing PMU hardware, KVM
>>> just needs a few APIs to allow faithfully and accurately virtualizing a guest PMU.
>>>
>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>> implementation, until both clients agree to a "deal" between them.
>>>> Currently, there is no such deal, but I believe we could have one via
>>>> future discussion.
>>>
>>> What I am saying is that there needs to be a "deal" in place before this code
>>> is merged.  It doesn't need to be anything fancy, e.g. perf can still pave over
>>> PMCs it doesn't immediately load, as opposed to using cpu_hw_events.dirty to lazily
>>> do the clearing.  But perf and KVM need to work together from the get go, ie. I
>>> don't want KVM doing something without regard to what perf does, and vice versa.
>>>
>> There is similar issue on LoongArch vPMU where vm can directly pmu hardware
>> and pmu hw is shard with guest and host. Besides context switch there are
>> other places where perf core will access pmu hw, such as tick
>> timer/hrtimer/ipi function call, and KVM can only intercept context switch.
> 
> Two questions:
> 
>   1) Can KVM prevent the guest from accessing the PMU?
> 
>   2) If so, KVM can grant partial access to the PMU, or is it all or nothing?
> 
> If the answer to both questions is "yes", then it sounds like LoongArch *requires*
> mediated/passthrough support in order to virtualize its PMU.

Hi Sean,

Thanks for your quick response.

Yes, KVM can prevent the guest from accessing the PMU and can grant partial
or full access to the PMU. However, once a PMU event is granted to the VM,
the host cannot access that PMU event at the same time; there must be a PMU
event switch if the host wants to use it.

> 
>> Can we add callback handler in structure kvm_guest_cbs?  just like this:
>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs
>> = {
>>          .state                  = kvm_guest_state,
>>          .get_ip                 = kvm_guest_get_ip,
>>          .handle_intel_pt_intr   = NULL,
>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>   };
>>
>> By the way, I do not know should the callback handler be triggered in perf
>> core or detailed pmu hw driver. From ARM pmu hw driver, it is triggered in
>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>> but I think it will be better if it is done in perf core.
> 
> I don't think we want to take the approach of perf and KVM guests "fighting" over
> the PMU.  That's effectively what we have today, and it's a mess for KVM because
> it's impossible to provide consistent, deterministic behavior for the guest.  And
> it's just as messy for perf, which ends up having wierd, cumbersome flows that
> exists purely to try to play nice with KVM.
With the existing perf core code, the PMU HW may be accessed by the host
from the tick timer interrupt or the IPI function-call interrupt while the
VM is running and the PMU has already been granted to the guest. KVM cannot
intercept host IPI/timer interrupts, so there is no PMU context switch and
there will be a problem.

Regards
Bibo Mao


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  1:01                           ` maobibo
@ 2024-04-23  2:44                             ` Mi, Dapeng
  2024-04-23  2:53                               ` maobibo
  0 siblings, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-23  2:44 UTC (permalink / raw)
  To: maobibo, Sean Christopherson
  Cc: Mingwei Zhang, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao


On 4/23/2024 9:01 AM, maobibo wrote:
>
>
> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>> On Mon, Apr 22, 2024, maobibo wrote:
>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson 
>>>>> <seanjc@google.com> wrote:
>>>>>> One my biggest complaints with the current vPMU code is that the 
>>>>>> roles and
>>>>>> responsibilities between KVM and perf are poorly defined, which 
>>>>>> leads to suboptimal
>>>>>> and hard to maintain code.
>>>>>>
>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs 
>>>>>> _would_ leak guest
>>>>>> state to userspace processes that have RDPMC permissions, as the 
>>>>>> PMCs might not
>>>>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>>>>
>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing 
>>>>>> so makes the
>>>>>> overall code brittle because it's not clear whether KVM _needs_ 
>>>>>> to clear PMCs,
>>>>>> or if KVM is just being paranoid.
>>>>>
>>>>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>>>>
>>>> I don't think this is a statement we want to make, as it opens a 
>>>> discussion
>>>> that we won't win.  Nor do I think it's one we *need* to make.  KVM 
>>>> doesn't need
>>>> to be on equal footing with perf in terms of owning/managing PMU 
>>>> hardware, KVM
>>>> just needs a few APIs to allow faithfully and accurately 
>>>> virtualizing a guest PMU.
>>>>
>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>> implementation, until both clients agree to a "deal" between them.
>>>>> Currently, there is no such deal, but I believe we could have one via
>>>>> future discussion.
>>>>
>>>> What I am saying is that there needs to be a "deal" in place before 
>>>> this code
>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can 
>>>> still pave over
>>>> PMCs it doesn't immediately load, as opposed to using 
>>>> cpu_hw_events.dirty to lazily
>>>> do the clearing.  But perf and KVM need to work together from the 
>>>> get go, ie. I
>>>> don't want KVM doing something without regard to what perf does, 
>>>> and vice versa.
>>>>
>>> There is similar issue on LoongArch vPMU where vm can directly pmu 
>>> hardware
>>> and pmu hw is shard with guest and host. Besides context switch 
>>> there are
>>> other places where perf core will access pmu hw, such as tick
>>> timer/hrtimer/ipi function call, and KVM can only intercept context 
>>> switch.
>>
>> Two questions:
>>
>>   1) Can KVM prevent the guest from accessing the PMU?
>>
>>   2) If so, KVM can grant partial access to the PMU, or is it all or 
>> nothing?
>>
>> If the answer to both questions is "yes", then it sounds like 
>> LoongArch *requires*
>> mediated/passthrough support in order to virtualize its PMU.
>
> Hi Sean,
>
> Thank for your quick response.
>
> yes, kvm can prevent guest from accessing the PMU and grant partial or 
> all to access to the PMU. Only that if one pmu event is granted to VM, 
> host can not access this pmu event again. There must be pmu event 
> switch if host want to.

A PMU event is a software entity which won't be shared. Did you mean that
if a PMU HW counter is granted to the VM, then the host can't access that
PMU HW counter, right?


>
>>
>>> Can we add callback handler in structure kvm_guest_cbs?  just like 
>>> this:
>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks 
>>> kvm_guest_cbs
>>> = {
>>>          .state                  = kvm_guest_state,
>>>          .get_ip                 = kvm_guest_get_ip,
>>>          .handle_intel_pt_intr   = NULL,
>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>   };
>>>
>>> By the way, I do not know should the callback handler be triggered 
>>> in perf
>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is 
>>> triggered in
>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>> but I think it will be better if it is done in perf core.
>>
>> I don't think we want to take the approach of perf and KVM guests 
>> "fighting" over
>> the PMU.  That's effectively what we have today, and it's a mess for 
>> KVM because
>> it's impossible to provide consistent, deterministic behavior for the 
>> guest.  And
>> it's just as messy for perf, which ends up having wierd, cumbersome 
>> flows that
>> exists purely to try to play nice with KVM.
> With existing pmu core code, in tick timer interrupt or IPI function 
> call interrupt pmu hw may be accessed by host when VM is running and 
> pmu is already granted to guest. KVM can not intercept host IPI/timer 
> interrupt, there is no pmu context switch, there will be problem.
>
> Regards
> Bibo Mao
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  2:44                             ` Mi, Dapeng
@ 2024-04-23  2:53                               ` maobibo
  2024-04-23  3:13                                 ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-23  2:53 UTC (permalink / raw)
  To: Mi, Dapeng, Sean Christopherson
  Cc: Mingwei Zhang, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024/4/23 10:44 AM, Mi, Dapeng wrote:
> 
> On 4/23/2024 9:01 AM, maobibo wrote:
>>
>>
>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson 
>>>>>> <seanjc@google.com> wrote:
>>>>>>> One my biggest complaints with the current vPMU code is that the 
>>>>>>> roles and
>>>>>>> responsibilities between KVM and perf are poorly defined, which 
>>>>>>> leads to suboptimal
>>>>>>> and hard to maintain code.
>>>>>>>
>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs 
>>>>>>> _would_ leak guest
>>>>>>> state to userspace processes that have RDPMC permissions, as the 
>>>>>>> PMCs might not
>>>>>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>>>>>
>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing 
>>>>>>> so makes the
>>>>>>> overall code brittle because it's not clear whether KVM _needs_ 
>>>>>>> to clear PMCs,
>>>>>>> or if KVM is just being paranoid.
>>>>>>
>>>>>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>>>>>
>>>>> I don't think this is a statement we want to make, as it opens a 
>>>>> discussion
>>>>> that we won't win.  Nor do I think it's one we *need* to make.  KVM 
>>>>> doesn't need
>>>>> to be on equal footing with perf in terms of owning/managing PMU 
>>>>> hardware, KVM
>>>>> just needs a few APIs to allow faithfully and accurately 
>>>>> virtualizing a guest PMU.
>>>>>
>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>> Currently, there is no such deal, but I believe we could have one via
>>>>>> future discussion.
>>>>>
>>>>> What I am saying is that there needs to be a "deal" in place before 
>>>>> this code
>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can 
>>>>> still pave over
>>>>> PMCs it doesn't immediately load, as opposed to using 
>>>>> cpu_hw_events.dirty to lazily
>>>>> do the clearing.  But perf and KVM need to work together from the 
>>>>> get go, ie. I
>>>>> don't want KVM doing something without regard to what perf does, 
>>>>> and vice versa.
>>>>>
>>>> There is similar issue on LoongArch vPMU where vm can directly pmu 
>>>> hardware
>>>> and pmu hw is shard with guest and host. Besides context switch 
>>>> there are
>>>> other places where perf core will access pmu hw, such as tick
>>>> timer/hrtimer/ipi function call, and KVM can only intercept context 
>>>> switch.
>>>
>>> Two questions:
>>>
>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>
>>>   2) If so, KVM can grant partial access to the PMU, or is it all or 
>>> nothing?
>>>
>>> If the answer to both questions is "yes", then it sounds like 
>>> LoongArch *requires*
>>> mediated/passthrough support in order to virtualize its PMU.
>>
>> Hi Sean,
>>
>> Thank for your quick response.
>>
>> yes, kvm can prevent guest from accessing the PMU and grant partial or 
>> all to access to the PMU. Only that if one pmu event is granted to VM, 
>> host can not access this pmu event again. There must be pmu event 
>> switch if host want to.
> 
> PMU event is a software entity which won't be shared. did you mean if a 
> PMU HW counter is granted to VM, then Host can't access the PMU HW 
> counter, right?
Yes, if a PMU HW counter/control register is granted to the VM, its value
comes from the guest and is not meaningful to the host. The host perf core
does not know that it has been granted to the VM; the host still thinks that
it owns the PMU.

It is just like the FPU registers, which are shared by the VM and the host
at different times and are switched lazily. But if an IPI or timer interrupt
were to use the FPU registers on the host, there would be the same issue.

Regards
Bibo Mao
> 
> 
>>
>>>
>>>> Can we add callback handler in structure kvm_guest_cbs?  just like 
>>>> this:
>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks 
>>>> kvm_guest_cbs
>>>> = {
>>>>          .state                  = kvm_guest_state,
>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>          .handle_intel_pt_intr   = NULL,
>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>   };
>>>>
>>>> By the way, I do not know should the callback handler be triggered 
>>>> in perf
>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is 
>>>> triggered in
>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>> but I think it will be better if it is done in perf core.
>>>
>>> I don't think we want to take the approach of perf and KVM guests 
>>> "fighting" over
>>> the PMU.  That's effectively what we have today, and it's a mess for 
>>> KVM because
>>> it's impossible to provide consistent, deterministic behavior for the 
>>> guest.  And
>>> it's just as messy for perf, which ends up having wierd, cumbersome 
>>> flows that
>>> exists purely to try to play nice with KVM.
>> With existing pmu core code, in tick timer interrupt or IPI function 
>> call interrupt pmu hw may be accessed by host when VM is running and 
>> pmu is already granted to guest. KVM can not intercept host IPI/timer 
>> interrupt, there is no pmu context switch, there will be problem.
>>
>> Regards
>> Bibo Mao
>>


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  2:53                               ` maobibo
@ 2024-04-23  3:13                                 ` Mi, Dapeng
  2024-04-23  3:26                                   ` maobibo
  2024-04-23  3:55                                   ` maobibo
  0 siblings, 2 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-23  3:13 UTC (permalink / raw)
  To: maobibo, Sean Christopherson
  Cc: Mingwei Zhang, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao


On 4/23/2024 10:53 AM, maobibo wrote:
>
>
> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>
>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>
>>>
>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson 
>>>>>>> <seanjc@google.com> wrote:
>>>>>>>> One my biggest complaints with the current vPMU code is that 
>>>>>>>> the roles and
>>>>>>>> responsibilities between KVM and perf are poorly defined, which 
>>>>>>>> leads to suboptimal
>>>>>>>> and hard to maintain code.
>>>>>>>>
>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs 
>>>>>>>> _would_ leak guest
>>>>>>>> state to userspace processes that have RDPMC permissions, as 
>>>>>>>> the PMCs might not
>>>>>>>> be dirty from perf's perspective (see 
>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>
>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in 
>>>>>>>> doing so makes the
>>>>>>>> overall code brittle because it's not clear whether KVM _needs_ 
>>>>>>>> to clear PMCs,
>>>>>>>> or if KVM is just being paranoid.
>>>>>>>
>>>>>>> So once this rolls out, perf and vPMU are clients directly to 
>>>>>>> PMU HW.
>>>>>>
>>>>>> I don't think this is a statement we want to make, as it opens a 
>>>>>> discussion
>>>>>> that we won't win.  Nor do I think it's one we *need* to make.  
>>>>>> KVM doesn't need
>>>>>> to be on equal footing with perf in terms of owning/managing PMU 
>>>>>> hardware, KVM
>>>>>> just needs a few APIs to allow faithfully and accurately 
>>>>>> virtualizing a guest PMU.
>>>>>>
>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>> Currently, there is no such deal, but I believe we could have 
>>>>>>> one via
>>>>>>> future discussion.
>>>>>>
>>>>>> What I am saying is that there needs to be a "deal" in place 
>>>>>> before this code
>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can 
>>>>>> still pave over
>>>>>> PMCs it doesn't immediately load, as opposed to using 
>>>>>> cpu_hw_events.dirty to lazily
>>>>>> do the clearing.  But perf and KVM need to work together from the 
>>>>>> get go, ie. I
>>>>>> don't want KVM doing something without regard to what perf does, 
>>>>>> and vice versa.
>>>>>>
>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu 
>>>>> hardware
>>>>> and pmu hw is shard with guest and host. Besides context switch 
>>>>> there are
>>>>> other places where perf core will access pmu hw, such as tick
>>>>> timer/hrtimer/ipi function call, and KVM can only intercept 
>>>>> context switch.
>>>>
>>>> Two questions:
>>>>
>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>
>>>>   2) If so, KVM can grant partial access to the PMU, or is it all 
>>>> or nothing?
>>>>
>>>> If the answer to both questions is "yes", then it sounds like 
>>>> LoongArch *requires*
>>>> mediated/passthrough support in order to virtualize its PMU.
>>>
>>> Hi Sean,
>>>
>>> Thank for your quick response.
>>>
>>> yes, kvm can prevent guest from accessing the PMU and grant partial 
>>> or all to access to the PMU. Only that if one pmu event is granted 
>>> to VM, host can not access this pmu event again. There must be pmu 
>>> event switch if host want to.
>>
>> PMU event is a software entity which won't be shared. did you mean if 
>> a PMU HW counter is granted to VM, then Host can't access the PMU HW 
>> counter, right?
> yes, if PMU HW counter/control is granted to VM. The value comes from 
> guest, and is not meaningful for host.  Host pmu core does not know 
> that it is granted to VM, host still think that it owns pmu.

That's one issue this patchset tries to solve. The current new mediated x86
vPMU framework doesn't allow the host and the guest to own the PMU HW
resource simultaneously. Only when there is no !exclude_guest event on the
host is the guest allowed to exclusively own the PMU HW resource.
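
(Not from the patchset, just to illustrate that rule from the host side: a
host perf_event_open() user can mark its event exclude_guest so that it does
not block the guest from owning the PMU HW. The helper name below is made up
for the example.)

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a CPU-cycles event that counts host-only, i.e. an exclude_guest event. */
static int open_host_only_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;	/* do not count while the guest is running */

	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}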


>
> Just like FPU register, it is shared by VM and host during different 
> time and it is lately switched. But if IPI or timer interrupt uses FPU 
> register on host, there will be the same issue.

I didn't fully get your point. When an IPI or timer interrupt arrives, a
VM-exit is triggered so that the CPU traps into the host first, and then the
host interrupt handler is called. Or are you complaining about the execution
order of switching the guest PMU MSRs relative to these interrupt handlers?


>
> Regards
> Bibo Mao
>>
>>
>>>
>>>>
>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like 
>>>>> this:
>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks 
>>>>> kvm_guest_cbs
>>>>> = {
>>>>>          .state                  = kvm_guest_state,
>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>          .handle_intel_pt_intr   = NULL,
>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>   };
>>>>>
>>>>> By the way, I do not know should the callback handler be triggered 
>>>>> in perf
>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is 
>>>>> triggered in
>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>> but I think it will be better if it is done in perf core.
>>>>
>>>> I don't think we want to take the approach of perf and KVM guests 
>>>> "fighting" over
>>>> the PMU.  That's effectively what we have today, and it's a mess 
>>>> for KVM because
>>>> it's impossible to provide consistent, deterministic behavior for 
>>>> the guest.  And
>>>> it's just as messy for perf, which ends up having wierd, cumbersome 
>>>> flows that
>>>> exists purely to try to play nice with KVM.
>>> With existing pmu core code, in tick timer interrupt or IPI function 
>>> call interrupt pmu hw may be accessed by host when VM is running and 
>>> pmu is already granted to guest. KVM can not intercept host 
>>> IPI/timer interrupt, there is no pmu context switch, there will be 
>>> problem.
>>>
>>> Regards
>>> Bibo Mao
>>>
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  3:13                                 ` Mi, Dapeng
@ 2024-04-23  3:26                                   ` maobibo
  2024-04-23  3:59                                     ` Mi, Dapeng
  2024-04-23  3:55                                   ` maobibo
  1 sibling, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-23  3:26 UTC (permalink / raw)
  To: Mi, Dapeng, Sean Christopherson
  Cc: Mingwei Zhang, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024/4/23 11:13 AM, Mi, Dapeng wrote:
> 
> On 4/23/2024 10:53 AM, maobibo wrote:
>>
>>
>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson 
>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>> One my biggest complaints with the current vPMU code is that 
>>>>>>>>> the roles and
>>>>>>>>> responsibilities between KVM and perf are poorly defined, which 
>>>>>>>>> leads to suboptimal
>>>>>>>>> and hard to maintain code.
>>>>>>>>>
>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs 
>>>>>>>>> _would_ leak guest
>>>>>>>>> state to userspace processes that have RDPMC permissions, as 
>>>>>>>>> the PMCs might not
>>>>>>>>> be dirty from perf's perspective (see 
>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>
>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in 
>>>>>>>>> doing so makes the
>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_ 
>>>>>>>>> to clear PMCs,
>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>
>>>>>>>> So once this rolls out, perf and vPMU are clients directly to 
>>>>>>>> PMU HW.
>>>>>>>
>>>>>>> I don't think this is a statement we want to make, as it opens a 
>>>>>>> discussion
>>>>>>> that we won't win.  Nor do I think it's one we *need* to make. 
>>>>>>> KVM doesn't need
>>>>>>> to be on equal footing with perf in terms of owning/managing PMU 
>>>>>>> hardware, KVM
>>>>>>> just needs a few APIs to allow faithfully and accurately 
>>>>>>> virtualizing a guest PMU.
>>>>>>>
>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>> Currently, there is no such deal, but I believe we could have 
>>>>>>>> one via
>>>>>>>> future discussion.
>>>>>>>
>>>>>>> What I am saying is that there needs to be a "deal" in place 
>>>>>>> before this code
>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can 
>>>>>>> still pave over
>>>>>>> PMCs it doesn't immediately load, as opposed to using 
>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>> do the clearing.  But perf and KVM need to work together from the 
>>>>>>> get go, ie. I
>>>>>>> don't want KVM doing something without regard to what perf does, 
>>>>>>> and vice versa.
>>>>>>>
>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu 
>>>>>> hardware
>>>>>> and pmu hw is shard with guest and host. Besides context switch 
>>>>>> there are
>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept 
>>>>>> context switch.
>>>>>
>>>>> Two questions:
>>>>>
>>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>>
>>>>>   2) If so, KVM can grant partial access to the PMU, or is it all 
>>>>> or nothing?
>>>>>
>>>>> If the answer to both questions is "yes", then it sounds like 
>>>>> LoongArch *requires*
>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>
>>>> Hi Sean,
>>>>
>>>> Thank for your quick response.
>>>>
>>>> yes, kvm can prevent guest from accessing the PMU and grant partial 
>>>> or all to access to the PMU. Only that if one pmu event is granted 
>>>> to VM, host can not access this pmu event again. There must be pmu 
>>>> event switch if host want to.
>>>
>>> PMU event is a software entity which won't be shared. did you mean if 
>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW 
>>> counter, right?
>> yes, if PMU HW counter/control is granted to VM. The value comes from 
>> guest, and is not meaningful for host.  Host pmu core does not know 
>> that it is granted to VM, host still think that it owns pmu.
> 
> That's one issue this patchset tries to solve. Current new mediated x86 
> vPMU framework doesn't allow Host or Guest own the PMU HW resource 
> simultaneously. Only when there is no !exclude_guest event on host, 
> guest is allowed to exclusively own the PMU HW resource.
> 
> 
>>
>> Just like FPU register, it is shared by VM and host during different 
>> time and it is lately switched. But if IPI or timer interrupt uses FPU 
>> register on host, there will be the same issue.
> 
> I didn't fully get your point. When IPI or timer interrupt reach, a 
> VM-exit is triggered to make CPU traps into host first and then the host 
> interrupt handler is called. Or are you complaining the executing 
> sequence of switching guest PMU MSRs and these interrupt handler?
It is not necessary to save/restore the PMU HW at every VM exit; it had
better be saved/restored lazily, for example only when the vCPU thread is
sched-out/sched-in, otherwise the cost will be a little expensive.
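
Something like the following sketch is what I mean, just for illustration:
pmu_ctx_save()/pmu_ctx_restore() are placeholder helpers, not existing
functions, and the idea is only to hang the switch off the existing
kvm_arch_vcpu_load()/kvm_arch_vcpu_put() hooks (called around vCPU
sched-in/sched-out) rather than off the VM-exit path.

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	pmu_ctx_restore(vcpu);	/* reload guest counters lazily */
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	pmu_ctx_save(vcpu);	/* guest counters stay live across plain VM exits */
}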

I know little about the perf core. However, the PMU HW is also accessed in
interrupt context. That means PMU HW accesses should generally be done with
IRQs disabled, otherwise there may be nested PMU HW accesses. Is that true?

> 
> 
>>
>> Regards
>> Bibo Mao
>>>
>>>
>>>>
>>>>>
>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like 
>>>>>> this:
>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks 
>>>>>> kvm_guest_cbs
>>>>>> = {
>>>>>>          .state                  = kvm_guest_state,
>>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>>          .handle_intel_pt_intr   = NULL,
>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>   };
>>>>>>
>>>>>> By the way, I do not know should the callback handler be triggered 
>>>>>> in perf
>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is 
>>>>>> triggered in
>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>> but I think it will be better if it is done in perf core.
>>>>>
>>>>> I don't think we want to take the approach of perf and KVM guests 
>>>>> "fighting" over
>>>>> the PMU.  That's effectively what we have today, and it's a mess 
>>>>> for KVM because
>>>>> it's impossible to provide consistent, deterministic behavior for 
>>>>> the guest.  And
>>>>> it's just as messy for perf, which ends up having wierd, cumbersome 
>>>>> flows that
>>>>> exists purely to try to play nice with KVM.
>>>> With existing pmu core code, in tick timer interrupt or IPI function 
>>>> call interrupt pmu hw may be accessed by host when VM is running and 
>>>> pmu is already granted to guest. KVM can not intercept host 
>>>> IPI/timer interrupt, there is no pmu context switch, there will be 
>>>> problem.
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>
>>


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  3:13                                 ` Mi, Dapeng
  2024-04-23  3:26                                   ` maobibo
@ 2024-04-23  3:55                                   ` maobibo
  2024-04-23  4:23                                     ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-23  3:55 UTC (permalink / raw)
  To: Mi, Dapeng, Sean Christopherson
  Cc: Mingwei Zhang, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao



On 2024/4/23 11:13 AM, Mi, Dapeng wrote:
> 
> On 4/23/2024 10:53 AM, maobibo wrote:
>>
>>
>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson 
>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>> One my biggest complaints with the current vPMU code is that 
>>>>>>>>> the roles and
>>>>>>>>> responsibilities between KVM and perf are poorly defined, which 
>>>>>>>>> leads to suboptimal
>>>>>>>>> and hard to maintain code.
>>>>>>>>>
>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs 
>>>>>>>>> _would_ leak guest
>>>>>>>>> state to userspace processes that have RDPMC permissions, as 
>>>>>>>>> the PMCs might not
>>>>>>>>> be dirty from perf's perspective (see 
>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>
>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in 
>>>>>>>>> doing so makes the
>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_ 
>>>>>>>>> to clear PMCs,
>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>
>>>>>>>> So once this rolls out, perf and vPMU are clients directly to 
>>>>>>>> PMU HW.
>>>>>>>
>>>>>>> I don't think this is a statement we want to make, as it opens a 
>>>>>>> discussion
>>>>>>> that we won't win.  Nor do I think it's one we *need* to make. 
>>>>>>> KVM doesn't need
>>>>>>> to be on equal footing with perf in terms of owning/managing PMU 
>>>>>>> hardware, KVM
>>>>>>> just needs a few APIs to allow faithfully and accurately 
>>>>>>> virtualizing a guest PMU.
>>>>>>>
>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>> Currently, there is no such deal, but I believe we could have 
>>>>>>>> one via
>>>>>>>> future discussion.
>>>>>>>
>>>>>>> What I am saying is that there needs to be a "deal" in place 
>>>>>>> before this code
>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can 
>>>>>>> still pave over
>>>>>>> PMCs it doesn't immediately load, as opposed to using 
>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>> do the clearing.  But perf and KVM need to work together from the 
>>>>>>> get go, ie. I
>>>>>>> don't want KVM doing something without regard to what perf does, 
>>>>>>> and vice versa.
>>>>>>>
>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu 
>>>>>> hardware
>>>>>> and pmu hw is shard with guest and host. Besides context switch 
>>>>>> there are
>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept 
>>>>>> context switch.
>>>>>
>>>>> Two questions:
>>>>>
>>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>>
>>>>>   2) If so, KVM can grant partial access to the PMU, or is it all 
>>>>> or nothing?
>>>>>
>>>>> If the answer to both questions is "yes", then it sounds like 
>>>>> LoongArch *requires*
>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>
>>>> Hi Sean,
>>>>
>>>> Thank for your quick response.
>>>>
>>>> yes, kvm can prevent guest from accessing the PMU and grant partial 
>>>> or all to access to the PMU. Only that if one pmu event is granted 
>>>> to VM, host can not access this pmu event again. There must be pmu 
>>>> event switch if host want to.
>>>
>>> PMU event is a software entity which won't be shared. did you mean if 
>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW 
>>> counter, right?
>> yes, if PMU HW counter/control is granted to VM. The value comes from 
>> guest, and is not meaningful for host.  Host pmu core does not know 
>> that it is granted to VM, host still think that it owns pmu.
> 
> That's one issue this patchset tries to solve. Current new mediated x86 
> vPMU framework doesn't allow Host or Guest own the PMU HW resource 
> simultaneously. Only when there is no !exclude_guest event on host, 
> guest is allowed to exclusively own the PMU HW resource.
> 
> 
>>
>> Just like FPU register, it is shared by VM and host during different 
>> time and it is lately switched. But if IPI or timer interrupt uses FPU 
>> register on host, there will be the same issue.
> 
> I didn't fully get your point. When IPI or timer interrupt reach, a 
> VM-exit is triggered to make CPU traps into host first and then the host 
Yes, it is.

> interrupt handler is called. Or are you complaining the executing 
> sequence of switching guest PMU MSRs and these interrupt handler?
In our vPMU implementation, it is OK if the vPMU is switched in the VM-exit
path; however, there is a problem if the vPMU is switched in the vCPU thread
sched-out/sched-in path, since IPI/timer IRQ interrupts access PMU registers
in host mode.

In general it will be better if the switch is done at vCPU thread
sched-out/sched-in, unless there is a requirement to profile the KVM
hypervisor. Even if there is such a requirement, it is only one option. In
most cases, it is better if the time spent on a VM context exit is small.

> 
> 
>>
>> Regards
>> Bibo Mao
>>>
>>>
>>>>
>>>>>
>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like 
>>>>>> this:
>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks 
>>>>>> kvm_guest_cbs
>>>>>> = {
>>>>>>          .state                  = kvm_guest_state,
>>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>>          .handle_intel_pt_intr   = NULL,
>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>   };
>>>>>>
>>>>>> By the way, I do not know should the callback handler be triggered 
>>>>>> in perf
>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is 
>>>>>> triggered in
>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>> but I think it will be better if it is done in perf core.
>>>>>
>>>>> I don't think we want to take the approach of perf and KVM guests 
>>>>> "fighting" over
>>>>> the PMU.  That's effectively what we have today, and it's a mess 
>>>>> for KVM because
>>>>> it's impossible to provide consistent, deterministic behavior for 
>>>>> the guest.  And
>>>>> it's just as messy for perf, which ends up having wierd, cumbersome 
>>>>> flows that
>>>>> exists purely to try to play nice with KVM.
>>>> With existing pmu core code, in tick timer interrupt or IPI function 
>>>> call interrupt pmu hw may be accessed by host when VM is running and 
>>>> pmu is already granted to guest. KVM can not intercept host 
>>>> IPI/timer interrupt, there is no pmu context switch, there will be 
>>>> problem.
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>
>>


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  3:26                                   ` maobibo
@ 2024-04-23  3:59                                     ` Mi, Dapeng
  0 siblings, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-23  3:59 UTC (permalink / raw)
  To: maobibo, Sean Christopherson
  Cc: Mingwei Zhang, Xiong Zhang, pbonzini, peterz, kan.liang, zhenyuw,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao


On 4/23/2024 11:26 AM, maobibo wrote:
>
>
> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>
>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>
>>>
>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>
>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson 
>>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>>> One my biggest complaints with the current vPMU code is that 
>>>>>>>>>> the roles and
>>>>>>>>>> responsibilities between KVM and perf are poorly defined, 
>>>>>>>>>> which leads to suboptimal
>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>
>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs 
>>>>>>>>>> _would_ leak guest
>>>>>>>>>> state to userspace processes that have RDPMC permissions, as 
>>>>>>>>>> the PMCs might not
>>>>>>>>>> be dirty from perf's perspective (see 
>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>
>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in 
>>>>>>>>>> doing so makes the
>>>>>>>>>> overall code brittle because it's not clear whether KVM 
>>>>>>>>>> _needs_ to clear PMCs,
>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>
>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to 
>>>>>>>>> PMU HW.
>>>>>>>>
>>>>>>>> I don't think this is a statement we want to make, as it opens 
>>>>>>>> a discussion
>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make. 
>>>>>>>> KVM doesn't need
>>>>>>>> to be on equal footing with perf in terms of owning/managing 
>>>>>>>> PMU hardware, KVM
>>>>>>>> just needs a few APIs to allow faithfully and accurately 
>>>>>>>> virtualizing a guest PMU.
>>>>>>>>
>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>> implementation, until both clients agree to a "deal" between 
>>>>>>>>> them.
>>>>>>>>> Currently, there is no such deal, but I believe we could have 
>>>>>>>>> one via
>>>>>>>>> future discussion.
>>>>>>>>
>>>>>>>> What I am saying is that there needs to be a "deal" in place 
>>>>>>>> before this code
>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can 
>>>>>>>> still pave over
>>>>>>>> PMCs it doesn't immediately load, as opposed to using 
>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>> do the clearing.  But perf and KVM need to work together from 
>>>>>>>> the get go, ie. I
>>>>>>>> don't want KVM doing something without regard to what perf 
>>>>>>>> does, and vice versa.
>>>>>>>>
>>>>>>> There is similar issue on LoongArch vPMU where vm can directly 
>>>>>>> pmu hardware
>>>>>>> and pmu hw is shard with guest and host. Besides context switch 
>>>>>>> there are
>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept 
>>>>>>> context switch.
>>>>>>
>>>>>> Two questions:
>>>>>>
>>>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>>>
>>>>>>   2) If so, KVM can grant partial access to the PMU, or is it all 
>>>>>> or nothing?
>>>>>>
>>>>>> If the answer to both questions is "yes", then it sounds like 
>>>>>> LoongArch *requires*
>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Thank for your quick response.
>>>>>
>>>>> yes, kvm can prevent guest from accessing the PMU and grant 
>>>>> partial or all to access to the PMU. Only that if one pmu event is 
>>>>> granted to VM, host can not access this pmu event again. There 
>>>>> must be pmu event switch if host want to.
>>>>
>>>> PMU event is a software entity which won't be shared. did you mean 
>>>> if a PMU HW counter is granted to VM, then Host can't access the 
>>>> PMU HW counter, right?
>>> yes, if PMU HW counter/control is granted to VM. The value comes 
>>> from guest, and is not meaningful for host.  Host pmu core does not 
>>> know that it is granted to VM, host still think that it owns pmu.
>>
>> That's one issue this patchset tries to solve. Current new mediated 
>> x86 vPMU framework doesn't allow Host or Guest own the PMU HW 
>> resource simultaneously. Only when there is no !exclude_guest event 
>> on host, guest is allowed to exclusively own the PMU HW resource.
>>
>>
>>>
>>> Just like FPU register, it is shared by VM and host during different 
>>> time and it is lately switched. But if IPI or timer interrupt uses 
>>> FPU register on host, there will be the same issue.
>>
>> I didn't fully get your point. When IPI or timer interrupt reach, a 
>> VM-exit is triggered to make CPU traps into host first and then the 
>> host interrupt handler is called. Or are you complaining the 
>> executing sequence of switching guest PMU MSRs and these interrupt 
>> handler?
> It is not necessary to save/restore PMU HW at every vm exit, it had 
> better be lately saved/restored, such as only when vcpu thread is 
> sched-out/sched-in, else the cost will be a little expensive.

I doubt this optimization of deferring the guest PMU state save/restore to
the vCPU task-switching boundary would really land in KVM, since it would
make the host lose the capability to profile KVM, and it seems Sean objects
to this.


>
> I know little about perf core. However there is PMU HW access in 
> interrupt mode. That means PMU HW access should be irq disabled in 
> general mode, else there may be nested PMU HW access. Is that true?

I had no idea that the timer IRQ handler would access PMU MSRs before.
Could you please show me the code? I would like to look at it first. Thanks.


>
>>
>>
>>>
>>> Regards
>>> Bibo Mao
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just 
>>>>>>> like this:
>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks 
>>>>>>> kvm_guest_cbs
>>>>>>> = {
>>>>>>>          .state                  = kvm_guest_state,
>>>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>>>          .handle_intel_pt_intr   = NULL,
>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>   };
>>>>>>>
>>>>>>> By the way, I do not know should the callback handler be 
>>>>>>> triggered in perf
>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is 
>>>>>>> triggered in
>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>
>>>>>> I don't think we want to take the approach of perf and KVM guests 
>>>>>> "fighting" over
>>>>>> the PMU.  That's effectively what we have today, and it's a mess 
>>>>>> for KVM because
>>>>>> it's impossible to provide consistent, deterministic behavior for 
>>>>>> the guest.  And
>>>>>> it's just as messy for perf, which ends up having wierd, 
>>>>>> cumbersome flows that
>>>>>> exists purely to try to play nice with KVM.
>>>>> With existing pmu core code, in tick timer interrupt or IPI 
>>>>> function call interrupt pmu hw may be accessed by host when VM is 
>>>>> running and pmu is already granted to guest. KVM can not intercept 
>>>>> host IPI/timer interrupt, there is no pmu context switch, there 
>>>>> will be problem.
>>>>>
>>>>> Regards
>>>>> Bibo Mao
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  3:55                                   ` maobibo
@ 2024-04-23  4:23                                     ` Mingwei Zhang
  2024-04-23  6:08                                       ` maobibo
  2024-04-23 12:12                                       ` maobibo
  0 siblings, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-23  4:23 UTC (permalink / raw)
  To: maobibo
  Cc: Mi, Dapeng, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
>
>
>
> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
> >
> > On 4/23/2024 10:53 AM, maobibo wrote:
> >>
> >>
> >> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
> >>>
> >>> On 4/23/2024 9:01 AM, maobibo wrote:
> >>>>
> >>>>
> >>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
> >>>>> On Mon, Apr 22, 2024, maobibo wrote:
> >>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
> >>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> >>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
> >>>>>>>> <seanjc@google.com> wrote:
> >>>>>>>>> One my biggest complaints with the current vPMU code is that
> >>>>>>>>> the roles and
> >>>>>>>>> responsibilities between KVM and perf are poorly defined, which
> >>>>>>>>> leads to suboptimal
> >>>>>>>>> and hard to maintain code.
> >>>>>>>>>
> >>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
> >>>>>>>>> _would_ leak guest
> >>>>>>>>> state to userspace processes that have RDPMC permissions, as
> >>>>>>>>> the PMCs might not
> >>>>>>>>> be dirty from perf's perspective (see
> >>>>>>>>> perf_clear_dirty_counters()).
> >>>>>>>>>
> >>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
> >>>>>>>>> doing so makes the
> >>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
> >>>>>>>>> to clear PMCs,
> >>>>>>>>> or if KVM is just being paranoid.
> >>>>>>>>
> >>>>>>>> So once this rolls out, perf and vPMU are clients directly to
> >>>>>>>> PMU HW.
> >>>>>>>
> >>>>>>> I don't think this is a statement we want to make, as it opens a
> >>>>>>> discussion
> >>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
> >>>>>>> KVM doesn't need
> >>>>>>> to be on equal footing with perf in terms of owning/managing PMU
> >>>>>>> hardware, KVM
> >>>>>>> just needs a few APIs to allow faithfully and accurately
> >>>>>>> virtualizing a guest PMU.
> >>>>>>>
> >>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
> >>>>>>>> implementation, until both clients agree to a "deal" between them.
> >>>>>>>> Currently, there is no such deal, but I believe we could have
> >>>>>>>> one via
> >>>>>>>> future discussion.
> >>>>>>>
> >>>>>>> What I am saying is that there needs to be a "deal" in place
> >>>>>>> before this code
> >>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
> >>>>>>> still pave over
> >>>>>>> PMCs it doesn't immediately load, as opposed to using
> >>>>>>> cpu_hw_events.dirty to lazily
> >>>>>>> do the clearing.  But perf and KVM need to work together from the
> >>>>>>> get go, ie. I
> >>>>>>> don't want KVM doing something without regard to what perf does,
> >>>>>>> and vice versa.
> >>>>>>>
> >>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
> >>>>>> hardware
> >>>>>> and pmu hw is shard with guest and host. Besides context switch
> >>>>>> there are
> >>>>>> other places where perf core will access pmu hw, such as tick
> >>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
> >>>>>> context switch.
> >>>>>
> >>>>> Two questions:
> >>>>>
> >>>>>   1) Can KVM prevent the guest from accessing the PMU?
> >>>>>
> >>>>>   2) If so, KVM can grant partial access to the PMU, or is it all
> >>>>> or nothing?
> >>>>>
> >>>>> If the answer to both questions is "yes", then it sounds like
> >>>>> LoongArch *requires*
> >>>>> mediated/passthrough support in order to virtualize its PMU.
> >>>>
> >>>> Hi Sean,
> >>>>
> >>>> Thank for your quick response.
> >>>>
> >>>> yes, kvm can prevent guest from accessing the PMU and grant partial
> >>>> or all to access to the PMU. Only that if one pmu event is granted
> >>>> to VM, host can not access this pmu event again. There must be pmu
> >>>> event switch if host want to.
> >>>
> >>> PMU event is a software entity which won't be shared. did you mean if
> >>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
> >>> counter, right?
> >> yes, if PMU HW counter/control is granted to VM. The value comes from
> >> guest, and is not meaningful for host.  Host pmu core does not know
> >> that it is granted to VM, host still think that it owns pmu.
> >
> > That's one issue this patchset tries to solve. Current new mediated x86
> > vPMU framework doesn't allow Host or Guest own the PMU HW resource
> > simultaneously. Only when there is no !exclude_guest event on host,
> > guest is allowed to exclusively own the PMU HW resource.
> >
> >
> >>
> >> Just like FPU register, it is shared by VM and host during different
> >> time and it is lately switched. But if IPI or timer interrupt uses FPU
> >> register on host, there will be the same issue.
> >
> > I didn't fully get your point. When IPI or timer interrupt reach, a
> > VM-exit is triggered to make CPU traps into host first and then the host
> yes, it is.

This is correct. And this is one of the points we debated internally:
whether we should do the PMU context switch at the vcpu loop boundary
or at the VM Enter/Exit boundary. A (host-level) timer interrupt can
force a VM Exit, which I think happens every 4ms or 1ms, depending on
configuration.

One of the key reasons we currently propose this is that it is the
same boundary as the legacy PMU, i.e., it is the simplest to propose
from the perf subsystem perspective.

Performance-wise, doing the PMU context switch at the vcpu boundary
would be way better in general. But the downside is that the perf
subsystem loses the capability to profile the majority of the KVM code
(functions) when the guest PMU is enabled.

>
> > interrupt handler is called. Or are you complaining the executing
> > sequence of switching guest PMU MSRs and these interrupt handler?
> In our vPMU implementation, it is ok if vPMU is switched in vm exit
> path, however there is problem if vPMU is switched during vcpu thread
> sched-out/sched-in path since IPI/timer irq interrupt access pmu
> register in host mode.

Oh, the IPI/timer irq handler will access PMU registers? I thought
only the host-level NMI handler would access the PMU MSRs, since the
PMI is registered as an NMI.

In that case, you should disable IRQs during the vcpu context switch.
For the NMI, we prevent its handler from accessing the PMU registers.
In particular, we use a per-cpu variable to guard that, so the
host-level PMI handler in the perf subsystem will check that variable
before proceeding.
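
A rough sketch of that guard (the variable and the KVM helpers below
are invented for illustration and are not the actual RFC code;
perf_event_nmi_handler is the existing x86 PMI entry point, shown here
in simplified form):

static DEFINE_PER_CPU(bool, pmu_passthrough_active);

/* Flipped by KVM around the window in which the guest owns the PMU. */
void kvm_pmu_passthrough_begin(void)
{
	__this_cpu_write(pmu_passthrough_active, true);
}

void kvm_pmu_passthrough_end(void)
{
	__this_cpu_write(pmu_passthrough_active, false);
}

/* Host PMI (NMI) handler bails out while the guest owns the PMU. */
static int perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	if (__this_cpu_read(pmu_passthrough_active))
		return NMI_DONE;	/* PMU state belongs to the guest */

	/* ... normal host PMI processing ... */
	return NMI_HANDLED;
}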

>
> In general it will be better if the switch is done in vcpu thread
> sched-out/sched-in, else there is requirement to profile kvm
> hypervisor.Even there is such requirement, it is only one option. In
> most conditions, it will better if time of VM context exit is small.
>
Performance-wise, agreed, but there will be debate about the perf
functionality loss at the host level.

Maybe (just maybe) it is possible to do the PMU context switch at the
vcpu boundary normally, but do it at the VM Enter/Exit boundary when
the host is profiling the KVM kernel module. So, dynamically adjusting
the PMU context switch location could be an option.

> >
> >
> >>
> >> Regards
> >> Bibo Mao
> >>>
> >>>
> >>>>
> >>>>>
> >>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
> >>>>>> this:
> >>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
> >>>>>> kvm_guest_cbs
> >>>>>> = {
> >>>>>>          .state                  = kvm_guest_state,
> >>>>>>          .get_ip                 = kvm_guest_get_ip,
> >>>>>>          .handle_intel_pt_intr   = NULL,
> >>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
> >>>>>>   };
> >>>>>>
> >>>>>> By the way, I do not know should the callback handler be triggered
> >>>>>> in perf
> >>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
> >>>>>> triggered in
> >>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
> >>>>>> but I think it will be better if it is done in perf core.
> >>>>>
> >>>>> I don't think we want to take the approach of perf and KVM guests
> >>>>> "fighting" over
> >>>>> the PMU.  That's effectively what we have today, and it's a mess
> >>>>> for KVM because
> >>>>> it's impossible to provide consistent, deterministic behavior for
> >>>>> the guest.  And
> >>>>> it's just as messy for perf, which ends up having wierd, cumbersome
> >>>>> flows that
> >>>>> exists purely to try to play nice with KVM.
> >>>> With existing pmu core code, in tick timer interrupt or IPI function
> >>>> call interrupt pmu hw may be accessed by host when VM is running and
> >>>> pmu is already granted to guest. KVM can not intercept host
> >>>> IPI/timer interrupt, there is no pmu context switch, there will be
> >>>> problem.
> >>>>
> >>>> Regards
> >>>> Bibo Mao
> >>>>
> >>
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  4:23                                     ` Mingwei Zhang
@ 2024-04-23  6:08                                       ` maobibo
  2024-04-23  6:45                                         ` Mi, Dapeng
  2024-04-23 12:12                                       ` maobibo
  1 sibling, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-23  6:08 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Mi, Dapeng, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024/4/23 下午12:23, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
>>
>>
>>
>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>
>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>> the roles and
>>>>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>
>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>> the PMCs might not
>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>
>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>> doing so makes the
>>>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>
>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>> PMU HW.
>>>>>>>>>
>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>> discussion
>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>> KVM doesn't need
>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>> hardware, KVM
>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>
>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>> one via
>>>>>>>>>> future discussion.
>>>>>>>>>
>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>> before this code
>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>> still pave over
>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>> do the clearing.  But perf and KVM need to work together from the
>>>>>>>>> get go, ie. I
>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>> and vice versa.
>>>>>>>>>
>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>>>>> hardware
>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>> there are
>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>> context switch.
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>>    1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>
>>>>>>>    2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>> or nothing?
>>>>>>>
>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>> LoongArch *requires*
>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thank for your quick response.
>>>>>>
>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>> event switch if host want to.
>>>>>
>>>>> PMU event is a software entity which won't be shared. did you mean if
>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>> counter, right?
>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>> that it is granted to VM, host still think that it owns pmu.
>>>
>>> That's one issue this patchset tries to solve. Current new mediated x86
>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>> simultaneously. Only when there is no !exclude_guest event on host,
>>> guest is allowed to exclusively own the PMU HW resource.
>>>
>>>
>>>>
>>>> Just like FPU register, it is shared by VM and host during different
>>>> time and it is lately switched. But if IPI or timer interrupt uses FPU
>>>> register on host, there will be the same issue.
>>>
>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>> VM-exit is triggered to make CPU traps into host first and then the host
>> yes, it is.
> 
> This is correct. And this is one of the points that we had debated
> internally whether we should do PMU context switch at vcpu loop
> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
> force VM Exit, which I think happens every 4ms or 1ms, depending on
> configuration.
> 
> One of the key reasons we currently propose this is because it is the
> same boundary as the legacy PMU, i.e., it would be simple to propose
> from the perf subsystem perspective.
> 
> Performance wise, doing PMU context switch at vcpu boundary would be
> way better in general. But the downside is that perf sub-system lose
> the capability to profile majority of the KVM code (functions) when
> guest PMU is enabled.
> 
>>
>>> interrupt handler is called. Or are you complaining the executing
>>> sequence of switching guest PMU MSRs and these interrupt handler?
>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>> path, however there is problem if vPMU is switched during vcpu thread
>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>> register in host mode.
> 
> Oh, the IPI/timer irq handler will access PMU registers? I thought
> only the host-level NMI handler will access the PMU MSRs since PMI is
> registered under NMI.
> 
> In that case, you should disable  IRQ during vcpu context switch. For
> NMI, we prevent its handler from accessing the PMU registers. In
> particular, we use a per-cpu variable to guard that. So, the
> host-level PMI handler for perf sub-system will check the variable
> before proceeding.

The perf core will access PMU HW from the tick timer/hrtimer/IPI
function call paths; for example, perf_event_task_tick() is called
from the tick timer, and there are event_function_call(event,
__perf_event_xxx, &value) call sites in kernel/events/core.c.

https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
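
A simplified view of the tick path referred to above (details vary by
kernel version, so treat this as a sketch rather than the exact call
chain):

scheduler_tick()
  -> perf_event_task_tick()
       -> perf_adjust_freq_unthr_context()   /* freq/unthrottle work */
            -> event->pmu->stop(event, PERF_EF_UPDATE)
            -> event->pmu->start(event, PERF_EF_RELOAD)
               /* on x86 these read/write counter and control MSRs */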


> 
>>
>> In general it will be better if the switch is done in vcpu thread
>> sched-out/sched-in, else there is requirement to profile kvm
>> hypervisor.Even there is such requirement, it is only one option. In
>> most conditions, it will better if time of VM context exit is small.
>>
> Performance wise, agree, but there will be debate on perf
> functionality loss at the host level.
> 
> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> boundary normally, but doing it at VM Enter/Exit boundary when host is
> profiling KVM kernel module. So, dynamically adjusting PMU context
> switch location could be an option.
> 
>>>
>>>
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>>>>>>> this:
>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>>> kvm_guest_cbs
>>>>>>>> = {
>>>>>>>>           .state                  = kvm_guest_state,
>>>>>>>>           .get_ip                 = kvm_guest_get_ip,
>>>>>>>>           .handle_intel_pt_intr   = NULL,
>>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>>    };
>>>>>>>>
>>>>>>>> By the way, I do not know should the callback handler be triggered
>>>>>>>> in perf
>>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>>> triggered in
>>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>>
>>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>>> "fighting" over
>>>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>>>> for KVM because
>>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>>> the guest.  And
>>>>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>>>>> flows that
>>>>>>> exists purely to try to play nice with KVM.
>>>>>> With existing pmu core code, in tick timer interrupt or IPI function
>>>>>> call interrupt pmu hw may be accessed by host when VM is running and
>>>>>> pmu is already granted to guest. KVM can not intercept host
>>>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>>>> problem.
>>>>>>
>>>>>> Regards
>>>>>> Bibo Mao
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  6:08                                       ` maobibo
@ 2024-04-23  6:45                                         ` Mi, Dapeng
  2024-04-23  7:10                                           ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-23  6:45 UTC (permalink / raw)
  To: maobibo, Mingwei Zhang
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/23/2024 2:08 PM, maobibo wrote:
>
>
> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
>>>
>>>
>>>
>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>
>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>
>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>> the roles and
>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined, 
>>>>>>>>>>>> which
>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>
>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>
>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>> overall code brittle because it's not clear whether KVM 
>>>>>>>>>>>> _needs_
>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>
>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>> PMU HW.
>>>>>>>>>>
>>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>>> discussion
>>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>>> KVM doesn't need
>>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>>> hardware, KVM
>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>
>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>> implementation, until both clients agree to a "deal" between 
>>>>>>>>>>> them.
>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>> one via
>>>>>>>>>>> future discussion.
>>>>>>>>>>
>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>> before this code
>>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>>> still pave over
>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>> do the clearing.  But perf and KVM need to work together from 
>>>>>>>>>> the
>>>>>>>>>> get go, ie. I
>>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>>> and vice versa.
>>>>>>>>>>
>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly 
>>>>>>>>> pmu
>>>>>>>>> hardware
>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>> there are
>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>> context switch.
>>>>>>>>
>>>>>>>> Two questions:
>>>>>>>>
>>>>>>>>    1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>
>>>>>>>>    2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>>> or nothing?
>>>>>>>>
>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>> LoongArch *requires*
>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thank for your quick response.
>>>>>>>
>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>>> event switch if host want to.
>>>>>>
>>>>>> PMU event is a software entity which won't be shared. did you 
>>>>>> mean if
>>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>>> counter, right?
>>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>
>>>> That's one issue this patchset tries to solve. Current new mediated 
>>>> x86
>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>
>>>>
>>>>>
>>>>> Just like FPU register, it is shared by VM and host during different
>>>>> time and it is lately switched. But if IPI or timer interrupt uses 
>>>>> FPU
>>>>> register on host, there will be the same issue.
>>>>
>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>> VM-exit is triggered to make CPU traps into host first and then the 
>>>> host
>>> yes, it is.
>>
>> This is correct. And this is one of the points that we had debated
>> internally whether we should do PMU context switch at vcpu loop
>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>> configuration.
>>
>> One of the key reasons we currently propose this is because it is the
>> same boundary as the legacy PMU, i.e., it would be simple to propose
>> from the perf subsystem perspective.
>>
>> Performance wise, doing PMU context switch at vcpu boundary would be
>> way better in general. But the downside is that perf sub-system lose
>> the capability to profile majority of the KVM code (functions) when
>> guest PMU is enabled.
>>
>>>
>>>> interrupt handler is called. Or are you complaining the executing
>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>> path, however there is problem if vPMU is switched during vcpu thread
>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>> register in host mode.
>>
>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>> only the host-level NMI handler will access the PMU MSRs since PMI is
>> registered under NMI.
>>
>> In that case, you should disable  IRQ during vcpu context switch. For
>> NMI, we prevent its handler from accessing the PMU registers. In
>> particular, we use a per-cpu variable to guard that. So, the
>> host-level PMI handler for perf sub-system will check the variable
>> before proceeding.
>
> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
> such as function perf_event_task_tick() is called in tick timer, there
> are  event_function_call(event, __perf_event_xxx, &value) in file
> kernel/events/core.c.
>
> https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4 
>

I just went through these functions (not sure if all of them): both
perf_event_task_tick() and the callbacks of event_function_call()
check event->state first; if the event is in
PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs are not actually touched.
In this new proposal, all host events with the exclude_guest attribute
are put into PERF_EVENT_STATE_INACTIVE state while the guest owns the
PMU HW resource. So I think it's fine.
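
A minimal sketch of that state check (modeled on the pattern used by
the event_function_call() callbacks in kernel/events/core.c; the
function name is made up and the body is simplified, not the literal
upstream code):

static void __perf_event_example_callback(struct perf_event *event,
					  struct perf_cpu_context *cpuctx,
					  struct perf_event_context *ctx,
					  void *data)
{
	/*
	 * Bail out before touching hardware if the event is not ACTIVE,
	 * so an exclude_guest event parked in PERF_EVENT_STATE_INACTIVE
	 * while the guest owns the PMU never reaches the MSR accesses.
	 */
	if (event->state != PERF_EVENT_STATE_ACTIVE)
		return;

	/* Only an ACTIVE event gets this far and programs the hardware. */
	event->pmu->stop(event, PERF_EF_UPDATE);
	/* ... update the event, then ... */
	event->pmu->start(event, PERF_EF_RELOAD);
}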


>
>
>>
>>>
>>> In general it will be better if the switch is done in vcpu thread
>>> sched-out/sched-in, else there is requirement to profile kvm
>>> hypervisor.Even there is such requirement, it is only one option. In
>>> most conditions, it will better if time of VM context exit is small.
>>>
>> Performance wise, agree, but there will be debate on perf
>> functionality loss at the host level.
>>
>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>> profiling KVM kernel module. So, dynamically adjusting PMU context
>> switch location could be an option.
>>
>>>>
>>>>
>>>>>
>>>>> Regards
>>>>> Bibo Mao
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just 
>>>>>>>>> like
>>>>>>>>> this:
>>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>>>> kvm_guest_cbs
>>>>>>>>> = {
>>>>>>>>>           .state                  = kvm_guest_state,
>>>>>>>>>           .get_ip                 = kvm_guest_get_ip,
>>>>>>>>>           .handle_intel_pt_intr   = NULL,
>>>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>>>    };
>>>>>>>>>
>>>>>>>>> By the way, I do not know should the callback handler be 
>>>>>>>>> triggered
>>>>>>>>> in perf
>>>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>>>> triggered in
>>>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>>>
>>>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>>>> "fighting" over
>>>>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>>>>> for KVM because
>>>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>>>> the guest.  And
>>>>>>>> it's just as messy for perf, which ends up having wierd, 
>>>>>>>> cumbersome
>>>>>>>> flows that
>>>>>>>> exists purely to try to play nice with KVM.
>>>>>>> With existing pmu core code, in tick timer interrupt or IPI 
>>>>>>> function
>>>>>>> call interrupt pmu hw may be accessed by host when VM is running 
>>>>>>> and
>>>>>>> pmu is already granted to guest. KVM can not intercept host
>>>>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>>>>> problem.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bibo Mao
>>>>>>>
>>>>>
>>>
>
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  6:45                                         ` Mi, Dapeng
@ 2024-04-23  7:10                                           ` Mingwei Zhang
  2024-04-23  8:24                                             ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-23  7:10 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: maobibo, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Mon, Apr 22, 2024 at 11:45 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/23/2024 2:08 PM, maobibo wrote:
> >
> >
> > On 2024/4/23 下午12:23, Mingwei Zhang wrote:
> >> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
> >>>>
> >>>> On 4/23/2024 10:53 AM, maobibo wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
> >>>>>>
> >>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
> >>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
> >>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
> >>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> >>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
> >>>>>>>>>>> <seanjc@google.com> wrote:
> >>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
> >>>>>>>>>>>> the roles and
> >>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
> >>>>>>>>>>>> which
> >>>>>>>>>>>> leads to suboptimal
> >>>>>>>>>>>> and hard to maintain code.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
> >>>>>>>>>>>> _would_ leak guest
> >>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
> >>>>>>>>>>>> the PMCs might not
> >>>>>>>>>>>> be dirty from perf's perspective (see
> >>>>>>>>>>>> perf_clear_dirty_counters()).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
> >>>>>>>>>>>> doing so makes the
> >>>>>>>>>>>> overall code brittle because it's not clear whether KVM
> >>>>>>>>>>>> _needs_
> >>>>>>>>>>>> to clear PMCs,
> >>>>>>>>>>>> or if KVM is just being paranoid.
> >>>>>>>>>>>
> >>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
> >>>>>>>>>>> PMU HW.
> >>>>>>>>>>
> >>>>>>>>>> I don't think this is a statement we want to make, as it opens a
> >>>>>>>>>> discussion
> >>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
> >>>>>>>>>> KVM doesn't need
> >>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
> >>>>>>>>>> hardware, KVM
> >>>>>>>>>> just needs a few APIs to allow faithfully and accurately
> >>>>>>>>>> virtualizing a guest PMU.
> >>>>>>>>>>
> >>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
> >>>>>>>>>>> implementation, until both clients agree to a "deal" between
> >>>>>>>>>>> them.
> >>>>>>>>>>> Currently, there is no such deal, but I believe we could have
> >>>>>>>>>>> one via
> >>>>>>>>>>> future discussion.
> >>>>>>>>>>
> >>>>>>>>>> What I am saying is that there needs to be a "deal" in place
> >>>>>>>>>> before this code
> >>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
> >>>>>>>>>> still pave over
> >>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
> >>>>>>>>>> cpu_hw_events.dirty to lazily
> >>>>>>>>>> do the clearing.  But perf and KVM need to work together from
> >>>>>>>>>> the
> >>>>>>>>>> get go, ie. I
> >>>>>>>>>> don't want KVM doing something without regard to what perf does,
> >>>>>>>>>> and vice versa.
> >>>>>>>>>>
> >>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
> >>>>>>>>> pmu
> >>>>>>>>> hardware
> >>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
> >>>>>>>>> there are
> >>>>>>>>> other places where perf core will access pmu hw, such as tick
> >>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
> >>>>>>>>> context switch.
> >>>>>>>>
> >>>>>>>> Two questions:
> >>>>>>>>
> >>>>>>>>    1) Can KVM prevent the guest from accessing the PMU?
> >>>>>>>>
> >>>>>>>>    2) If so, KVM can grant partial access to the PMU, or is it all
> >>>>>>>> or nothing?
> >>>>>>>>
> >>>>>>>> If the answer to both questions is "yes", then it sounds like
> >>>>>>>> LoongArch *requires*
> >>>>>>>> mediated/passthrough support in order to virtualize its PMU.
> >>>>>>>
> >>>>>>> Hi Sean,
> >>>>>>>
> >>>>>>> Thank for your quick response.
> >>>>>>>
> >>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
> >>>>>>> or all to access to the PMU. Only that if one pmu event is granted
> >>>>>>> to VM, host can not access this pmu event again. There must be pmu
> >>>>>>> event switch if host want to.
> >>>>>>
> >>>>>> PMU event is a software entity which won't be shared. did you
> >>>>>> mean if
> >>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
> >>>>>> counter, right?
> >>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
> >>>>> guest, and is not meaningful for host.  Host pmu core does not know
> >>>>> that it is granted to VM, host still think that it owns pmu.
> >>>>
> >>>> That's one issue this patchset tries to solve. Current new mediated
> >>>> x86
> >>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
> >>>> simultaneously. Only when there is no !exclude_guest event on host,
> >>>> guest is allowed to exclusively own the PMU HW resource.
> >>>>
> >>>>
> >>>>>
> >>>>> Just like FPU register, it is shared by VM and host during different
> >>>>> time and it is lately switched. But if IPI or timer interrupt uses
> >>>>> FPU
> >>>>> register on host, there will be the same issue.
> >>>>
> >>>> I didn't fully get your point. When IPI or timer interrupt reach, a
> >>>> VM-exit is triggered to make CPU traps into host first and then the
> >>>> host
> >>> yes, it is.
> >>
> >> This is correct. And this is one of the points that we had debated
> >> internally whether we should do PMU context switch at vcpu loop
> >> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
> >> force VM Exit, which I think happens every 4ms or 1ms, depending on
> >> configuration.
> >>
> >> One of the key reasons we currently propose this is because it is the
> >> same boundary as the legacy PMU, i.e., it would be simple to propose
> >> from the perf subsystem perspective.
> >>
> >> Performance wise, doing PMU context switch at vcpu boundary would be
> >> way better in general. But the downside is that perf sub-system lose
> >> the capability to profile majority of the KVM code (functions) when
> >> guest PMU is enabled.
> >>
> >>>
> >>>> interrupt handler is called. Or are you complaining the executing
> >>>> sequence of switching guest PMU MSRs and these interrupt handler?
> >>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
> >>> path, however there is problem if vPMU is switched during vcpu thread
> >>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
> >>> register in host mode.
> >>
> >> Oh, the IPI/timer irq handler will access PMU registers? I thought
> >> only the host-level NMI handler will access the PMU MSRs since PMI is
> >> registered under NMI.
> >>
> >> In that case, you should disable  IRQ during vcpu context switch. For
> >> NMI, we prevent its handler from accessing the PMU registers. In
> >> particular, we use a per-cpu variable to guard that. So, the
> >> host-level PMI handler for perf sub-system will check the variable
> >> before proceeding.
> >
> > perf core will access pmu hw in tick timer/hrtimer/ipi function call,
> > such as function perf_event_task_tick() is called in tick timer, there
> > are  event_function_call(event, __perf_event_xxx, &value) in file
> > kernel/events/core.c.
> >
> > https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
> >
>
> Just go through functions (not sure if all),  whether
> perf_event_task_tick() or the callbacks of event_function_call() would
> check the event->state first, if the event is in
> PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs would not be touched really.
> In this new proposal, all host events with exclude_guest attribute would
> be put on PERF_EVENT_STATE_INACTIVE sate if guest own the PMU HW
> resource. So I think it's fine.
>

Is there any event on the host still in PERF_EVENT_STATE_ACTIVE?
If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
access the global ctrl MSR.
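
For context, the path in question is roughly (Intel, simplified):

perf_pmu_disable(pmu)
  -> pmu->pmu_disable(pmu)          /* x86_pmu_disable() */
       -> intel_pmu_disable_all()   /* writes MSR_CORE_PERF_GLOBAL_CTRL */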

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  7:10                                           ` Mingwei Zhang
@ 2024-04-23  8:24                                             ` Mi, Dapeng
  2024-04-23  8:51                                               ` maobibo
  2024-04-23 16:50                                               ` Mingwei Zhang
  0 siblings, 2 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-23  8:24 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: maobibo, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/23/2024 3:10 PM, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 11:45 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 4/23/2024 2:08 PM, maobibo wrote:
>>>
>>> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>>>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>>>
>>>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>>>
>>>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>>>> the roles and
>>>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>>>>>> _needs_
>>>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>>>> PMU HW.
>>>>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>>>>> discussion
>>>>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>>>>> KVM doesn't need
>>>>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>>>>> hardware, KVM
>>>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>>>
>>>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>>>>> them.
>>>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>>>> one via
>>>>>>>>>>>>> future discussion.
>>>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>>>> before this code
>>>>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>>>>> still pave over
>>>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>>>> do the clearing.  But perf and KVM need to work together from
>>>>>>>>>>>> the
>>>>>>>>>>>> get go, ie. I
>>>>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>>>>> and vice versa.
>>>>>>>>>>>>
>>>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>>>>>> pmu
>>>>>>>>>>> hardware
>>>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>>>> there are
>>>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>>>> context switch.
>>>>>>>>>> Two questions:
>>>>>>>>>>
>>>>>>>>>>     1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>>>
>>>>>>>>>>     2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>>>>> or nothing?
>>>>>>>>>>
>>>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>>>> LoongArch *requires*
>>>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> Thank for your quick response.
>>>>>>>>>
>>>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>>>>> event switch if host want to.
>>>>>>>> PMU event is a software entity which won't be shared. did you
>>>>>>>> mean if
>>>>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>>>>> counter, right?
>>>>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>>> That's one issue this patchset tries to solve. Current new mediated
>>>>>> x86
>>>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>>>
>>>>>>
>>>>>>> Just like FPU register, it is shared by VM and host during different
>>>>>>> time and it is lately switched. But if IPI or timer interrupt uses
>>>>>>> FPU
>>>>>>> register on host, there will be the same issue.
>>>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>>>> VM-exit is triggered to make CPU traps into host first and then the
>>>>>> host
>>>>> yes, it is.
>>>> This is correct. And this is one of the points that we had debated
>>>> internally whether we should do PMU context switch at vcpu loop
>>>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>>>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>>>> configuration.
>>>>
>>>> One of the key reasons we currently propose this is because it is the
>>>> same boundary as the legacy PMU, i.e., it would be simple to propose
>>>> from the perf subsystem perspective.
>>>>
>>>> Performance wise, doing PMU context switch at vcpu boundary would be
>>>> way better in general. But the downside is that perf sub-system lose
>>>> the capability to profile majority of the KVM code (functions) when
>>>> guest PMU is enabled.
>>>>
>>>>>> interrupt handler is called. Or are you complaining the executing
>>>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>>>> path, however there is problem if vPMU is switched during vcpu thread
>>>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>>>> register in host mode.
>>>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>>>> only the host-level NMI handler will access the PMU MSRs since PMI is
>>>> registered under NMI.
>>>>
>>>> In that case, you should disable  IRQ during vcpu context switch. For
>>>> NMI, we prevent its handler from accessing the PMU registers. In
>>>> particular, we use a per-cpu variable to guard that. So, the
>>>> host-level PMI handler for perf sub-system will check the variable
>>>> before proceeding.
>>> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
>>> such as function perf_event_task_tick() is called in tick timer, there
>>> are  event_function_call(event, __perf_event_xxx, &value) in file
>>> kernel/events/core.c.
>>>
>>> https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
>>>
>> Just go through functions (not sure if all),  whether
>> perf_event_task_tick() or the callbacks of event_function_call() would
>> check the event->state first, if the event is in
>> PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs would not be touched really.
>> In this new proposal, all host events with exclude_guest attribute would
>> be put on PERF_EVENT_STATE_INACTIVE sate if guest own the PMU HW
>> resource. So I think it's fine.
>>
> Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
> If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
> access the global ctrl MSR.

I don't think there is any event in PERF_EVENT_STATE_ACTIVE state on
the host while the guest owns the PMU HW resource.

In the current solution, VM creation fails if there is any system-wide
event without the exclude_guest attribute. If the VM is created
successfully, then when vm-entry happens the helper perf_guest_enter()
puts all host events with the exclude_guest attribute into
PERF_EVENT_STATE_INACTIVE state and blocks the host from creating
system-wide events without the exclude_guest attribute.
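
A conceptual sketch of what perf_guest_enter() does, as described
above (the per-cpu flag and the helper below are assumptions made for
illustration, not the actual RFC code):

static DEFINE_PER_CPU(bool, perf_in_guest);

void perf_guest_enter(void)
{
	/* Called with IRQs disabled on the VM-entry path. */
	lockdep_assert_irqs_disabled();

	/*
	 * Park every host event carrying the exclude_guest attribute:
	 * schedule it out so it sits in PERF_EVENT_STATE_INACTIVE and
	 * holds no counter while the guest runs.
	 */
	perf_sched_out_exclude_guest_events();	/* hypothetical helper */

	/*
	 * Remember that the guest owns the PMU; creating system-wide
	 * events without exclude_guest is rejected until the matching
	 * perf_guest_exit().
	 */
	__this_cpu_write(perf_in_guest, true);
}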



^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  8:24                                             ` Mi, Dapeng
@ 2024-04-23  8:51                                               ` maobibo
  2024-04-23 16:50                                               ` Mingwei Zhang
  1 sibling, 0 replies; 181+ messages in thread
From: maobibo @ 2024-04-23  8:51 UTC (permalink / raw)
  To: Mi, Dapeng, Mingwei Zhang
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024/4/23 下午4:24, Mi, Dapeng wrote:
> 
> On 4/23/2024 3:10 PM, Mingwei Zhang wrote:
>> On Mon, Apr 22, 2024 at 11:45 PM Mi, Dapeng 
>> <dapeng1.mi@linux.intel.com> wrote:
>>>
>>> On 4/23/2024 2:08 PM, maobibo wrote:
>>>>
>>>> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>>>>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>>>>
>>>>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>>>>
>>>>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>>>>> the roles and
>>>>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>>>>>>> _needs_
>>>>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>>>>> PMU HW.
>>>>>>>>>>>>> I don't think this is a statement we want to make, as it 
>>>>>>>>>>>>> opens a
>>>>>>>>>>>>> discussion
>>>>>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>>>>>> KVM doesn't need
>>>>>>>>>>>>> to be on equal footing with perf in terms of 
>>>>>>>>>>>>> owning/managing PMU
>>>>>>>>>>>>> hardware, KVM
>>>>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>>>>> one via
>>>>>>>>>>>>>> future discussion.
>>>>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>>>>> before this code
>>>>>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf 
>>>>>>>>>>>>> can
>>>>>>>>>>>>> still pave over
>>>>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>>>>> do the clearing.  But perf and KVM need to work together from
>>>>>>>>>>>>> the
>>>>>>>>>>>>> get go, ie. I
>>>>>>>>>>>>> don't want KVM doing something without regard to what perf 
>>>>>>>>>>>>> does,
>>>>>>>>>>>>> and vice versa.
>>>>>>>>>>>>>
>>>>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>>>>>>> pmu
>>>>>>>>>>>> hardware
>>>>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>>>>> there are
>>>>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>>>>> context switch.
>>>>>>>>>>> Two questions:
>>>>>>>>>>>
>>>>>>>>>>>     1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>>>>
>>>>>>>>>>>     2) If so, KVM can grant partial access to the PMU, or is 
>>>>>>>>>>> it all
>>>>>>>>>>> or nothing?
>>>>>>>>>>>
>>>>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>>>>> LoongArch *requires*
>>>>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>>>> Hi Sean,
>>>>>>>>>>
>>>>>>>>>> Thank for your quick response.
>>>>>>>>>>
>>>>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant 
>>>>>>>>>> partial
>>>>>>>>>> or all to access to the PMU. Only that if one pmu event is 
>>>>>>>>>> granted
>>>>>>>>>> to VM, host can not access this pmu event again. There must be 
>>>>>>>>>> pmu
>>>>>>>>>> event switch if host want to.
>>>>>>>>> PMU event is a software entity which won't be shared. did you
>>>>>>>>> mean if
>>>>>>>>> a PMU HW counter is granted to VM, then Host can't access the 
>>>>>>>>> PMU HW
>>>>>>>>> counter, right?
>>>>>>>> yes, if PMU HW counter/control is granted to VM. The value comes 
>>>>>>>> from
>>>>>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>>>> That's one issue this patchset tries to solve. Current new mediated
>>>>>>> x86
>>>>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>>>>
>>>>>>>
>>>>>>>> Just like FPU register, it is shared by VM and host during 
>>>>>>>> different
>>>>>>>> time and it is lately switched. But if IPI or timer interrupt uses
>>>>>>>> FPU
>>>>>>>> register on host, there will be the same issue.
>>>>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>>>>> VM-exit is triggered to make CPU traps into host first and then the
>>>>>>> host
>>>>>> yes, it is.
>>>>> This is correct. And this is one of the points that we had debated
>>>>> internally whether we should do PMU context switch at vcpu loop
>>>>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>>>>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>>>>> configuration.
>>>>>
>>>>> One of the key reasons we currently propose this is because it is the
>>>>> same boundary as the legacy PMU, i.e., it would be simple to propose
>>>>> from the perf subsystem perspective.
>>>>>
>>>>> Performance wise, doing PMU context switch at vcpu boundary would be
>>>>> way better in general. But the downside is that perf sub-system lose
>>>>> the capability to profile majority of the KVM code (functions) when
>>>>> guest PMU is enabled.
>>>>>
>>>>>>> interrupt handler is called. Or are you complaining the executing
>>>>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>>>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>>>>> path, however there is problem if vPMU is switched during vcpu thread
>>>>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>>>>> register in host mode.
>>>>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>>>>> only the host-level NMI handler will access the PMU MSRs since PMI is
>>>>> registered under NMI.
>>>>>
>>>>> In that case, you should disable  IRQ during vcpu context switch. For
>>>>> NMI, we prevent its handler from accessing the PMU registers. In
>>>>> particular, we use a per-cpu variable to guard that. So, the
>>>>> host-level PMI handler for perf sub-system will check the variable
>>>>> before proceeding.
>>>> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
>>>> such as function perf_event_task_tick() is called in tick timer, there
>>>> are  event_function_call(event, __perf_event_xxx, &value) in file
>>>> kernel/events/core.c.
>>>>
>>>> https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4 
>>>>
>>>>
>>> Just go through functions (not sure if all),  whether
>>> perf_event_task_tick() or the callbacks of event_function_call() would
>>> check the event->state first, if the event is in
>>> PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs would not be touched really.
>>> In this new proposal, all host events with exclude_guest attribute would
>>> be put on PERF_EVENT_STATE_INACTIVE sate if guest own the PMU HW
>>> resource. So I think it's fine.
>>>
>> Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
>> If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
>> access the global ctrl MSR.
> 
> I don't think there is any event with PERF_EVENT_STATE_ACTIVE state on 
> host when guest owns the PMU HW resource.
> 
> In current solution, VM would fail to create if there is any system-wide 
> event without exclude_guest attribute. If VM is created successfully and 
> when vm-entry happens, the helper perf_guest_enter() would put all host 
> events with exclude_guest attribute into PERF_EVENT_STATE_INACTIVE state 
> and block host to create system-wide events without exclude_guest 
> attribute.
I do not know the perf subsystem well. Can the perf event state be kept 
unchanged? After VM-entry, the HW perf counters are allocated to the VM, 
and their host-side function should already be stopped. It seems that 
the host perf core need not perceive VM-entry/exit at all.
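
For illustration, a rough sketch of the perf_guest_enter() behavior
described in the quoted text above; the per-CPU flag and the sched-out
helper below are made-up names, not existing perf APIs:

static DEFINE_PER_CPU(bool, perf_in_guest);

int perf_guest_enter(void)
{
        /* Refuse if any system-wide !exclude_guest event exists. */
        if (perf_has_include_guest_sys_events())       /* hypothetical check */
                return -EBUSY;

        /*
         * Schedule out every host event with attr.exclude_guest = 1 and
         * leave it in PERF_EVENT_STATE_INACTIVE, so the guest owns the
         * PMU hardware until perf_guest_exit().
         */
        perf_sched_out_exclude_guest_events();         /* hypothetical */

        /*
         * In this sketch, tick/IPI paths such as perf_event_task_tick()
         * and the PMI handler would check this flag before touching any
         * PMU MSR, so the host perf core does not poke the hardware
         * while the guest owns it.
         */
        __this_cpu_write(perf_in_guest, true);
        return 0;
}

Under such a scheme the host events simply stay in
PERF_EVENT_STATE_INACTIVE while the guest owns the PMU; the host perf
core only needs the guard flag, not full awareness of every VM
enter/exit.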


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  4:23                                     ` Mingwei Zhang
  2024-04-23  6:08                                       ` maobibo
@ 2024-04-23 12:12                                       ` maobibo
  2024-04-23 17:02                                         ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: maobibo @ 2024-04-23 12:12 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Mi, Dapeng, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024/4/23 12:23 PM, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@loongson.cn> wrote:
>>
>>
>>
>> On 2024/4/23 11:13 AM, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 10:44 AM, Mi, Dapeng wrote:
>>>>>
>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 1:01 AM, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>> On 2024/4/16 6:45 AM, Sean Christopherson wrote:
>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>> <seanjc@google.com> wrote:
>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>> the roles and
>>>>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>
>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>> the PMCs might not
>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>
>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>> doing so makes the
>>>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>
>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>> PMU HW.
>>>>>>>>>
>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>> discussion
>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>> KVM doesn't need
>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>> hardware, KVM
>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>
>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>> one via
>>>>>>>>>> future discussion.
>>>>>>>>>
>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>> before this code
>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>> still pave over
>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>> do the clearing.  But perf and KVM need to work together from the
>>>>>>>>> get go, ie. I
>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>> and vice versa.
>>>>>>>>>
>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>>>>> hardware
>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>> there are
>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>> context switch.
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>>    1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>
>>>>>>>    2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>> or nothing?
>>>>>>>
>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>> LoongArch *requires*
>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thank for your quick response.
>>>>>>
>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>> event switch if host want to.
>>>>>
>>>>> PMU event is a software entity which won't be shared. did you mean if
>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>> counter, right?
>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>> that it is granted to VM, host still think that it owns pmu.
>>>
>>> That's one issue this patchset tries to solve. Current new mediated x86
>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>> simultaneously. Only when there is no !exclude_guest event on host,
>>> guest is allowed to exclusively own the PMU HW resource.
>>>
>>>
>>>>
>>>> Just like FPU register, it is shared by VM and host during different
>>>> time and it is lately switched. But if IPI or timer interrupt uses FPU
>>>> register on host, there will be the same issue.
>>>
>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>> VM-exit is triggered to make CPU traps into host first and then the host
>> yes, it is.
> 
> This is correct. And this is one of the points that we had debated
> internally whether we should do PMU context switch at vcpu loop
> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
> force VM Exit, which I think happens every 4ms or 1ms, depending on
> configuration.
> 
> One of the key reasons we currently propose this is because it is the
> same boundary as the legacy PMU, i.e., it would be simple to propose
> from the perf subsystem perspective.
> 
> Performance wise, doing PMU context switch at vcpu boundary would be
> way better in general. But the downside is that perf sub-system lose
> the capability to profile majority of the KVM code (functions) when
> guest PMU is enabled.
> 
>>
>>> interrupt handler is called. Or are you complaining the executing
>>> sequence of switching guest PMU MSRs and these interrupt handler?
>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>> path, however there is problem if vPMU is switched during vcpu thread
>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>> register in host mode.
> 
> Oh, the IPI/timer irq handler will access PMU registers? I thought
> only the host-level NMI handler will access the PMU MSRs since PMI is
> registered under NMI.
> 
> In that case, you should disable  IRQ during vcpu context switch. For
> NMI, we prevent its handler from accessing the PMU registers. In
> particular, we use a per-cpu variable to guard that. So, the
> host-level PMI handler for perf sub-system will check the variable
> before proceeding.
> 
>>
>> In general it will be better if the switch is done in vcpu thread
>> sched-out/sched-in, else there is requirement to profile kvm
>> hypervisor.Even there is such requirement, it is only one option. In
>> most conditions, it will better if time of VM context exit is small.
>>
> Performance wise, agree, but there will be debate on perf
> functionality loss at the host level.
> 
> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> boundary normally, but doing it at VM Enter/Exit boundary when host is
> profiling KVM kernel module. So, dynamically adjusting PMU context
> switch location could be an option.
If there are two VMs both with the PMU enabled but the host PMU is not 
in use, the PMU context switch should be done in the vcpu thread 
sched-out path.

If the host PMU is used as well, we can choose whether the PMU switch is 
done in the VM-exit path or in the vcpu thread sched-out path.

> 
>>>
>>>
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>>>>>>> this:
>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>>> kvm_guest_cbs
>>>>>>>> = {
>>>>>>>>           .state                  = kvm_guest_state,
>>>>>>>>           .get_ip                 = kvm_guest_get_ip,
>>>>>>>>           .handle_intel_pt_intr   = NULL,
>>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>>    };
>>>>>>>>
>>>>>>>> By the way, I do not know should the callback handler be triggered
>>>>>>>> in perf
>>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>>> triggered in
>>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>>
>>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>>> "fighting" over
>>>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>>>> for KVM because
>>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>>> the guest.  And
>>>>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>>>>> flows that
>>>>>>> exists purely to try to play nice with KVM.
>>>>>> With existing pmu core code, in tick timer interrupt or IPI function
>>>>>> call interrupt pmu hw may be accessed by host when VM is running and
>>>>>> pmu is already granted to guest. KVM can not intercept host
>>>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>>>> problem.
>>>>>>
>>>>>> Regards
>>>>>> Bibo Mao
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23  8:24                                             ` Mi, Dapeng
  2024-04-23  8:51                                               ` maobibo
@ 2024-04-23 16:50                                               ` Mingwei Zhang
  1 sibling, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-23 16:50 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: maobibo, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

> > Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
> > If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
> > access the global ctrl MSR.
>
> I don't think there is any event with PERF_EVENT_STATE_ACTIVE state on
> host when guest owns the PMU HW resource.
>
> In current solution, VM would fail to create if there is any system-wide
> event without exclude_guest attribute. If VM is created successfully and
> when vm-entry happens, the helper perf_guest_enter() would put all host
> events with exclude_guest attribute into PERF_EVENT_STATE_INACTIVE state
> and block host to create system-wide events without exclude_guest attribute.
>

Yeah, that's perfect.

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23 12:12                                       ` maobibo
@ 2024-04-23 17:02                                         ` Mingwei Zhang
  2024-04-24  1:07                                           ` maobibo
  2024-04-24  8:18                                           ` Mi, Dapeng
  0 siblings, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-23 17:02 UTC (permalink / raw)
  To: maobibo
  Cc: Mi, Dapeng, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

> >
> > Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> > boundary normally, but doing it at VM Enter/Exit boundary when host is
> > profiling KVM kernel module. So, dynamically adjusting PMU context
> > switch location could be an option.
> If there are two VMs with pmu enabled both, however host PMU is not
> enabled. PMU context switch should be done in vcpu thread sched-out path.
>
> If host pmu is used also, we can choose whether PMU switch should be
> done in vm exit path or vcpu thread sched-out path.
>

The host PMU is always enabled, i.e., Linux currently does not support
the KVM PMU running standalone. I guess what you mean is that there are
no active perf_events on the host side. Allowing the PMU context switch
to drift from the VM-enter/exit boundary to the vcpu loop boundary by
checking host-side events might be a good option. We can keep the
discussion going, but I won't propose that in v2.

I guess we are off topic. Sean's suggestion is that we should put
"perf" and "kvm" together while doing the context switch. I think this
is quite reasonable regardless of the PMU context switch location.

To execute this, I am thinking about adding a parameter or return
value to perf_guest_enter() so that once it returns back to KVM, KVM
gets to know which counters are active/inactive/cleared from the host
side. Knowing that, KVM can do the context switch more efficiently.
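
A rough sketch of what that interface could look like; the u64 return
value and the host_owned_counters field are made up for illustration,
only perf_guest_enter() itself is the helper discussed in this series:

/* Hypothetical extension: perf reports which HW counters its
 * exclude_guest events were occupying when they were scheduled out. */
u64 perf_guest_enter(void);    /* bitmask of host-owned counters */

static void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu)
{
        struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

        /*
         * Remember the host-side occupancy so the VM-exit path can
         * decide whether a switch is needed at all, and can restrict the
         * save/restore work to the counters that are actually contended.
         */
        pmu->host_owned_counters = perf_guest_enter();
}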

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23 17:02                                         ` Mingwei Zhang
@ 2024-04-24  1:07                                           ` maobibo
  2024-04-24  8:18                                           ` Mi, Dapeng
  1 sibling, 0 replies; 181+ messages in thread
From: maobibo @ 2024-04-24  1:07 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Mi, Dapeng, Sean Christopherson, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024/4/24 1:02 AM, Mingwei Zhang wrote:
>>>
>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>> switch location could be an option.
>> If there are two VMs with pmu enabled both, however host PMU is not
>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>
>> If host pmu is used also, we can choose whether PMU switch should be
>> done in vm exit path or vcpu thread sched-out path.
>>
> 
> host PMU is always enabled, ie., Linux currently does not support KVM
> PMU running standalone. I guess what you mean is there are no active
> perf_events on the host side. Allowing a PMU context switch drifting
> from vm-enter/exit boundary to vcpu loop boundary by checking host
> side events might be a good option. We can keep the discussion, but I
> won't propose that in v2.
> 
> I guess we are off topic. Sean's suggestion is that we should put
> "perf" and "kvm" together while doing the context switch. I think this
> is quite reasonable regardless of the PMU context switch location.
> 
> To execute this, I am thinking about adding a parameter or return
> value to perf_guest_enter() so that once it returns back to KVM, KVM
> gets to know which counters are active/inactive/cleared from the host
> side. Knowing that, KVM can do the context switch more efficiently.
yeap, that sounds great.

Regards
Bibo Mao

> 
> Thanks.
> -Mingwei
> 


^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-23 17:02                                         ` Mingwei Zhang
  2024-04-24  1:07                                           ` maobibo
@ 2024-04-24  8:18                                           ` Mi, Dapeng
  2024-04-24 15:00                                             ` Sean Christopherson
  1 sibling, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-24  8:18 UTC (permalink / raw)
  To: Mingwei Zhang, maobibo
  Cc: Sean Christopherson, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>> switch location could be an option.
>> If there are two VMs with pmu enabled both, however host PMU is not
>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>
>> If host pmu is used also, we can choose whether PMU switch should be
>> done in vm exit path or vcpu thread sched-out path.
>>
> host PMU is always enabled, ie., Linux currently does not support KVM
> PMU running standalone. I guess what you mean is there are no active
> perf_events on the host side. Allowing a PMU context switch drifting
> from vm-enter/exit boundary to vcpu loop boundary by checking host
> side events might be a good option. We can keep the discussion, but I
> won't propose that in v2.

I doubt whether this deferring is really doable. It still makes the 
host lose most of its capability to profile KVM. Per my understanding, 
most of the KVM overhead happens in the vcpu loop, more precisely in 
VM-exit handling. We have no idea when the host will want to create a 
perf event to profile KVM; it could be at any time.


>
> I guess we are off topic. Sean's suggestion is that we should put
> "perf" and "kvm" together while doing the context switch. I think this
> is quite reasonable regardless of the PMU context switch location.
>
> To execute this, I am thinking about adding a parameter or return
> value to perf_guest_enter() so that once it returns back to KVM, KVM
> gets to know which counters are active/inactive/cleared from the host
> side. Knowing that, KVM can do the context switch more efficiently.
>
> Thanks.
> -Mingwei
>

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-24  8:18                                           ` Mi, Dapeng
@ 2024-04-24 15:00                                             ` Sean Christopherson
  2024-04-25  3:55                                               ` Mi, Dapeng
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-24 15:00 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, maobibo, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Wed, Apr 24, 2024, Dapeng Mi wrote:
> 
> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> > > > Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> > > > boundary normally, but doing it at VM Enter/Exit boundary when host is
> > > > profiling KVM kernel module. So, dynamically adjusting PMU context
> > > > switch location could be an option.
> > > If there are two VMs with pmu enabled both, however host PMU is not
> > > enabled. PMU context switch should be done in vcpu thread sched-out path.
> > > 
> > > If host pmu is used also, we can choose whether PMU switch should be
> > > done in vm exit path or vcpu thread sched-out path.
> > > 
> > host PMU is always enabled, ie., Linux currently does not support KVM
> > PMU running standalone. I guess what you mean is there are no active
> > perf_events on the host side. Allowing a PMU context switch drifting
> > from vm-enter/exit boundary to vcpu loop boundary by checking host
> > side events might be a good option. We can keep the discussion, but I
> > won't propose that in v2.
> 
> I suspect if it's really doable to do this deferring. This still makes host
> lose the most of capability to profile KVM. Per my understanding, most of
> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
> We have no idea when host want to create perf event to profile KVM, it could
> be at any time.

No, the idea is that KVM will load host PMU state asap, but only when host PMU
state actually needs to be loaded, i.e. only when there are relevant host events.

If there are no host perf events, KVM keeps guest PMU state loaded for the entire
KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
event exists (or comes along), KVM context switches the PMU at VM-Enter/VM-Exit,
i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
for the guest while host perf events are active.

My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
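
(Not the sketch linked above; just a toy illustration of the policy,
with made-up helper names:)

static void kvm_pmu_after_vmexit(struct kvm_vcpu *vcpu)
{
        /*
         * Only degrade the guest when the host actually wants the
         * hardware: if relevant (exclude_guest) host events exist, save
         * the guest PMU state and reload the host state right away.
         */
        if (host_has_relevant_perf_events())            /* made-up query */
                kvm_pmu_put_guest_context(vcpu);        /* made-up helper */

        /*
         * Otherwise keep the guest PMU state loaded for the whole
         * KVM_RUN loop and only switch it at vcpu_put()/sched-out.
         */
}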

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-24 15:00                                             ` Sean Christopherson
@ 2024-04-25  3:55                                               ` Mi, Dapeng
  2024-04-25  4:24                                                 ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-25  3:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, maobibo, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>> switch location could be an option.
>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>
>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>> done in vm exit path or vcpu thread sched-out path.
>>>>
>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>> PMU running standalone. I guess what you mean is there are no active
>>> perf_events on the host side. Allowing a PMU context switch drifting
>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>> side events might be a good option. We can keep the discussion, but I
>>> won't propose that in v2.
>> I suspect if it's really doable to do this deferring. This still makes host
>> lose the most of capability to profile KVM. Per my understanding, most of
>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>> We have no idea when host want to create perf event to profile KVM, it could
>> be at any time.
> No, the idea is that KVM will load host PMU state asap, but only when host PMU
> state actually needs to be loaded, i.e. only when there are relevant host events.
>
> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> for the guest while host perf events are active.

I see. So KVM needs to provide a callback which is called in the IPI 
handler. The callback needs to switch the PMU state before perf really 
enables the host event and touches the PMU MSRs. And only perf events 
with the exclude_guest attribute are allowed to be created on the host. 
Thanks.
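
As an illustration, one possible shape for registering such a callback
is to extend the perf_guest_info_callbacks struct that KVM already
registers with perf; the .put_guest_pmu member and kvm_put_guest_pmu()
are hypothetical:

static struct perf_guest_info_callbacks kvm_guest_cbs = {
        .state                  = kvm_guest_state,
        .get_ip                 = kvm_guest_get_ip,
        .handle_intel_pt_intr   = NULL,
        .put_guest_pmu          = kvm_put_guest_pmu,   /* new, hypothetical */
};

The perf side would then invoke .put_guest_pmu from the IPI path before
programming any counter MSR, and KVM's implementation would save the
guest PMU MSRs and restore the host values.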


>
> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25  3:55                                               ` Mi, Dapeng
@ 2024-04-25  4:24                                                 ` Mingwei Zhang
  2024-04-25 16:13                                                   ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-25  4:24 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> > On Wed, Apr 24, 2024, Dapeng Mi wrote:
> >> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> >>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> >>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
> >>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
> >>>>> switch location could be an option.
> >>>> If there are two VMs with pmu enabled both, however host PMU is not
> >>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
> >>>>
> >>>> If host pmu is used also, we can choose whether PMU switch should be
> >>>> done in vm exit path or vcpu thread sched-out path.
> >>>>
> >>> host PMU is always enabled, ie., Linux currently does not support KVM
> >>> PMU running standalone. I guess what you mean is there are no active
> >>> perf_events on the host side. Allowing a PMU context switch drifting
> >>> from vm-enter/exit boundary to vcpu loop boundary by checking host
> >>> side events might be a good option. We can keep the discussion, but I
> >>> won't propose that in v2.
> >> I suspect if it's really doable to do this deferring. This still makes host
> >> lose the most of capability to profile KVM. Per my understanding, most of
> >> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
> >> We have no idea when host want to create perf event to profile KVM, it could
> >> be at any time.
> > No, the idea is that KVM will load host PMU state asap, but only when host PMU
> > state actually needs to be loaded, i.e. only when there are relevant host events.
> >
> > If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> > KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
> > events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
> > i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> > for the guest while host perf events are active.
>
> I see. So KVM needs to provide a callback which needs to be called in
> the IPI handler. The KVM callback needs to be called to switch PMU state
> before perf really enabling host event and touching PMU MSRs. And only
> the perf event with exclude_guest attribute is allowed to create on
> host. Thanks.

Do we really need a KVM callback? I think that is one option.

Immediately after VMEXIT, KVM will check whether there are "host perf
events". If so, do the PMU context switch immediately. Otherwise, keep
deferring the context switch to the end of the vCPU loop.

Detecting if there are "host perf events" would be interesting. The
"host perf events" refer to the perf_events on the host that are
active and assigned with HW counters and that are saved when context
switching to the guest PMU. I think getting those events could be done
by fetching the bitmaps in cpuc. I have to look into the details. But
at the time of VMEXIT, kvm should already have that information, so it
can immediately decide whether to do the PMU context switch or not.

Oh, but when control is executing within the run loop and host-level
profiling starts, say 'perf record -a ...', it will generate an IPI to
all CPUs. Maybe that's when we need a callback, so the KVM guest PMU
context gets preempted for the host-level profiling. Gah..

hmm, not a fan of that. That means the host can poke the guest PMU
context at any time and cause higher overhead. But I admit it is much
better than the current approach.

The only thing is that: any command like 'perf record/stat -a' shot in
dark corners of the host can preempt guest PMUs of _all_ running VMs.
So, to alleviate that, maybe a module parameter that disables this
"preemption" is possible? This should fit scenarios where we don't
want guest PMU to be preempted outside of the vCPU loop?
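
If that route were taken, the knob itself could be as simple as a KVM
module parameter (the name below is made up):

/* When false, host-wide perf sessions do not force a mid-loop PMU
 * switch; the guest keeps the PMU until the vCPU loop boundary, at the
 * cost of blind spots in host profiles. */
static bool allow_guest_pmu_preemption = true;
module_param(allow_guest_pmu_preemption, bool, 0444);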

Thanks. Regards
-Mingwei


>
>
> >
> > My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25  4:24                                                 ` Mingwei Zhang
@ 2024-04-25 16:13                                                   ` Liang, Kan
  2024-04-25 20:16                                                     ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-25 16:13 UTC (permalink / raw)
  To: Mingwei Zhang, Mi, Dapeng
  Cc: Sean Christopherson, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>>
>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>> switch location could be an option.
>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>
>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>
>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>> side events might be a good option. We can keep the discussion, but I
>>>>> won't propose that in v2.
>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>> be at any time.
>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>
>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>> for the guest while host perf events are active.
>>
>> I see. So KVM needs to provide a callback which needs to be called in
>> the IPI handler. The KVM callback needs to be called to switch PMU state
>> before perf really enabling host event and touching PMU MSRs. And only
>> the perf event with exclude_guest attribute is allowed to create on
>> host. Thanks.
> 
> Do we really need a KVM callback? I think that is one option.
> 
> Immediately after VMEXIT, KVM will check whether there are "host perf
> events". If so, do the PMU context switch immediately. Otherwise, keep
> deferring the context switch to the end of vPMU loop.
> 
> Detecting if there are "host perf events" would be interesting. The
> "host perf events" refer to the perf_events on the host that are
> active and assigned with HW counters and that are saved when context
> switching to the guest PMU. I think getting those events could be done
> by fetching the bitmaps in cpuc.

The cpuc is an arch-specific structure. I don't think it can be
accessed from the generic code. You would probably have to implement
arch-specific functions to fetch the bitmaps. It probably isn't worth it.

You may check the pinned_groups and flexible_groups to understand
whether there are host perf events which may be scheduled at VM-exit.
But that will not tell you the idx of the counters, which can only be
known when the host event is really scheduled.
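
Something along those lines could be as small as the check below,
keeping in mind that perf_event_groups_empty() is currently internal to
kernel/events/core.c, so a sketch like this would have to live there (or
the helper would need to be exported):

/* Do we have host events that may want to be scheduled at VM-exit?
 * This says nothing about which counters they would eventually get. */
static bool perf_has_candidate_host_events(struct perf_event_context *ctx)
{
        return !perf_event_groups_empty(&ctx->pinned_groups) ||
               !perf_event_groups_empty(&ctx->flexible_groups);
}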

> I have to look into the details. But
> at the time of VMEXIT, kvm should already have that information, so it
> can immediately decide whether to do the PMU context switch or not.
> 
> oh, but when the control is executing within the run loop, a
> host-level profiling starts, say 'perf record -a ...', it will
> generate an IPI to all CPUs. Maybe that's when we need a callback so
> the KVM guest PMU context gets preempted for the host-level profiling.
> Gah..
> 
> hmm, not a fan of that. That means the host can poke the guest PMU
> context at any time and cause higher overhead. But I admit it is much
> better than the current approach.
> 
> The only thing is that: any command like 'perf record/stat -a' shot in
> dark corners of the host can preempt guest PMUs of _all_ running VMs.
> So, to alleviate that, maybe a module parameter that disables this
> "preemption" is possible? This should fit scenarios where we don't
> want guest PMU to be preempted outside of the vCPU loop?
> 

It should not happen. In the current implementation, perf rejects all
!exclude_guest system-wide event creation while a guest with the vPMU
is running.
However, it's possible to create an exclude_guest system-wide event at
any time. KVM cannot use the information from VM-entry to decide whether
there will be active perf events at VM-exit.

perf_guest_exit() will reload the host state, and it's impossible to
save the guest state after that. We may need a KVM callback, so that
perf can tell KVM to save the guest state before perf reloads the host
state.

Thanks,
Kan
>>
>>
>>>
>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25 16:13                                                   ` Liang, Kan
@ 2024-04-25 20:16                                                     ` Mingwei Zhang
  2024-04-25 20:43                                                       ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-25 20:16 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mi, Dapeng, Sean Christopherson, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao

On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
> > On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >>
> >> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> >>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
> >>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> >>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> >>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
> >>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
> >>>>>>> switch location could be an option.
> >>>>>> If there are two VMs with pmu enabled both, however host PMU is not
> >>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
> >>>>>>
> >>>>>> If host pmu is used also, we can choose whether PMU switch should be
> >>>>>> done in vm exit path or vcpu thread sched-out path.
> >>>>>>
> >>>>> host PMU is always enabled, ie., Linux currently does not support KVM
> >>>>> PMU running standalone. I guess what you mean is there are no active
> >>>>> perf_events on the host side. Allowing a PMU context switch drifting
> >>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
> >>>>> side events might be a good option. We can keep the discussion, but I
> >>>>> won't propose that in v2.
> >>>> I suspect if it's really doable to do this deferring. This still makes host
> >>>> lose the most of capability to profile KVM. Per my understanding, most of
> >>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
> >>>> We have no idea when host want to create perf event to profile KVM, it could
> >>>> be at any time.
> >>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
> >>> state actually needs to be loaded, i.e. only when there are relevant host events.
> >>>
> >>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> >>> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
> >>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
> >>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> >>> for the guest while host perf events are active.
> >>
> >> I see. So KVM needs to provide a callback which needs to be called in
> >> the IPI handler. The KVM callback needs to be called to switch PMU state
> >> before perf really enabling host event and touching PMU MSRs. And only
> >> the perf event with exclude_guest attribute is allowed to create on
> >> host. Thanks.
> >
> > Do we really need a KVM callback? I think that is one option.
> >
> > Immediately after VMEXIT, KVM will check whether there are "host perf
> > events". If so, do the PMU context switch immediately. Otherwise, keep
> > deferring the context switch to the end of vPMU loop.
> >
> > Detecting if there are "host perf events" would be interesting. The
> > "host perf events" refer to the perf_events on the host that are
> > active and assigned with HW counters and that are saved when context
> > switching to the guest PMU. I think getting those events could be done
> > by fetching the bitmaps in cpuc.
>
> The cpuc is ARCH specific structure. I don't think it can be get in the
> generic code. You probably have to implement ARCH specific functions to
> fetch the bitmaps. It probably won't worth it.
>
> You may check the pinned_groups and flexible_groups to understand if
> there are host perf events which may be scheduled when VM-exit. But it
> will not tell the idx of the counters which can only be got when the
> host event is really scheduled.
>
> > I have to look into the details. But
> > at the time of VMEXIT, kvm should already have that information, so it
> > can immediately decide whether to do the PMU context switch or not.
> >
> > oh, but when the control is executing within the run loop, a
> > host-level profiling starts, say 'perf record -a ...', it will
> > generate an IPI to all CPUs. Maybe that's when we need a callback so
> > the KVM guest PMU context gets preempted for the host-level profiling.
> > Gah..
> >
> > hmm, not a fan of that. That means the host can poke the guest PMU
> > context at any time and cause higher overhead. But I admit it is much
> > better than the current approach.
> >
> > The only thing is that: any command like 'perf record/stat -a' shot in
> > dark corners of the host can preempt guest PMUs of _all_ running VMs.
> > So, to alleviate that, maybe a module parameter that disables this
> > "preemption" is possible? This should fit scenarios where we don't
> > want guest PMU to be preempted outside of the vCPU loop?
> >
>
> It should not happen. For the current implementation, perf rejects all
> the !exclude_guest system-wide event creation if a guest with the vPMU
> is running.
> However, it's possible to create an exclude_guest system-wide event at
> any time. KVM cannot use the information from the VM-entry to decide if
> there will be active perf events in the VM-exit.

Hmm, why not? If there is any exclude_guest system-wide event,
perf_guest_enter() can return something to tell KVM "hey, some active
host events are swapped out. they are originally in counter #2 and
#3". If so, at the time when perf_guest_enter() returns, KVM will ack
that and keep it in its pmu data structure.

Now, when context switching back to the host right at VMEXIT, KVM will
check this data and see whether the host perf context has something
active (of course, they are all exclude_guest events). If not, defer the
context switch to the vcpu boundary. Otherwise, do the proper PMU
context switch while respecting the occupied counter positions on the
host side, i.e., avoid doubling the work on the KVM side.

Kan, any suggestions on the above approach? I totally understand that
there might be some difficulty, since the perf subsystem works in
several layers and fetching the low-level mapping is obviously
arch-specific work. If that is difficult, we can split the work into two
phases: 1) phase #1, just ask perf to tell KVM whether there are active
exclude_guest events swapped out; 2) phase #2, ask perf to tell KVM
their (low-level) counter indices.

Thanks.
-Mingwei

>
> The perf_guest_exit() will reload the host state. It's impossible to
> save the guest state after that. We may need a KVM callback. So perf can
> tell KVM whether to save the guest state before perf reloads the host state.
>
> Thanks,
> Kan
> >>
> >>
> >>>
> >>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
> >

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25 20:16                                                     ` Mingwei Zhang
@ 2024-04-25 20:43                                                       ` Liang, Kan
  2024-04-25 21:46                                                         ` Sean Christopherson
  2024-04-26  1:50                                                         ` Mi, Dapeng
  0 siblings, 2 replies; 181+ messages in thread
From: Liang, Kan @ 2024-04-25 20:43 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Mi, Dapeng, Sean Christopherson, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao



On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>
>>>>
>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>>>> switch location could be an option.
>>>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>>>
>>>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>>>
>>>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>>>> side events might be a good option. We can keep the discussion, but I
>>>>>>> won't propose that in v2.
>>>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>>>> be at any time.
>>>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>>>
>>>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>>>> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
>>>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>>>> for the guest while host perf events are active.
>>>>
>>>> I see. So KVM needs to provide a callback which needs to be called in
>>>> the IPI handler. The KVM callback needs to be called to switch PMU state
>>>> before perf really enabling host event and touching PMU MSRs. And only
>>>> the perf event with exclude_guest attribute is allowed to create on
>>>> host. Thanks.
>>>
>>> Do we really need a KVM callback? I think that is one option.
>>>
>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>> deferring the context switch to the end of vPMU loop.
>>>
>>> Detecting if there are "host perf events" would be interesting. The
>>> "host perf events" refer to the perf_events on the host that are
>>> active and assigned with HW counters and that are saved when context
>>> switching to the guest PMU. I think getting those events could be done
>>> by fetching the bitmaps in cpuc.
>>
>> The cpuc is ARCH specific structure. I don't think it can be get in the
>> generic code. You probably have to implement ARCH specific functions to
>> fetch the bitmaps. It probably won't worth it.
>>
>> You may check the pinned_groups and flexible_groups to understand if
>> there are host perf events which may be scheduled when VM-exit. But it
>> will not tell the idx of the counters which can only be got when the
>> host event is really scheduled.
>>
>>> I have to look into the details. But
>>> at the time of VMEXIT, kvm should already have that information, so it
>>> can immediately decide whether to do the PMU context switch or not.
>>>
>>> oh, but when the control is executing within the run loop, a
>>> host-level profiling starts, say 'perf record -a ...', it will
>>> generate an IPI to all CPUs. Maybe that's when we need a callback so
>>> the KVM guest PMU context gets preempted for the host-level profiling.
>>> Gah..
>>>
>>> hmm, not a fan of that. That means the host can poke the guest PMU
>>> context at any time and cause higher overhead. But I admit it is much
>>> better than the current approach.
>>>
>>> The only thing is that: any command like 'perf record/stat -a' shot in
>>> dark corners of the host can preempt guest PMUs of _all_ running VMs.
>>> So, to alleviate that, maybe a module parameter that disables this
>>> "preemption" is possible? This should fit scenarios where we don't
>>> want guest PMU to be preempted outside of the vCPU loop?
>>>
>>
>> It should not happen. For the current implementation, perf rejects all
>> the !exclude_guest system-wide event creation if a guest with the vPMU
>> is running.
>> However, it's possible to create an exclude_guest system-wide event at
>> any time. KVM cannot use the information from the VM-entry to decide if
>> there will be active perf events in the VM-exit.
> 
> Hmm, why not? If there is any exclude_guest system-wide event,
> perf_guest_enter() can return something to tell KVM "hey, some active
> host events are swapped out. they are originally in counter #2 and
> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> that and keep it in its pmu data structure.

I think it's possible that someone creates !exclude_guest event after
the perf_guest_enter(). The stale information is saved in the KVM. Perf
will schedule the event in the next perf_guest_exit(). KVM will not know it.

> 
> Now, when doing context switching back to host at just VMEXIT, KVM
> will check this data and see if host perf context has something active
> (of course, they are all exclude_guest events). If not, deferring the
> context switch to vcpu boundary. Otherwise, do the proper PMU context
> switching by respecting the occupied counter positions on the host
> side, i.e., avoid doubling the work on the KVM side.
> 
> Kan, any suggestion on the above approach? 

I think we can only know the accurate event list at perf_guest_exit().
You may check the pinned_groups and flexible_groups, which tell if there
are candidate events.

> Totally understand that
> there might be some difficulty, since perf subsystem works in several
> layers and obviously fetching low-level mapping is arch specific work.
> If that is difficult, we can split the work in two phases: 1) phase
> #1, just ask perf to tell kvm if there are active exclude_guest events
> swapped out; 2) phase #2, ask perf to tell their (low-level) counter
> indices.
>

If you want an accurate counter mask, changes in the arch-specific code
are required. Two phases sound good to me.

Besides the perf changes, I think KVM should also track which counters
need to be saved/restored. That information can be derived from the
EventSel interception.
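
On the KVM side that tracking could be a small addition to the EventSel
write interception, along these lines (the function name and the
guest_counters_in_use bitmap, an assumed DECLARE_BITMAP field, are made
up; gp_counters[] and ARCH_PERFMON_EVENTSEL_ENABLE are existing
KVM/x86 definitions):

static void intel_pmu_intercept_eventsel(struct kvm_pmu *pmu, u32 idx, u64 data)
{
        pmu->gp_counters[idx].eventsel = data;

        /* Remember which GP counters the guest has actually armed, so
         * the context switch only saves/restores those. */
        if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
                __set_bit(idx, pmu->guest_counters_in_use);
        else
                __clear_bit(idx, pmu->guest_counters_in_use);
}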

Thanks,
Kan
>>
>> The perf_guest_exit() will reload the host state. It's impossible to
>> save the guest state after that. We may need a KVM callback. So perf can
>> tell KVM whether to save the guest state before perf reloads the host state.
>>
>> Thanks,
>> Kan
>>>>
>>>>
>>>>>
>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
>>>
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25 20:43                                                       ` Liang, Kan
@ 2024-04-25 21:46                                                         ` Sean Christopherson
  2024-04-26  1:46                                                           ` Mi, Dapeng
  2024-04-26 13:53                                                           ` Liang, Kan
  2024-04-26  1:50                                                         ` Mi, Dapeng
  1 sibling, 2 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-04-25 21:46 UTC (permalink / raw)
  To: Kan Liang
  Cc: Mingwei Zhang, Dapeng Mi, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Thu, Apr 25, 2024, Kan Liang wrote:
> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> > On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> It should not happen. For the current implementation, perf rejects all
> >> the !exclude_guest system-wide event creation if a guest with the vPMU
> >> is running.
> >> However, it's possible to create an exclude_guest system-wide event at
> >> any time. KVM cannot use the information from the VM-entry to decide if
> >> there will be active perf events in the VM-exit.
> > 
> > Hmm, why not? If there is any exclude_guest system-wide event,
> > perf_guest_enter() can return something to tell KVM "hey, some active
> > host events are swapped out. they are originally in counter #2 and
> > #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> > that and keep it in its pmu data structure.
> 
> I think it's possible that someone creates !exclude_guest event after

I assume you mean an exclude_guest=1 event?  Because perf should be in a state
where it rejects exclude_guest=0 events.

> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> will schedule the event in the next perf_guest_exit(). KVM will not know it.

Ya, the creation of an event on a CPU that currently has guest PMU state loaded
is what I had in mind when I suggested a callback in my sketch:

 :  D. Add a perf callback that is invoked from IRQ context when perf wants to
 :     configure a new PMU-based events, *before* actually programming the MSRs,
 :     and have KVM's callback put the guest PMU state

It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
into KVM.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25 21:46                                                         ` Sean Christopherson
@ 2024-04-26  1:46                                                           ` Mi, Dapeng
  2024-04-26  3:12                                                             ` Mingwei Zhang
  2024-04-26 13:53                                                           ` Liang, Kan
  1 sibling, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-26  1:46 UTC (permalink / raw)
  To: Sean Christopherson, Kan Liang
  Cc: Mingwei Zhang, maobibo, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/26/2024 5:46 AM, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Kan Liang wrote:
>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> It should not happen. For the current implementation, perf rejects all
>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>> is running.
>>>> However, it's possible to create an exclude_guest system-wide event at
>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>> there will be active perf events in the VM-exit.
>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>> host events are swapped out. they are originally in counter #2 and
>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>> that and keep it in its pmu data structure.
>> I think it's possible that someone creates !exclude_guest event after
> I assume you mean an exclude_guest=1 event?  Because perf should be in a state
> where it rejects exclude_guest=0 events.

Yes, it should be an exclude_guest=1 event; creation of perf events 
without the exclude_guest attribute would be blocked in the v2 patches 
which we are working on.


>
>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> is what I had in mind when I suggested a callback in my sketch:
>
>   :  D. Add a perf callback that is invoked from IRQ context when perf wants to
>   :     configure a new PMU-based events, *before* actually programming the MSRs,
>   :     and have KVM's callback put the guest PMU state


When the host creates a perf event with the exclude_guest attribute to 
profile KVM/VMM user space, the vCPU process could be executing in three 
places:

1. in guest mode (non-root mode)

2. inside the vcpu loop

3. outside the vcpu loop

For case 3 the PMU state has already been switched to the host state, so 
we don't need to consider it and only care about cases 1 and 2.

when host creates a perf event with exclude_guest attribute to profile 
KVM/VMM user space,  an IPI is triggered to enable the perf event 
eventually like the following code shows.

event_function_call(event, __perf_event_enable, NULL);

For case 1, a VM-exit is triggered, KVM starts to process the VM-exit and
then runs the IPI irq handler, specifically __perf_event_enable(), to
enable the perf event.

For case 2, the IPI irq handler preempts the vcpu-loop and calls
__perf_event_enable() to enable the perf event.

So IMO KVM just needs to provide a callback to switch the guest/host PMU
state, and __perf_event_enable() calls this callback before really
touching the PMU MSRs.
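
Something like the following, purely as a sketch; the function and
variable names here are hypothetical and are not existing perf/KVM
interfaces:

/*
 * Hypothetical sketch: a callback that perf invokes from the
 * __perf_event_enable() IPI path, *before* it writes any PMU MSRs.
 */
#include <linux/percpu.h>

void kvm_pmu_save_guest_state(void);	/* hypothetical KVM helper */
void kvm_pmu_load_host_state(void);	/* hypothetical KVM helper */

/* Set by KVM while the guest PMU state is loaded on this CPU. */
static DEFINE_PER_CPU(bool, guest_pmu_loaded);

void kvm_perf_guest_pmu_put(void)
{
	/* Nothing to do if the guest PMU state is not loaded here. */
	if (!__this_cpu_read(guest_pmu_loaded))
		return;

	kvm_pmu_save_guest_state();	/* stash guest counters/selectors */
	kvm_pmu_load_host_state();	/* restore host PMU MSRs */
	__this_cpu_write(guest_pmu_loaded, false);
}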

>
> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
> into KVM.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25 20:43                                                       ` Liang, Kan
  2024-04-25 21:46                                                         ` Sean Christopherson
@ 2024-04-26  1:50                                                         ` Mi, Dapeng
  1 sibling, 0 replies; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-26  1:50 UTC (permalink / raw)
  To: Liang, Kan, Mingwei Zhang
  Cc: Sean Christopherson, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/26/2024 4:43 AM, Liang, Kan wrote:
>
> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>
>>>
>>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>
>>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>>>>> switch location could be an option.
>>>>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>>>>> side events might be a good option. We can keep the discussion, but I
>>>>>>>> won't propose that in v2.
>>>>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>>>>> be at any time.
>>>>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>>>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>>>>
>>>>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>>>>> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
>>>>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>>>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>>>>> for the guest while host perf events are active.
>>>>> I see. So KVM needs to provide a callback which needs to be called in
>>>>> the IPI handler. The KVM callback needs to be called to switch PMU state
>>>>> before perf really enabling host event and touching PMU MSRs. And only
>>>>> the perf event with exclude_guest attribute is allowed to create on
>>>>> host. Thanks.
>>>> Do we really need a KVM callback? I think that is one option.
>>>>
>>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>>> deferring the context switch to the end of vPMU loop.
>>>>
>>>> Detecting if there are "host perf events" would be interesting. The
>>>> "host perf events" refer to the perf_events on the host that are
>>>> active and assigned with HW counters and that are saved when context
>>>> switching to the guest PMU. I think getting those events could be done
>>>> by fetching the bitmaps in cpuc.
>>> The cpuc is ARCH specific structure. I don't think it can be get in the
>>> generic code. You probably have to implement ARCH specific functions to
>>> fetch the bitmaps. It probably won't worth it.
>>>
>>> You may check the pinned_groups and flexible_groups to understand if
>>> there are host perf events which may be scheduled when VM-exit. But it
>>> will not tell the idx of the counters which can only be got when the
>>> host event is really scheduled.
>>>
>>>> I have to look into the details. But
>>>> at the time of VMEXIT, kvm should already have that information, so it
>>>> can immediately decide whether to do the PMU context switch or not.
>>>>
>>>> oh, but when the control is executing within the run loop, a
>>>> host-level profiling starts, say 'perf record -a ...', it will
>>>> generate an IPI to all CPUs. Maybe that's when we need a callback so
>>>> the KVM guest PMU context gets preempted for the host-level profiling.
>>>> Gah..
>>>>
>>>> hmm, not a fan of that. That means the host can poke the guest PMU
>>>> context at any time and cause higher overhead. But I admit it is much
>>>> better than the current approach.
>>>>
>>>> The only thing is that: any command like 'perf record/stat -a' shot in
>>>> dark corners of the host can preempt guest PMUs of _all_ running VMs.
>>>> So, to alleviate that, maybe a module parameter that disables this
>>>> "preemption" is possible? This should fit scenarios where we don't
>>>> want guest PMU to be preempted outside of the vCPU loop?
>>>>
>>> It should not happen. For the current implementation, perf rejects all
>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>> is running.
>>> However, it's possible to create an exclude_guest system-wide event at
>>> any time. KVM cannot use the information from the VM-entry to decide if
>>> there will be active perf events in the VM-exit.
>> Hmm, why not? If there is any exclude_guest system-wide event,
>> perf_guest_enter() can return something to tell KVM "hey, some active
>> host events are swapped out. they are originally in counter #2 and
>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>> that and keep it in its pmu data structure.
> I think it's possible that someone creates !exclude_guest event after
> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>
>> Now, when doing context switching back to host at just VMEXIT, KVM
>> will check this data and see if host perf context has something active
>> (of course, they are all exclude_guest events). If not, deferring the
>> context switch to vcpu boundary. Otherwise, do the proper PMU context
>> switching by respecting the occupied counter positions on the host
>> side, i.e., avoid doubling the work on the KVM side.
>>
>> Kan, any suggestion on the above approach?
> I think we can only know the accurate event list at perf_guest_exit().
> You may check the pinned_groups and flexible_groups, which tell if there
> are candidate events.
>
>> Totally understand that
>> there might be some difficulty, since perf subsystem works in several
>> layers and obviously fetching low-level mapping is arch specific work.
>> If that is difficult, we can split the work in two phases: 1) phase
>> #1, just ask perf to tell kvm if there are active exclude_guest events
>> swapped out; 2) phase #2, ask perf to tell their (low-level) counter
>> indices.
>>
> If you want an accurate counter mask, the changes in the arch specific
> code is required. Two phases sound good to me.
>
> Besides perf changes, I think the KVM should also track which counters
> need to be saved/restored. The information can be get from the EventSel
> interception.

Yes, that's another optimization from the guest's point of view. It's on
our to-do list.


>
> Thanks,
> Kan
>>> The perf_guest_exit() will reload the host state. It's impossible to
>>> save the guest state after that. We may need a KVM callback. So perf can
>>> tell KVM whether to save the guest state before perf reloads the host state.
>>>
>>> Thanks,
>>> Kan
>>>>>
>>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26  1:46                                                           ` Mi, Dapeng
@ 2024-04-26  3:12                                                             ` Mingwei Zhang
  2024-04-26  4:02                                                               ` Mi, Dapeng
  2024-04-26 14:09                                                               ` Liang, Kan
  0 siblings, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-26  3:12 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Kan Liang, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao

On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
> > On Thu, Apr 25, 2024, Kan Liang wrote:
> >> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> >>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>> It should not happen. For the current implementation, perf rejects all
> >>>> the !exclude_guest system-wide event creation if a guest with the vPMU
> >>>> is running.
> >>>> However, it's possible to create an exclude_guest system-wide event at
> >>>> any time. KVM cannot use the information from the VM-entry to decide if
> >>>> there will be active perf events in the VM-exit.
> >>> Hmm, why not? If there is any exclude_guest system-wide event,
> >>> perf_guest_enter() can return something to tell KVM "hey, some active
> >>> host events are swapped out. they are originally in counter #2 and
> >>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> >>> that and keep it in its pmu data structure.
> >> I think it's possible that someone creates !exclude_guest event after
> > I assume you mean an exclude_guest=1 event?  Because perf should be in a state
> > where it rejects exclude_guest=0 events.
>
> Suppose should be exclude_guest=1 event, the perf event without
> exclude_guest attribute would be blocked to create in the v2 patches
> which we are working on.
>
>
> >
> >> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> >> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> > Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> > is what I had in mind when I suggested a callback in my sketch:
> >
> >   :  D. Add a perf callback that is invoked from IRQ context when perf wants to
> >   :     configure a new PMU-based events, *before* actually programming the MSRs,
> >   :     and have KVM's callback put the guest PMU state
>
>
> when host creates a perf event with exclude_guest attribute which is
> used to profile KVM/VMM user space, the vCPU process could work at three
> places.
>
> 1. in guest state (non-root mode)
>
> 2. inside vcpu-loop
>
> 3. outside vcpu-loop
>
> Since the PMU state has already been switched to host state, we don't
> need to consider the case 3 and only care about cases 1 and 2.
>
> when host creates a perf event with exclude_guest attribute to profile
> KVM/VMM user space,  an IPI is triggered to enable the perf event
> eventually like the following code shows.
>
> event_function_call(event, __perf_event_enable, NULL);
>
> For case 1,  a vm-exit is triggered and KVM starts to process the
> vm-exit and then run IPI irq handler, exactly speaking
> __perf_event_enable() to enable the perf event.
>
> For case 2, the IPI irq handler would preempt the vcpu-loop and call
> __perf_event_enable() to enable the perf event.
>
> So IMO KVM just needs to provide a callback to switch guest/host PMU
> state, and __perf_event_enable() calls this callback before really
> touching PMU MSRs.

OK. In this case, do we still need KVM to query perf for active
exclude_guest events? Yes, because there is an ordering issue. The above
covers the case where host-level perf profiling starts while a VM is
already running: an IPI invokes the callback, triggers preemption, and
KVM switches the context from guest to host. What if it is the other way
around, i.e., host-level profiling runs first and then the VM runs?

In this case, just before entering the vcpu loop, KVM should check
whether there is an active host event and save that into a pmu data
structure. If there is none, do the context switch early (so that KVM
saves a huge number of unnecessary PMU context switches later).
Otherwise, keep the host PMU context until VM-enter. At the time of
VM-exit, do the check again using the data stored in the pmu structure.
If there is an active event, do the context switch to the host PMU;
otherwise defer that until exiting the vcpu loop. Of course, in the
meantime, if any perf profiling is started and causes the IPI, the irq
handler calls the callback, preempting the guest PMU context. If that
happens, the PMU context switch at the vcpu-loop exit boundary is
skipped since it has already been done. Note that the irq could come at
any time, so the PMU context switch at all 4 locations needs to check
the state flag (and skip the context switch if needed).

So this requires vcpu->pmu to carry two pieces of state information: 1) a
flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1:
just a boolean; phase #2: a bitmap of occupied counters).
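
As a purely illustrative sketch (these field names are made up and not
part of the posted RFC), the per-vCPU state could look like:

#include <linux/types.h>

/* Hypothetical per-vCPU PMU switch state, attached to vcpu->pmu. */
struct kvm_pmu_switch_state {
	bool guest_state_loaded;	/* analogue of TIF_NEED_FPU_LOAD */
	bool host_has_active_events;	/* phase #1: boolean only */
	u64  host_counter_bitmap;	/* phase #2: counters owned by host perf */
};

All 4 switch locations (and the IRQ-time callback) would test
guest_state_loaded first and skip the switch if it has already been done.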

This is a non-trivial optimization on the PMU context switch. I am
thinking about splitting them into the following phases:

1) lazy PMU context switch, i.e., wait until the guest touches a PMU MSR
for the 1st time.
2) fast PMU context switch on the KVM side, i.e., KVM checks the event
selector values (enable/disable) and selectively switches PMU state
(reducing MSR reads/writes).
3) dynamic PMU context switch boundary, i.e., KVM can dynamically choose
the PMU context switch boundary depending on existing active host-level
events.
3.1) more accurate dynamic PMU context switch, i.e., KVM checks the
host-level counter positions and further reduces the number of MSR
accesses.
4) guest PMU context preemption, i.e., any new host-level perf
profiling can immediately preempt the guest PMU in the vcpu loop
(instead of waiting for the next PMU context switch in KVM).

Thanks.
-Mingwei
>
> >
> > It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
> > kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
> > into KVM.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26  3:12                                                             ` Mingwei Zhang
@ 2024-04-26  4:02                                                               ` Mi, Dapeng
  2024-04-26  4:46                                                                 ` Mingwei Zhang
  2024-04-26 14:09                                                               ` Liang, Kan
  1 sibling, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-26  4:02 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Kan Liang, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao


On 4/26/2024 11:12 AM, Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
>>> On Thu, Apr 25, 2024, Kan Liang wrote:
>>>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>> It should not happen. For the current implementation, perf rejects all
>>>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>>>> is running.
>>>>>> However, it's possible to create an exclude_guest system-wide event at
>>>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>>>> there will be active perf events in the VM-exit.
>>>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>>>> host events are swapped out. they are originally in counter #2 and
>>>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>>>> that and keep it in its pmu data structure.
>>>> I think it's possible that someone creates !exclude_guest event after
>>> I assume you mean an exclude_guest=1 event?  Because perf should be in a state
>>> where it rejects exclude_guest=0 events.
>> Suppose should be exclude_guest=1 event, the perf event without
>> exclude_guest attribute would be blocked to create in the v2 patches
>> which we are working on.
>>
>>
>>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
>>> is what I had in mind when I suggested a callback in my sketch:
>>>
>>>    :  D. Add a perf callback that is invoked from IRQ context when perf wants to
>>>    :     configure a new PMU-based events, *before* actually programming the MSRs,
>>>    :     and have KVM's callback put the guest PMU state
>>
>> when host creates a perf event with exclude_guest attribute which is
>> used to profile KVM/VMM user space, the vCPU process could work at three
>> places.
>>
>> 1. in guest state (non-root mode)
>>
>> 2. inside vcpu-loop
>>
>> 3. outside vcpu-loop
>>
>> Since the PMU state has already been switched to host state, we don't
>> need to consider the case 3 and only care about cases 1 and 2.
>>
>> when host creates a perf event with exclude_guest attribute to profile
>> KVM/VMM user space,  an IPI is triggered to enable the perf event
>> eventually like the following code shows.
>>
>> event_function_call(event, __perf_event_enable, NULL);
>>
>> For case 1,  a vm-exit is triggered and KVM starts to process the
>> vm-exit and then run IPI irq handler, exactly speaking
>> __perf_event_enable() to enable the perf event.
>>
>> For case 2, the IPI irq handler would preempt the vcpu-loop and call
>> __perf_event_enable() to enable the perf event.
>>
>> So IMO KVM just needs to provide a callback to switch guest/host PMU
>> state, and __perf_event_enable() calls this callback before really
>> touching PMU MSRs.
> ok, in this case, do we still need KVM to query perf if there are
> active exclude_guest events? yes? Because there is an ordering issue.
> The above suggests that the host-level perf profiling comes when a VM
> is already running, there is an IPI that can invoke the callback and
> trigger preemption. In this case, KVM should switch the context from
> guest to host. What if it is the other way around, ie., host-level
> profiling runs first and then VM runs?
>
> In this case, just before entering the vcpu loop, kvm should check
> whether there is an active host event and save that into a pmu data
> structure. If none, do the context switch early (so that KVM saves a
> huge amount of unnecessary PMU context switches in the future).
> Otherwise, keep the host PMU context until vm-enter. At the time of
> vm-exit, do the check again using the data stored in pmu structure. If
> there is an active event do the context switch to the host PMU,
> otherwise defer that until exiting the vcpu loop. Of course, in the
> meantime, if there is any perf profiling started causing the IPI, the
> irq handler calls the callback, preempting the guest PMU context. If
> that happens, at the time of exiting the vcpu boundary, PMU context
> switch is skipped since it is already done. Of course, note that the
> irq could come at any time, so the PMU context switch in all 4
> locations need to check the state flag (and skip the context switch if
> needed).
>
> So this requires vcpu->pmu has two pieces of state information: 1) the
> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> just a boolean; phase #2, bitmap of occupied counters).

I still haven't had a chance to look at the details of the FPU context
switch implementation, so I don't yet know exactly what we need on the
vPMU side, a flag or a callback. Anyway, those are implementation
details; we can look at them when we start implementing.

>
> This is a non-trivial optimization on the PMU context switch. I am
> thinking about splitting them into the following phases:
>
> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> for the 1st time.
> 2) fast PMU context switch on KVM side, i.e., KVM checking event
> selector value (enable/disable) and selectively switch PMU state
> (reducing rd/wr msrs)
> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> context switch boundary depending on existing active host-level
> events.
> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> host-level counter position and further reduces the number of msr
> accesses.
> 4) guest PMU context preemption, i.e., any new host-level perf
> profiling can immediately preempt the guest PMU in the vcpu loop
> (instead of waiting for the next PMU context switch in KVM).

Great! We now have a clear overall picture of the optimizations. BTW,
optimizations 1 and 2 are already on our original to-do list. We plan to
do them after RFC v2 is ready.


>
> Thanks.
> -Mingwei
>>> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
>>> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
>>> into KVM.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces
  2024-04-11 19:53     ` Liang, Kan
  2024-04-12 19:17       ` Sean Christopherson
@ 2024-04-26  4:09       ` Zhang, Xiong Y
  1 sibling, 0 replies; 181+ messages in thread
From: Zhang, Xiong Y @ 2024-04-26  4:09 UTC (permalink / raw)
  To: Liang, Kan, Sean Christopherson
  Cc: pbonzini, peterz, mizhang, kan.liang, zhenyuw, dapeng1.mi,
	jmattson, kvm, linux-perf-users, linux-kernel, zhiyuan.lv,
	eranian, irogers, samantha.alt, like.xu.linux, chao.gao


>>> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
>>> +						 int cpu, struct task_struct *task)
>>> +{
>>> +	bool *force_exclude_guest = NULL;
>>> +
>>> +	if (!has_vpmu_passthrough_cap(event->pmu))
>>> +		return 0;
>>> +
>>> +	if (event->attr.exclude_guest)
>>> +		return 0;
>>> +
>>> +	if (cpu != -1) {
>>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
>>> +	} else if (task && (task->flags & PF_VCPU)) {
>>> +		/*
>>> +		 * Just need to check the running CPU in the event creation. If the
>>> +		 * task is moved to another CPU which supports the force_exclude_guest.
>>> +		 * The event will filtered out and be moved to the error stage. See
>>> +		 * merge_sched_in().
>>> +		 */
>>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
>>> +	}
>>
>> These checks are extremely racy, I don't see how this can possibly do the
>> right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
>> to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
>> include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
>> time slices).
>>
> 
> This is to reject an !exclude_guest event creation for a running
> "passthrough" guest from host perf tool.
> Could you please suggest a way to detect it via the struct task_struct?
Here PF_VCPU is used to identify a perf event that profiles a userspace
VMM process, like perf record -e {} -p $QEMU_PID. Many emails have
discussed how to handle system-wide perf events, which have
perf_event.attr.task == NULL. But a perf event for a userspace VMM should
be handled the same way as a system-wide perf event, so perf needs a
method to identify that a per-process perf event targets a userspace VMM.
PF_VCPU isn't the right one, so how to handle this remains an open
question.

thanks
> 
> 
>> Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
>> perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
>> per-CPU context mutex.
> 
> Do you mean that the perf_guest_enter() check could be happened right
> after the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
> 
> The perf_force_exclude_guest_check() is to make sure most of the cases
> can be rejected at the creation place. For the corner cases, they will
> be rejected in the schedule stage.
> 
>>
>>> +	if (force_exclude_guest && *force_exclude_guest)
>>> +		return -EBUSY;
>>> +	return 0;
>>> +}
>>> +
>>>  /*
>>>   * Holding the top-level event's child_mutex means that any
>>>   * descendant process that has inherited this event will block
>>> @@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>>  		goto err_ns;
>>>  	}
>>>  
>>> +	if (perf_force_exclude_guest_check(event, cpu, task)) {
>>
>> This should be:
>>
>> 	err = perf_force_exclude_guest_check(event, cpu, task);
>> 	if (err)
>> 		goto err_pmu;
>>
>> i.e. shouldn't effectively ignore/override the return result.
>>
> 
> Sure.
> 
> Thanks,
> Kan
> 
>>> +		err = -EBUSY;
>>> +		goto err_pmu;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
>>>  	 * events (they don't make sense as the cgroup will be different
>>> -- 
>>> 2.34.1
>>>
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26  4:02                                                               ` Mi, Dapeng
@ 2024-04-26  4:46                                                                 ` Mingwei Zhang
  0 siblings, 0 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-26  4:46 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Kan Liang, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao

On Thu, Apr 25, 2024 at 9:03 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/26/2024 11:12 AM, Mingwei Zhang wrote:
> > On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
> >>> On Thu, Apr 25, 2024, Kan Liang wrote:
> >>>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> >>>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>>> It should not happen. For the current implementation, perf rejects all
> >>>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
> >>>>>> is running.
> >>>>>> However, it's possible to create an exclude_guest system-wide event at
> >>>>>> any time. KVM cannot use the information from the VM-entry to decide if
> >>>>>> there will be active perf events in the VM-exit.
> >>>>> Hmm, why not? If there is any exclude_guest system-wide event,
> >>>>> perf_guest_enter() can return something to tell KVM "hey, some active
> >>>>> host events are swapped out. they are originally in counter #2 and
> >>>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> >>>>> that and keep it in its pmu data structure.
> >>>> I think it's possible that someone creates !exclude_guest event after
> >>> I assume you mean an exclude_guest=1 event?  Because perf should be in a state
> >>> where it rejects exclude_guest=0 events.
> >> Suppose should be exclude_guest=1 event, the perf event without
> >> exclude_guest attribute would be blocked to create in the v2 patches
> >> which we are working on.
> >>
> >>
> >>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> >>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> >>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> >>> is what I had in mind when I suggested a callback in my sketch:
> >>>
> >>>    :  D. Add a perf callback that is invoked from IRQ context when perf wants to
> >>>    :     configure a new PMU-based events, *before* actually programming the MSRs,
> >>>    :     and have KVM's callback put the guest PMU state
> >>
> >> when host creates a perf event with exclude_guest attribute which is
> >> used to profile KVM/VMM user space, the vCPU process could work at three
> >> places.
> >>
> >> 1. in guest state (non-root mode)
> >>
> >> 2. inside vcpu-loop
> >>
> >> 3. outside vcpu-loop
> >>
> >> Since the PMU state has already been switched to host state, we don't
> >> need to consider the case 3 and only care about cases 1 and 2.
> >>
> >> when host creates a perf event with exclude_guest attribute to profile
> >> KVM/VMM user space,  an IPI is triggered to enable the perf event
> >> eventually like the following code shows.
> >>
> >> event_function_call(event, __perf_event_enable, NULL);
> >>
> >> For case 1,  a vm-exit is triggered and KVM starts to process the
> >> vm-exit and then run IPI irq handler, exactly speaking
> >> __perf_event_enable() to enable the perf event.
> >>
> >> For case 2, the IPI irq handler would preempt the vcpu-loop and call
> >> __perf_event_enable() to enable the perf event.
> >>
> >> So IMO KVM just needs to provide a callback to switch guest/host PMU
> >> state, and __perf_event_enable() calls this callback before really
> >> touching PMU MSRs.
> > ok, in this case, do we still need KVM to query perf if there are
> > active exclude_guest events? yes? Because there is an ordering issue.
> > The above suggests that the host-level perf profiling comes when a VM
> > is already running, there is an IPI that can invoke the callback and
> > trigger preemption. In this case, KVM should switch the context from
> > guest to host. What if it is the other way around, ie., host-level
> > profiling runs first and then VM runs?
> >
> > In this case, just before entering the vcpu loop, kvm should check
> > whether there is an active host event and save that into a pmu data
> > structure. If none, do the context switch early (so that KVM saves a
> > huge amount of unnecessary PMU context switches in the future).
> > Otherwise, keep the host PMU context until vm-enter. At the time of
> > vm-exit, do the check again using the data stored in pmu structure. If
> > there is an active event do the context switch to the host PMU,
> > otherwise defer that until exiting the vcpu loop. Of course, in the
> > meantime, if there is any perf profiling started causing the IPI, the
> > irq handler calls the callback, preempting the guest PMU context. If
> > that happens, at the time of exiting the vcpu boundary, PMU context
> > switch is skipped since it is already done. Of course, note that the
> > irq could come at any time, so the PMU context switch in all 4
> > locations need to check the state flag (and skip the context switch if
> > needed).
> >
> > So this requires vcpu->pmu has two pieces of state information: 1) the
> > flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> > just a boolean; phase #2, bitmap of occupied counters).
>
> I still had no chance to look at the details about vFPU implementation,
> currently I have no idea what we need exactly on vPMU side, a flag or a
> callback. Anyway, that's just implementation details, we can look at it
> when starting to implement it.

I think both. The flag helps to decide whether the context switch has
already been done. The callback will always trigger the context
switch, but the context switch code should always check if the switch
has already been done.

The FPU context switch is similar but slightly different: it is done at
the host-level context switch boundary, and can even be deferred past
that boundary as long as the next process/thread does not use the FPU
and/or does not return to userspace. I don't think we want to defer the
PMU switch that far. Instead, the PMU context switch should still happen
within the scope of KVM.

>
> >
> > This is a non-trivial optimization on the PMU context switch. I am
> > thinking about splitting them into the following phases:
> >
> > 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> > for the 1st time.
> > 2) fast PMU context switch on KVM side, i.e., KVM checking event
> > selector value (enable/disable) and selectively switch PMU state
> > (reducing rd/wr msrs)
> > 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> > context switch boundary depending on existing active host-level
> > events.
> > 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> > host-level counter position and further reduces the number of msr
> > accesses.
> > 4) guest PMU context preemption, i.e., any new host-level perf
> > profiling can immediately preempt the guest PMU in the vcpu loop
> > (instead of waiting for the next PMU context switch in KVM).
>
> Great! we have a whole clear picture about the optimization right now.
> BTW, the optimization 1 and 2 are already on our original to-do list. We
> plan to do it after RFC v2 is ready.
>

I am going to summarize that into a design doc. This thread is already 50
emails long, and I am sure no one has the patience to read our garbage
unless they were involved from the very beginning :)

Any of the implementations are very welcome. 1) and 2) are low-hanging
fruit and we can finish them quickly after v2. 3) and 4) are error prone
and need further discussion, so let's not rush them.

On the other hand, how we test this is a question we need to think about.

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-25 21:46                                                         ` Sean Christopherson
  2024-04-26  1:46                                                           ` Mi, Dapeng
@ 2024-04-26 13:53                                                           ` Liang, Kan
  1 sibling, 0 replies; 181+ messages in thread
From: Liang, Kan @ 2024-04-26 13:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Dapeng Mi, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-25 5:46 p.m., Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Kan Liang wrote:
>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> It should not happen. For the current implementation, perf rejects all
>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>> is running.
>>>> However, it's possible to create an exclude_guest system-wide event at
>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>> there will be active perf events in the VM-exit.
>>>
>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>> host events are swapped out. they are originally in counter #2 and
>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>> that and keep it in its pmu data structure.
>>
>> I think it's possible that someone creates !exclude_guest event after
> 
> I assume you mean an exclude_guest=1 event?  Because perf should be in a state
> where it rejects exclude_guest=0 events.
>

Right.

>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> 
> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> is what I had in mind when I suggested a callback in my sketch:
> 
>  :  D. Add a perf callback that is invoked from IRQ context when perf wants to
>  :     configure a new PMU-based events, *before* actually programming the MSRs,
>  :     and have KVM's callback put the guest PMU state
> 
> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
> into KVM.

Yes, a callback should be required. I think it should be invoked right
before switching back to the host perf events, so that there is an
accurate list of active events.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26  3:12                                                             ` Mingwei Zhang
  2024-04-26  4:02                                                               ` Mi, Dapeng
@ 2024-04-26 14:09                                                               ` Liang, Kan
  2024-04-26 18:41                                                                 ` Mingwei Zhang
  1 sibling, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-26 14:09 UTC (permalink / raw)
  To: Mingwei Zhang, Mi, Dapeng
  Cc: Sean Christopherson, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-25 11:12 p.m., Mingwei Zhang wrote:
>>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
>>> is what I had in mind when I suggested a callback in my sketch:
>>>
>>>   :  D. Add a perf callback that is invoked from IRQ context when perf wants to
>>>   :     configure a new PMU-based events, *before* actually programming the MSRs,
>>>   :     and have KVM's callback put the guest PMU state
>>
>> when host creates a perf event with exclude_guest attribute which is
>> used to profile KVM/VMM user space, the vCPU process could work at three
>> places.
>>
>> 1. in guest state (non-root mode)
>>
>> 2. inside vcpu-loop
>>
>> 3. outside vcpu-loop
>>
>> Since the PMU state has already been switched to host state, we don't
>> need to consider the case 3 and only care about cases 1 and 2.
>>
>> when host creates a perf event with exclude_guest attribute to profile
>> KVM/VMM user space,  an IPI is triggered to enable the perf event
>> eventually like the following code shows.
>>
>> event_function_call(event, __perf_event_enable, NULL);
>>
>> For case 1,  a vm-exit is triggered and KVM starts to process the
>> vm-exit and then run IPI irq handler, exactly speaking
>> __perf_event_enable() to enable the perf event.
>>
>> For case 2, the IPI irq handler would preempt the vcpu-loop and call
>> __perf_event_enable() to enable the perf event.
>>
>> So IMO KVM just needs to provide a callback to switch guest/host PMU
>> state, and __perf_event_enable() calls this callback before really
>> touching PMU MSRs.
> ok, in this case, do we still need KVM to query perf if there are
> active exclude_guest events? yes? Because there is an ordering issue.
> The above suggests that the host-level perf profiling comes when a VM
> is already running, there is an IPI that can invoke the callback and
> trigger preemption. In this case, KVM should switch the context from
> guest to host. What if it is the other way around, ie., host-level
> profiling runs first and then VM runs?
> 
> In this case, just before entering the vcpu loop, kvm should check
> whether there is an active host event and save that into a pmu data
> structure. 

KVM doesn't need to save/restore the host state. Host perf has that
information and will reload the values whenever the host events are
rescheduled. But I think KVM should clear the registers used by the host
to prevent their values from leaking to the guest.

> If none, do the context switch early (so that KVM saves a
> huge amount of unnecessary PMU context switches in the future).
> Otherwise, keep the host PMU context until vm-enter. At the time of
> vm-exit, do the check again using the data stored in pmu structure. If
> there is an active event do the context switch to the host PMU,
> otherwise defer that until exiting the vcpu loop. Of course, in the
> meantime, if there is any perf profiling started causing the IPI, the
> irq handler calls the callback, preempting the guest PMU context. If
> that happens, at the time of exiting the vcpu boundary, PMU context
> switch is skipped since it is already done. Of course, note that the
> irq could come at any time, so the PMU context switch in all 4
> locations need to check the state flag (and skip the context switch if
> needed).
> 
> So this requires vcpu->pmu has two pieces of state information: 1) the
> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> just a boolean; phase #2, bitmap of occupied counters).
> 
> This is a non-trivial optimization on the PMU context switch. I am
> thinking about splitting them into the following phases:
> 
> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> for the 1st time.
> 2) fast PMU context switch on KVM side, i.e., KVM checking event
> selector value (enable/disable) and selectively switch PMU state
> (reducing rd/wr msrs)
> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> context switch boundary depending on existing active host-level
> events.
> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> host-level counter position and further reduces the number of msr
> accesses.
> 4) guest PMU context preemption, i.e., any new host-level perf
> profiling can immediately preempt the guest PMU in the vcpu loop
> (instead of waiting for the next PMU context switch in KVM).

I'm not quite sure about 4.
The new host-level perf event must be an exclude_guest event. It should
not be scheduled when a guest is using the PMU. Why do we want to preempt
the guest PMU? The current implementation in perf doesn't schedule any
exclude_guest events while a guest is running.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26 14:09                                                               ` Liang, Kan
@ 2024-04-26 18:41                                                                 ` Mingwei Zhang
  2024-04-26 19:06                                                                   ` Liang, Kan
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-26 18:41 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mi, Dapeng, Sean Christopherson, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao

On Fri, Apr 26, 2024 at 7:10 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2024-04-25 11:12 p.m., Mingwei Zhang wrote:
> >>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> >>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> >>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> >>> is what I had in mind when I suggested a callback in my sketch:
> >>>
> >>>   :  D. Add a perf callback that is invoked from IRQ context when perf wants to
> >>>   :     configure a new PMU-based events, *before* actually programming the MSRs,
> >>>   :     and have KVM's callback put the guest PMU state
> >>
> >> when host creates a perf event with exclude_guest attribute which is
> >> used to profile KVM/VMM user space, the vCPU process could work at three
> >> places.
> >>
> >> 1. in guest state (non-root mode)
> >>
> >> 2. inside vcpu-loop
> >>
> >> 3. outside vcpu-loop
> >>
> >> Since the PMU state has already been switched to host state, we don't
> >> need to consider the case 3 and only care about cases 1 and 2.
> >>
> >> when host creates a perf event with exclude_guest attribute to profile
> >> KVM/VMM user space,  an IPI is triggered to enable the perf event
> >> eventually like the following code shows.
> >>
> >> event_function_call(event, __perf_event_enable, NULL);
> >>
> >> For case 1,  a vm-exit is triggered and KVM starts to process the
> >> vm-exit and then run IPI irq handler, exactly speaking
> >> __perf_event_enable() to enable the perf event.
> >>
> >> For case 2, the IPI irq handler would preempt the vcpu-loop and call
> >> __perf_event_enable() to enable the perf event.
> >>
> >> So IMO KVM just needs to provide a callback to switch guest/host PMU
> >> state, and __perf_event_enable() calls this callback before really
> >> touching PMU MSRs.
> > ok, in this case, do we still need KVM to query perf if there are
> > active exclude_guest events? yes? Because there is an ordering issue.
> > The above suggests that the host-level perf profiling comes when a VM
> > is already running, there is an IPI that can invoke the callback and
> > trigger preemption. In this case, KVM should switch the context from
> > guest to host. What if it is the other way around, ie., host-level
> > profiling runs first and then VM runs?
> >
> > In this case, just before entering the vcpu loop, kvm should check
> > whether there is an active host event and save that into a pmu data
> > structure.
>
> KVM doesn't need to save/restore the host state. Host perf has the
> information and will reload the values whenever the host events are
> rescheduled. But I think KVM should clear the registers used by the host
> to prevent the value leaks to the guest.

Right, KVM needs to know about the host state to optimize its own PMU
context switch. If the host is using the counter at index, say, 1, then
KVM may not need to zap the value of counter #1, since the perf side will
overwrite it anyway.
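
A rough sketch of what that could look like on the KVM side (purely
illustrative; the function name is hypothetical and the host counter
bitmap is assumed to be provided by perf, as in phase #2 above):

#include <linux/bits.h>
#include <asm/msr.h>

/* Zero only the GP counters that host perf will not reprogram itself. */
static void kvm_pmu_zap_unused_counters(u64 host_counter_bitmap,
					int nr_gp_counters)
{
	int i;

	for (i = 0; i < nr_gp_counters; i++) {
		/* Host perf will overwrite this counter anyway; skip it. */
		if (host_counter_bitmap & BIT_ULL(i))
			continue;

		wrmsrl(MSR_IA32_PMC0 + i, 0);
	}
}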

>
> > If none, do the context switch early (so that KVM saves a
> > huge amount of unnecessary PMU context switches in the future).
> > Otherwise, keep the host PMU context until vm-enter. At the time of
> > vm-exit, do the check again using the data stored in pmu structure. If
> > there is an active event do the context switch to the host PMU,
> > otherwise defer that until exiting the vcpu loop. Of course, in the
> > meantime, if there is any perf profiling started causing the IPI, the
> > irq handler calls the callback, preempting the guest PMU context. If
> > that happens, at the time of exiting the vcpu boundary, PMU context
> > switch is skipped since it is already done. Of course, note that the
> > irq could come at any time, so the PMU context switch in all 4
> > locations need to check the state flag (and skip the context switch if
> > needed).
> >
> > So this requires vcpu->pmu has two pieces of state information: 1) the
> > flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> > just a boolean; phase #2, bitmap of occupied counters).
> >
> > This is a non-trivial optimization on the PMU context switch. I am
> > thinking about splitting them into the following phases:
> >
> > 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> > for the 1st time.
> > 2) fast PMU context switch on KVM side, i.e., KVM checking event
> > selector value (enable/disable) and selectively switch PMU state
> > (reducing rd/wr msrs)
> > 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> > context switch boundary depending on existing active host-level
> > events.
> > 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> > host-level counter position and further reduces the number of msr
> > accesses.
> > 4) guest PMU context preemption, i.e., any new host-level perf
> > profiling can immediately preempt the guest PMU in the vcpu loop
> > (instead of waiting for the next PMU context switch in KVM).
>
> I'm not quit sure about the 4.
> The new host-level perf must be an exclude_guest event. It should not be
> scheduled when a guest is using the PMU. Why do we want to preempt the
> guest PMU? The current implementation in perf doesn't schedule any
> exclude_guest events when a guest is running.

right. The grey area is the code within the KVM_RUN loop, but
_outside_ of the guest. This part of the code is on the "host" side.
However, for efficiency reasons, KVM defers the PMU context switch by
retaining the guest PMU MSR values within the loop. Optimization 4
allows the host side to profile this part immediately instead of
waiting for the vcpu to reach the PMU context switch locations. Doing so
will generate more accurate results.

Do we want to preempt that? I think it depends. For regular cloud
usage, we don't. But for any other usage where we want to prioritize
KVM/VMM profiling over the guest vPMU, it is useful.

My current opinion is that optimization 4 is something nice to have.
But we should allow people to turn it off, just like we can choose to
disable kernel preemption.

Thanks.
-Mingwei
>
> Thanks,
> Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26 18:41                                                                 ` Mingwei Zhang
@ 2024-04-26 19:06                                                                   ` Liang, Kan
  2024-04-26 19:46                                                                     ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Liang, Kan @ 2024-04-26 19:06 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Mi, Dapeng, Sean Christopherson, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao



On 2024-04-26 2:41 p.m., Mingwei Zhang wrote:
>>> So this requires vcpu->pmu has two pieces of state information: 1) the
>>> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
>>> just a boolean; phase #2, bitmap of occupied counters).
>>>
>>> This is a non-trivial optimization on the PMU context switch. I am
>>> thinking about splitting them into the following phases:
>>>
>>> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
>>> for the 1st time.
>>> 2) fast PMU context switch on KVM side, i.e., KVM checking event
>>> selector value (enable/disable) and selectively switch PMU state
>>> (reducing rd/wr msrs)
>>> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
>>> context switch boundary depending on existing active host-level
>>> events.
>>> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
>>> host-level counter position and further reduces the number of msr
>>> accesses.
>>> 4) guest PMU context preemption, i.e., any new host-level perf
>>> profiling can immediately preempt the guest PMU in the vcpu loop
>>> (instead of waiting for the next PMU context switch in KVM).
>> I'm not quit sure about the 4.
>> The new host-level perf must be an exclude_guest event. It should not be
>> scheduled when a guest is using the PMU. Why do we want to preempt the
>> guest PMU? The current implementation in perf doesn't schedule any
>> exclude_guest events when a guest is running.
> right. The grey area is the code within the KVM_RUN loop, but
> _outside_ of the guest. This part of the code is on the "host" side.
> However, for efficiency reasons, KVM defers the PMU context switch by
> retaining the guest PMU MSR values within the loop. 

I assume you mean the optimization of moving the context switch from
VM-exit/entry boundary to the vCPU boundary.

> Optimization 4
> allows the host side to immediately profiling this part instead of
> waiting for vcpu to reach to PMU context switch locations. Doing so
> will generate more accurate results.

If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
definition of exclude_guest. Without 4, it introduces some random blind
spots, right?

> 
> Do we want to preempt that? I think it depends. For regular cloud
> usage, we don't. But for any other usages where we want to prioritize
> KVM/VMM profiling over guest vPMU, it is useful.
> 
> My current opinion is that optimization 4 is something nice to have.
> But we should allow people to turn it off just like we could choose to
> disable preempt kernel.

exclude_guest means everything but the guest. I don't see a reason why
people would want to turn it off and get random blind spots.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26 19:06                                                                   ` Liang, Kan
@ 2024-04-26 19:46                                                                     ` Sean Christopherson
  2024-04-27  3:04                                                                       ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-26 19:46 UTC (permalink / raw)
  To: Kan Liang
  Cc: Mingwei Zhang, Dapeng Mi, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Fri, Apr 26, 2024, Kan Liang wrote:
> > Optimization 4
> > allows the host side to immediately profiling this part instead of
> > waiting for vcpu to reach to PMU context switch locations. Doing so
> > will generate more accurate results.
> 
> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
> definition of the exclude_guest. Without 4, it brings some random blind
> spots, right?

+1, I view it as a hard requirement.  It's not an optimization, it's about
accuracy and functional correctness.

What _is_ an optimization is keeping guest state loaded while KVM is in its
run loop, i.e. initial mediated/passthrough PMU support could land upstream with
unconditional switches at entry/exit.  The performance of KVM would likely be
unacceptable for any production use cases, but that would give us motivation to
finish the job, and it doesn't result in random, hard to diagnose issues for
userspace.
 
> > Do we want to preempt that? I think it depends. For regular cloud
> > usage, we don't. But for any other usages where we want to prioritize
> > KVM/VMM profiling over guest vPMU, it is useful.
> > 
> > My current opinion is that optimization 4 is something nice to have.
> > But we should allow people to turn it off just like we could choose to
> > disable preempt kernel.
> 
> The exclude_guest means everything but the guest. I don't see a reason
> why people want to turn it off and get some random blind spots.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-26 19:46                                                                     ` Sean Christopherson
@ 2024-04-27  3:04                                                                       ` Mingwei Zhang
  2024-04-28  0:58                                                                         ` Mi, Dapeng
  2024-04-29 13:08                                                                         ` Liang, Kan
  0 siblings, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-27  3:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kan Liang, Dapeng Mi, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 26, 2024, Kan Liang wrote:
> > > Optimization 4
> > > allows the host side to immediately profile this part instead of
> > > waiting for the vcpu to reach the PMU context switch locations. Doing so
> > > will generate more accurate results.
> >
> > If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
> > definition of exclude_guest. Without 4, it brings some random blind
> > spots, right?
>
> +1, I view it as a hard requirement.  It's not an optimization, it's about
> accuracy and functional correctness.

Well, does it have to be a _hard_ requirement? No? The irq handler
triggered by "perf record -a" could just inject a "state". Instead of
immediately preempting the guest PMU context, the perf subsystem could
allow KVM to defer the context switch until it reaches the next PMU
context switch location.

This is the same as the kernel preemption logic. Do you want me to
stop the work immediately? Yes (if you enable preemption), or no, let
me finish my job and get to the scheduling point.

Implementing this might be more difficult to debug. That's my real
concern. If we do not enable preemption, the PMU context switch will
only happen at the 2 pairs of locations. If we enable preemption, it
could happen at any time.
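
Just to make the "inject a state and defer" idea above concrete, a minimal
sketch (the flag and helpers below are made up for illustration; nothing
like this exists in the series):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Set from the (hypothetical) host perf IRQ path instead of swapping
   * PMU state immediately. */
  static atomic_bool host_pmu_switch_requested;

  static void request_host_pmu(void)
  {
          atomic_store(&host_pmu_switch_requested, true);
  }

  /* Checked by KVM at its existing PMU context switch locations; the
   * guest PMU context is only swapped out once this returns true. */
  static bool need_host_pmu_switch(void)
  {
          return atomic_exchange(&host_pmu_switch_requested, false);
  }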

>
> What _is_ an optimization is keeping guest state loaded while KVM is in its
> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
> unconditional switches at entry/exit.  The performance of KVM would likely be
> unacceptable for any production use cases, but that would give us motivation to
> finish the job, and it doesn't result in random, hard to diagnose issues for
> userspace.

That's true. I agree with that.

>
> > > Do we want to preempt that? I think it depends. For regular cloud
> > > usage, we don't. But for any other usages where we want to prioritize
> > > KVM/VMM profiling over guest vPMU, it is useful.
> > >
> > > My current opinion is that optimization 4 is something nice to have.
> > > But we should allow people to turn it off just like we could choose to
> > > disable preempt kernel.
> >
> > The exclude_guest means everything but the guest. I don't see a reason
> > why people want to turn it off and get some random blind spots.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-27  3:04                                                                       ` Mingwei Zhang
@ 2024-04-28  0:58                                                                         ` Mi, Dapeng
  2024-04-28  6:01                                                                           ` Mingwei Zhang
  2024-04-29 13:08                                                                         ` Liang, Kan
  1 sibling, 1 reply; 181+ messages in thread
From: Mi, Dapeng @ 2024-04-28  0:58 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson
  Cc: Kan Liang, maobibo, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao


On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@google.com> wrote:
>> On Fri, Apr 26, 2024, Kan Liang wrote:
>>>> Optimization 4
>>>> allows the host side to immediately profile this part instead of
>>>> waiting for the vcpu to reach the PMU context switch locations. Doing so
>>>> will generate more accurate results.
>>> If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
>>> definition of exclude_guest. Without 4, it brings some random blind
>>> spots, right?
>> +1, I view it as a hard requirement.  It's not an optimization, it's about
>> accuracy and functional correctness.
> Well. Does it have to be a _hard_ requirement? no? The irq handler
> triggered by "perf record -a" could just inject a "state". Instead of
> immediately preempting the guest PMU context, perf subsystem could
> allow KVM defer the context switch when it reaches the next PMU
> context switch location.
>
> This is the same as the preemption kernel logic. Do you want me to
> stop the work immediately? Yes (if you enable preemption), or No, let
> me finish my job and get to the scheduling point.
>
> Implementing this might be more difficult to debug. That's my real
> concern. If we do not enable preemption, the PMU context switch will
> only happen at the 2 pairs of locations. If we enable preemption, it
> could happen at any time.

IMO I'd prefer not to add a switch to enable/disable the preemption. I
think the current implementation is already complicated enough, and it is
unnecessary to introduce a new parameter that could confuse users.
Furthermore, the switch could introduce uncertainty and may mislead perf
users into reading the perf stats incorrectly.  As for debugging, it won't
make any difference as long as no host event is created.


>
>> What _is_ an optimization is keeping guest state loaded while KVM is in its
>> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
>> unconditional switches at entry/exit.  The performance of KVM would likely be
>> unacceptable for any production use cases, but that would give us motivation to
>> finish the job, and it doesn't result in random, hard to diagnose issues for
>> userspace.
> That's true. I agree with that.
>
>>>> Do we want to preempt that? I think it depends. For regular cloud
>>>> usage, we don't. But for any other usages where we want to prioritize
>>>> KVM/VMM profiling over guest vPMU, it is useful.
>>>>
>>>> My current opinion is that optimization 4 is something nice to have.
>>>> But we should allow people to turn it off just like we could choose to
>>>> disable preempt kernel.
>>> The exclude_guest means everything but the guest. I don't see a reason
>>> why people want to turn it off and get some random blind spots.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-28  0:58                                                                         ` Mi, Dapeng
@ 2024-04-28  6:01                                                                           ` Mingwei Zhang
  2024-04-29 17:44                                                                             ` Sean Christopherson
  0 siblings, 1 reply; 181+ messages in thread
From: Mingwei Zhang @ 2024-04-28  6:01 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Kan Liang, maobibo, Xiong Zhang, pbonzini,
	peterz, kan.liang, zhenyuw, jmattson, kvm, linux-perf-users,
	linux-kernel, zhiyuan.lv, eranian, irogers, samantha.alt,
	like.xu.linux, chao.gao

On Sat, Apr 27, 2024 at 5:59 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> > On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@google.com> wrote:
> >> On Fri, Apr 26, 2024, Kan Liang wrote:
> >>>> Optimization 4
> >>>> allows the host side to immediately profile this part instead of
> >>>> waiting for the vcpu to reach the PMU context switch locations. Doing so
> >>>> will generate more accurate results.
> >>> If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
> >>> definition of exclude_guest. Without 4, it brings some random blind
> >>> spots, right?
> >> +1, I view it as a hard requirement.  It's not an optimization, it's about
> >> accuracy and functional correctness.
> > Well. Does it have to be a _hard_ requirement? no? The irq handler
> > triggered by "perf record -a" could just inject a "state". Instead of
> > immediately preempting the guest PMU context, perf subsystem could
> > allow KVM defer the context switch when it reaches the next PMU
> > context switch location.
> >
> > This is the same as the preemption kernel logic. Do you want me to
> > stop the work immediately? Yes (if you enable preemption), or No, let
> > me finish my job and get to the scheduling point.
> >
> > Implementing this might be more difficult to debug. That's my real
> > concern. If we do not enable preemption, the PMU context switch will
> > only happen at the 2 pairs of locations. If we enable preemption, it
> > could happen at any time.
>
> IMO I don't prefer to add a switch to enable/disable the preemption. I
> think current implementation is already complicated enough and
> unnecessary to introduce an new parameter to confuse users. Furthermore,
> the switch could introduce an uncertainty and may mislead the perf user
> to read the perf stats incorrectly.  As for debug, it won't bring any
> difference as long as no host event is created.
>
That's ok. It is about opinions and brainstorming. Adding a parameter
to disable preemption is from the cloud usage perspective. The
conflict of opinions is which one you prioritize: guest PMU or the
host PMU? If you stand on the guest vPMU usage perspective, do you
want anyone on the host to shoot a profiling command and generate
turbulence? no. If you stand on the host PMU perspective and you want
to profile VMM/KVM, you definitely want accuracy and no delay at all.

Thanks.
-Mingwei
>
> >
> >> What _is_ an optimization is keeping guest state loaded while KVM is in its
> >> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
> >> unconditional switches at entry/exit.  The performance of KVM would likely be
> >> unacceptable for any production use cases, but that would give us motivation to
> >> finish the job, and it doesn't result in random, hard to diagnose issues for
> >> userspace.
> > That's true. I agree with that.
> >
> >>>> Do we want to preempt that? I think it depends. For regular cloud
> >>>> usage, we don't. But for any other usages where we want to prioritize
> >>>> KVM/VMM profiling over guest vPMU, it is useful.
> >>>>
> >>>> My current opinion is that optimization 4 is something nice to have.
> >>>> But we should allow people to turn it off just like we could choose to
> >>>> disable preempt kernel.
> >>> The exclude_guest means everything but the guest. I don't see a reason
> >>> why people want to turn it off and get some random blind spots.

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-27  3:04                                                                       ` Mingwei Zhang
  2024-04-28  0:58                                                                         ` Mi, Dapeng
@ 2024-04-29 13:08                                                                         ` Liang, Kan
  1 sibling, 0 replies; 181+ messages in thread
From: Liang, Kan @ 2024-04-29 13:08 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson
  Cc: Dapeng Mi, maobibo, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-04-26 11:04 p.m., Mingwei Zhang wrote:
> On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Fri, Apr 26, 2024, Kan Liang wrote:
>>>> Optimization 4
>>>> allows the host side to immediately profile this part instead of
>>>> waiting for the vcpu to reach the PMU context switch locations. Doing so
>>>> will generate more accurate results.
>>>
>>> If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
>>> definition of exclude_guest. Without 4, it brings some random blind
>>> spots, right?
>>
>> +1, I view it as a hard requirement.  It's not an optimization, it's about
>> accuracy and functional correctness.
> 
> Well. Does it have to be a _hard_ requirement? no? The irq handler
> triggered by "perf record -a" could just inject a "state". Instead of
> immediately preempting the guest PMU context, perf subsystem could
> allow KVM defer the context switch when it reaches the next PMU
> context switch location.

It depends on where the upcoming PMU context switch location is.
If it's the upcoming VM-exit/entry, deferring should be fine, because
it's an exclude_guest event and nothing should be counted while a VM is running.
If it's the upcoming vCPU boundary, no: there may be several
VM-exits/entries before the upcoming vCPU switch, and we may lose some results.
> 
> This is the same as the preemption kernel logic. Do you want me to
> stop the work immediately? Yes (if you enable preemption), or No, let
> me finish my job and get to the scheduling point.

I don't think that's necessary. Just making sure that the counters are
scheduled at the upcoming VM-exit/entry boundary should be fine.

Thanks,
Kan
> 
> Implementing this might be more difficult to debug. That's my real
> concern. If we do not enable preemption, the PMU context switch will
> only happen at the 2 pairs of locations. If we enable preemption, it
> could happen at any time.
> 
>>
>> What _is_ an optimization is keeping guest state loaded while KVM is in its
>> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
>> unconditional switches at entry/exit.  The performance of KVM would likely be
>> unacceptable for any production use cases, but that would give us motivation to
>> finish the job, and it doesn't result in random, hard to diagnose issues for
>> userspace.
> 
> That's true. I agree with that.
> 
>>
>>>> Do we want to preempt that? I think it depends. For regular cloud
>>>> usage, we don't. But for any other usages where we want to prioritize
>>>> KVM/VMM profiling over guest vPMU, it is useful.
>>>>
>>>> My current opinion is that optimization 4 is something nice to have.
>>>> But we should allow people to turn it off just like we could choose to
>>>> disable preempt kernel.
>>>
>>> The exclude_guest means everything but the guest. I don't see a reason
>>> why people want to turn it off and get some random blind spots.
> 

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-28  6:01                                                                           ` Mingwei Zhang
@ 2024-04-29 17:44                                                                             ` Sean Christopherson
  2024-05-01 17:43                                                                               ` Mingwei Zhang
  0 siblings, 1 reply; 181+ messages in thread
From: Sean Christopherson @ 2024-04-29 17:44 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Dapeng Mi, Kan Liang, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> On Sat, Apr 27, 2024 at 5:59 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >
> >
> > On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> > > On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > >> On Fri, Apr 26, 2024, Kan Liang wrote:
> > >>>> Optimization 4
> > >>>> allows the host side to immediately profile this part instead of
> > >>>> waiting for the vcpu to reach the PMU context switch locations. Doing so
> > >>>> will generate more accurate results.
> > >>> If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
> > >>> definition of exclude_guest. Without 4, it brings some random blind
> > >>> spots, right?
> > >> +1, I view it as a hard requirement.  It's not an optimization, it's about
> > >> accuracy and functional correctness.
> > > Well. Does it have to be a _hard_ requirement? no?

Assuming I understand how perf_event_open() works, which may be a fairly big
assumption, for me, yes, this is a hard requirement.

> > > The irq handler triggered by "perf record -a" could just inject a
> > > "state". Instead of immediately preempting the guest PMU context, perf
> > > subsystem could allow KVM defer the context switch when it reaches the
> > > next PMU context switch location.

FWIW, forcefully interrupting the guest isn't a hard requirement, but practically
speaking I think that will yield the simplest, most robust implementation.

> > > This is the same as the preemption kernel logic. Do you want me to
> > > stop the work immediately? Yes (if you enable preemption), or No, let
> > > me finish my job and get to the scheduling point.

Not really.  Task scheduling is by its nature completely exclusive, i.e. it's
not possible to concurrently run multiple tasks on a single logical CPU.  Given
a single CPU, to run task B, task A _must_ be scheduled out.

That is not the case here.  Profiling the host with exclude_guest=1 isn't mutually
exclusive with the guest using the PMU.  There's obviously the additional overhead
of switching PMU context, but the two uses themselves are not mutually exclusive.

And more importantly, perf_event_open() already has well-established ABI where it
can install events across CPUs.  And when perf_event_open() returns, userspace can
rely on the event being active and counting (assuming it wasn't disabled by default).
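
For reference, this is roughly what such a system-wide exclude_guest event
looks like from userspace with the existing ABI (the helper name is made up,
and error handling is omitted):

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Count host-only (exclude_guest) cycles on one CPU, for all tasks. */
  static int open_host_only_cycles(int cpu)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.exclude_guest = 1;   /* don't count while a guest is running */

          return syscall(SYS_perf_event_open, &attr, -1 /* pid: all tasks */,
                         cpu, -1 /* group_fd */, 0 /* flags */);
  }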

> > > Implementing this might be more difficult to debug. That's my real
> > > concern. If we do not enable preemption, the PMU context switch will
> > > only happen at the 2 pairs of locations. If we enable preemption, it
> > > could happen at any time.

Yes and no.  I agree that swapping guest/host state from IRQ context could lead
to hard to debug issues, but NOT doing so could also lead to hard to debug issues.
And even worse, those issues would likely be unique to specific kernel and/or
system configurations.

E.g. userspace creates an event, but sometimes it randomly doesn't count correctly.
Is the problem simply that it took a while for KVM to get to a scheduling point,
or is there a race lurking?  And what happens if the vCPU is the only runnable task
on its pCPU, i.e. never gets scheduled out?

Mix in all of the possible preemption and scheduler models, and other sources of
forced rescheduling, e.g. RCU, and the number of factors to account for becomes
quite terrifying.

> > IMO I don't prefer to add a switch to enable/disable the preemption. I
> > think current implementation is already complicated enough and
> > unnecessary to introduce an new parameter to confuse users. Furthermore,
> > the switch could introduce an uncertainty and may mislead the perf user
> > to read the perf stats incorrectly.

+1000.

> > As for debug, it won't bring any difference as long as no host event is created.
> >
> That's ok. It is about opinions and brainstorming. Adding a parameter
> to disable preemption is from the cloud usage perspective. The
> conflict of opinions is which one you prioritize: guest PMU or the
> host PMU? If you stand on the guest vPMU usage perspective, do you
> want anyone on the host to shoot a profiling command and generate
> turbulence? no. If you stand on the host PMU perspective and you want
> to profile VMM/KVM, you definitely want accuracy and no delay at all.

Hard no from me.  Attempting to support two fundamentally different models means
twice the maintenance burden.  The *best* case scenario is that usage is roughly
a 50/50 split.  The worst case scenario is that the majority of users favor one
model over the other, thus resulting in extremely limited testing of the minority
model.

KVM already has this problem with scheduler preemption models, and it's painful.
The overwhelming majority of KVM users run non-preemptible kernels, and so our
test coverage for preemptible kernels is abysmal.

E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
preemption models resulted in users running code that was supposed to be specific
to preemptible kernels.

[* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-04-29 17:44                                                                             ` Sean Christopherson
@ 2024-05-01 17:43                                                                               ` Mingwei Zhang
  2024-05-01 18:00                                                                                 ` Liang, Kan
  2024-05-01 20:36                                                                                 ` Sean Christopherson
  0 siblings, 2 replies; 181+ messages in thread
From: Mingwei Zhang @ 2024-05-01 17:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dapeng Mi, Kan Liang, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Mon, Apr 29, 2024 at 10:44 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> > On Sat, Apr 27, 2024 at 5:59 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> > >
> > >
> > > On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> > > > On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >> On Fri, Apr 26, 2024, Kan Liang wrote:
> > > >>>> Optimization 4
> > > >>>> allows the host side to immediately profile this part instead of
> > > >>>> waiting for the vcpu to reach the PMU context switch locations. Doing so
> > > >>>> will generate more accurate results.
> > > >>> If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
> > > >>> definition of exclude_guest. Without 4, it brings some random blind
> > > >>> spots, right?
> > > >> +1, I view it as a hard requirement.  It's not an optimization, it's about
> > > >> accuracy and functional correctness.
> > > > Well. Does it have to be a _hard_ requirement? no?
>
> Assuming I understand how perf_event_open() works, which may be a fairly big
> assumption, for me, yes, this is a hard requirement.
>
> > > > The irq handler triggered by "perf record -a" could just inject a
> > > > "state". Instead of immediately preempting the guest PMU context, perf
> > > > subsystem could allow KVM defer the context switch when it reaches the
> > > > next PMU context switch location.
>
> FWIW, forcefully interrupting the guest isn't a hard requirement, but practically
> speaking I think that will yield the simplest, most robust implementation.
>
> > > > This is the same as the preemption kernel logic. Do you want me to
> > > > stop the work immediately? Yes (if you enable preemption), or No, let
> > > > me finish my job and get to the scheduling point.
>
> Not really.  Task scheduling is by its nature completely exclusive, i.e. it's
> not possible to concurrently run multiple tasks on a single logical CPU.  Given
> a single CPU, to run task B, task A _must_ be scheduled out.
>
> That is not the case here.  Profiling the host with exclude_guest=1 isn't mutually
> exclusive with the guest using the PMU.  There's obviously the additional overhead
> of switching PMU context, but the two uses themselves are not mutually exclusive.
>
> And more importantly, perf_event_open() already has well-established ABI where it
> can install events across CPUs.  And when perf_event_open() returns, userspace can
> rely on the event being active and counting (assuming it wasn't disabled by default).
>
> > > > Implementing this might be more difficult to debug. That's my real
> > > > concern. If we do not enable preemption, the PMU context switch will
> > > > only happen at the 2 pairs of locations. If we enable preemption, it
> > > > could happen at any time.
>
> Yes and no.  I agree that swapping guest/host state from IRQ context could lead
> to hard to debug issues, but NOT doing so could also lead to hard to debug issues.
> And even worse, those issues would likely be unique to specific kernel and/or
> system configurations.
>
> E.g. userspace creates an event, but sometimes it randomly doesn't count correctly.
> Is the problem simply that it took a while for KVM to get to a scheduling point,
> or is there a race lurking?  And what happens if the vCPU is the only runnable task
> on its pCPU, i.e. never gets scheduled out?
>
> Mix in all of the possible preemption and scheduler models, and other sources of
> forced rescheduling, e.g. RCU, and the number of factors to account for becomes
> quite terrifying.
>
> > > IMO I don't prefer to add a switch to enable/disable the preemption. I
> > > think current implementation is already complicated enough and
> > > unnecessary to introduce an new parameter to confuse users. Furthermore,
> > > the switch could introduce an uncertainty and may mislead the perf user
> > > to read the perf stats incorrectly.
>
> +1000.
>
> > > As for debug, it won't bring any difference as long as no host event is created.
> > >
> > That's ok. It is about opinions and brainstorming. Adding a parameter
> > to disable preemption is from the cloud usage perspective. The
> > conflict of opinions is which one you prioritize: guest PMU or the
> > host PMU? If you stand on the guest vPMU usage perspective, do you
> > want anyone on the host to shoot a profiling command and generate
> > turbulence? no. If you stand on the host PMU perspective and you want
> > to profile VMM/KVM, you definitely want accuracy and no delay at all.
>
> Hard no from me.  Attempting to support two fundamentally different models means
> twice the maintenance burden.  The *best* case scenario is that usage is roughly
> a 50/50 split.  The worst case scenario is that the majority of users favor one
> model over the other, thus resulting in extremely limited testing of the minority
> model.
>
> KVM already has this problem with scheduler preemption models, and it's painful.
> The overwhelming majority of KVM users run non-preemptible kernels, and so our
> test coverage for preemptible kernels is abysmal.
>
> E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
> unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
> preemption models resulted in users running code that was supposed to be specific
> to preemptible kernels.
>
> [* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
>

I hear your voice, Sean.

In our cloud, we have host-level profiling going on for all cores
periodically. It profiles for X seconds every Y minutes. Having
the host-level profiling use exclude_guest is fine, but stopping the
host-level profiling is a no-no. Tweaking X and Y is theoretically
possible, but very likely outside the scope of virtualization. Now,
some of the VMs might be actively using the vPMU at the same time. How can
we properly ensure the guest vPMU has consistent performance, instead
of letting the VM suffer from the high overhead of the PMU for X seconds
of every Y minutes?

Any thoughts/help are appreciated. I see the logic of having preemption
there for the correctness of host-level profiling. Doing this,
however, negatively impacts the business usage above.

One of the things top of mind is that there seems to be no way
for the perf subsystem to express this: "no, your host-level profiling
is not interested in profiling the KVM_RUN loop when our guest vPMU is
actively running".

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-05-01 17:43                                                                               ` Mingwei Zhang
@ 2024-05-01 18:00                                                                                 ` Liang, Kan
  2024-05-01 20:36                                                                                 ` Sean Christopherson
  1 sibling, 0 replies; 181+ messages in thread
From: Liang, Kan @ 2024-05-01 18:00 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson
  Cc: Dapeng Mi, maobibo, Xiong Zhang, pbonzini, peterz, kan.liang,
	zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao



On 2024-05-01 1:43 p.m., Mingwei Zhang wrote:
> One of the things on top of the mind is that: there seems to be no way
> for the perf subsystem to express this: "no, your host-level profiling
> is not interested in profiling the KVM_RUN loop when our guest vPMU is
> actively running".

exclude_hv? Although it seems the option is not well supported on X86
for now.
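
(In perf_event_attr terms that is just one more bit, e.g. attr.exclude_hv = 1
alongside exclude_guest, with the above caveat about how well x86 honors it.)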

Thanks,
Kan

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-05-01 17:43                                                                               ` Mingwei Zhang
  2024-05-01 18:00                                                                                 ` Liang, Kan
@ 2024-05-01 20:36                                                                                 ` Sean Christopherson
  1 sibling, 0 replies; 181+ messages in thread
From: Sean Christopherson @ 2024-05-01 20:36 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Dapeng Mi, Kan Liang, maobibo, Xiong Zhang, pbonzini, peterz,
	kan.liang, zhenyuw, jmattson, kvm, linux-perf-users, linux-kernel,
	zhiyuan.lv, eranian, irogers, samantha.alt, like.xu.linux,
	chao.gao

On Wed, May 01, 2024, Mingwei Zhang wrote:
> On Mon, Apr 29, 2024 at 10:44 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> > > That's ok. It is about opinions and brainstorming. Adding a parameter
> > > to disable preemption is from the cloud usage perspective. The
> > > conflict of opinions is which one you prioritize: guest PMU or the
> > > host PMU? If you stand on the guest vPMU usage perspective, do you
> > > want anyone on the host to shoot a profiling command and generate
> > > turbulence? no. If you stand on the host PMU perspective and you want
> > > to profile VMM/KVM, you definitely want accuracy and no delay at all.
> >
> > Hard no from me.  Attempting to support two fundamentally different models means
> > twice the maintenance burden.  The *best* case scenario is that usage is roughly
> > a 50/50 split.  The worst case scenario is that the majority of users favor one
> > model over the other, thus resulting in extremely limited testing of the minority
> > model.
> >
> > KVM already has this problem with scheduler preemption models, and it's painful.
> > The overwhelming majority of KVM users run non-preemptible kernels, and so our
> > test coverage for preemptible kernels is abysmal.
> >
> > E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
> > unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
> > preemption models resulted in users running code that was supposed to be specific
> > to preemptible kernels.
> >
> > [* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
> >
> 
> I hear your voice, Sean.
> 
> In our cloud, we have a host-level profiling going on for all cores
> periodically. It will be profiling X seconds every Y minute. Having
> the host-level profiling using exclude_guest is fine, but stopping the
> host-level profiling is a no no. Tweaking the X and Y is theoretically
> possible, but highly likely out of the scope of virtualization. Now,
> some of the VMs might be actively using vPMU at the same time. How can
> we properly ensure the guest vPMU has consistent performance? Instead
> of letting the VM suffer from the high overhead of PMU for X seconds
> of every Y minute?
> 
> Any thought/help is appreciated. I see the logic of having preemption
> there for correctness of the profiling on the host level. Doing this,
> however, negatively impacts the above business usage.
> 
> One of the things on top of the mind is that: there seems to be no way
> for the perf subsystem to express this: "no, your host-level profiling
> is not interested in profiling the KVM_RUN loop when our guest vPMU is
> actively running".

For good reason, IMO.  The KVM_RUN loop can reach _far_ outside of KVM, especially
when IRQs and NMIs are involved.  I don't think anyone can reasonably say that
profiling is never interested in what happens while a task is in KVM_RUN.  E.g. if
there's a bottleneck in some memory allocation flow that happens to be triggered
in the greater KVM_RUN loop, that's something we'd want to show up in our profiling
data.

And if our systems are properly configured, for VMs with a mediated/passthrough
PMU, 99.99999% of their associated pCPU's time should be spent in KVM_RUN.  If
that's our reality, what's the point of profiling if KVM_RUN is out of scope?

We could make the context switching logic more sophisticated, e.g. trigger a
context switch when control leaves KVM, a la the ASI concepts, but that's all but
guaranteed to be overkill, and would have a very high maintenance cost.

But we can likely get what we want (low observed overhead from the guest) while
still context switching PMU state in vcpu_enter_guest().  KVM already handles the
hottest VM-Exit reasons in its fastpath, i.e. without triggering a PMU context
switch.  For a variety of reasons, I think we should be more aggressive and handle
more VM-Exits in the fastpath, e.g. I can't think of any reason KVM can't handle
fast page faults in the fastpath.

If we handle that overwhelming majority of VM-Exits in the fastpath when the guest
is already booted, e.g. when vCPUs aren't taking a high number of "slow" VM-Exits,
then the fact that slow VM-Exits trigger a PMU context switch should be a non-issue,
because taking a slow exit would be a rare operation.

I.e. rather than solving the overhead problem by moving around the context switch
logic, solve the problem by moving KVM code inside the "guest PMU" section.  It's
essentially a different way of doing the same thing, with the critical difference
being that only hand-selected flows are excluded from profiling, i.e. only the
flows that need to be blazing fast and should be uninteresting from a profiling
perspective are excluded.
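
A sketch of that shape, with illustrative names only (this is not KVM's
actual fastpath plumbing):

  /* Toy model: only "slow" exits leave the guest-PMU section. */
  enum fastpath { EXIT_SLOWPATH, EXIT_REENTER_GUEST };

  static void load_guest_pmu(void) { /* guest counters become live */ }
  static void load_host_pmu(void)  { /* host perf events resume */ }
  static void enter_guest(void)    { /* VM-entry ... guest runs ... VM-exit */ }

  static enum fastpath handle_exit_fastpath(void)
  {
          /* Handle the hottest exit reasons without touching PMU state. */
          return EXIT_SLOWPATH;
  }

  static void vcpu_run(void)
  {
          load_guest_pmu();             /* enter the "guest PMU" section */
          do {
                  enter_guest();
          } while (handle_exit_fastpath() == EXIT_REENTER_GUEST);
          load_host_pmu();              /* only slow exits pay for the switch */
          /* ... slow-exit handling runs with host PMU state loaded ... */
  }

In this model, host profiling sees the slow-exit handling but never the hot
fastpath loop, which matches the "only hand-selected flows are excluded"
outcome described above.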

^ permalink raw reply	[flat|nested] 181+ messages in thread

end of thread, other threads:[~2024-05-01 20:36 UTC | newest]

Thread overview: 181+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-26  8:54 [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 01/41] perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH Xiong Zhang
2024-04-11 17:04   ` Sean Christopherson
2024-04-11 17:21     ` Liang, Kan
2024-04-11 17:24       ` Jim Mattson
2024-04-11 17:46         ` Sean Christopherson
2024-04-11 19:13           ` Liang, Kan
2024-04-11 20:43             ` Sean Christopherson
2024-04-11 21:04               ` Liang, Kan
2024-04-11 19:32           ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 02/41] perf: Support guest enter/exit interfaces Xiong Zhang
2024-03-20 16:40   ` Raghavendra Rao Ananta
2024-03-20 17:12     ` Liang, Kan
2024-04-11 18:06   ` Sean Christopherson
2024-04-11 19:53     ` Liang, Kan
2024-04-12 19:17       ` Sean Christopherson
2024-04-12 20:56         ` Liang, Kan
2024-04-15 16:03           ` Liang, Kan
2024-04-16  5:34             ` Zhang, Xiong Y
2024-04-16 12:48               ` Liang, Kan
2024-04-17  9:42                 ` Zhang, Xiong Y
2024-04-18 16:11                   ` Sean Christopherson
2024-04-19  1:37                     ` Zhang, Xiong Y
2024-04-26  4:09       ` Zhang, Xiong Y
2024-01-26  8:54 ` [RFC PATCH 03/41] perf: Set exclude_guest onto nmi_watchdog Xiong Zhang
2024-04-11 18:56   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 04/41] perf: core/x86: Add support to register a new vector for PMI handling Xiong Zhang
2024-04-11 17:10   ` Sean Christopherson
2024-04-11 19:05     ` Sean Christopherson
2024-04-12  3:56     ` Zhang, Xiong Y
2024-04-13  1:17       ` Mi, Dapeng
2024-01-26  8:54 ` [RFC PATCH 05/41] KVM: x86/pmu: Register PMI handler for passthrough PMU Xiong Zhang
2024-04-11 19:07   ` Sean Christopherson
2024-04-12  5:44     ` Zhang, Xiong Y
2024-01-26  8:54 ` [RFC PATCH 06/41] perf: x86: Add function to switch PMI handler Xiong Zhang
2024-04-11 19:17   ` Sean Christopherson
2024-04-11 19:34     ` Sean Christopherson
2024-04-12  6:03       ` Zhang, Xiong Y
2024-04-12  5:57     ` Zhang, Xiong Y
2024-01-26  8:54 ` [RFC PATCH 07/41] perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW Xiong Zhang
2024-04-11 19:21   ` Sean Christopherson
2024-04-12  6:17     ` Zhang, Xiong Y
2024-01-26  8:54 ` [RFC PATCH 08/41] KVM: x86/pmu: Add get virtual LVTPC_MASK bit function Xiong Zhang
2024-04-11 19:22   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 09/41] perf: core/x86: Forbid PMI handler when guest own PMU Xiong Zhang
2024-04-11 19:26   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 10/41] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 11/41] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and propage to KVM instance Xiong Zhang
2024-04-11 20:54   ` Sean Christopherson
2024-04-11 21:03   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 12/41] KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs Xiong Zhang
2024-04-11 20:57   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 13/41] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 14/41] KVM: x86/pmu: Allow RDPMC pass through Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 15/41] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Xiong Zhang
2024-04-11 21:21   ` Sean Christopherson
2024-04-11 22:30     ` Jim Mattson
2024-04-11 23:27       ` Sean Christopherson
2024-04-13  2:10       ` Mi, Dapeng
2024-01-26  8:54 ` [RFC PATCH 16/41] KVM: x86/pmu: Create a function prototype to disable MSR interception Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 17/41] KVM: x86/pmu: Implement pmu function for Intel CPU " Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 18/41] KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with perf capabilities Xiong Zhang
2024-04-11 21:23   ` Sean Christopherson
2024-04-11 21:50     ` Jim Mattson
2024-04-12 16:01       ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 19/41] KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 20/41] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 21/41] KVM: x86/pmu: Introduce function prototype for Intel CPU to " Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 22/41] x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Xiong Zhang
2024-04-11 21:26   ` Sean Christopherson
2024-04-13  2:29     ` Mi, Dapeng
2024-04-11 21:44   ` Sean Christopherson
2024-04-11 22:19     ` Jim Mattson
2024-04-11 23:31       ` Sean Christopherson
2024-04-13  3:19         ` Mi, Dapeng
2024-04-13  3:03     ` Mi, Dapeng
2024-04-13  3:34       ` Mingwei Zhang
2024-04-13  4:25         ` Mi, Dapeng
2024-04-15  6:06           ` Mingwei Zhang
2024-04-15 10:04             ` Mi, Dapeng
2024-04-15 16:44               ` Mingwei Zhang
2024-04-15 17:38                 ` Sean Christopherson
2024-04-15 17:54                   ` Mingwei Zhang
2024-04-15 22:45                     ` Sean Christopherson
2024-04-22  2:14                       ` maobibo
2024-04-22 17:01                         ` Sean Christopherson
2024-04-23  1:01                           ` maobibo
2024-04-23  2:44                             ` Mi, Dapeng
2024-04-23  2:53                               ` maobibo
2024-04-23  3:13                                 ` Mi, Dapeng
2024-04-23  3:26                                   ` maobibo
2024-04-23  3:59                                     ` Mi, Dapeng
2024-04-23  3:55                                   ` maobibo
2024-04-23  4:23                                     ` Mingwei Zhang
2024-04-23  6:08                                       ` maobibo
2024-04-23  6:45                                         ` Mi, Dapeng
2024-04-23  7:10                                           ` Mingwei Zhang
2024-04-23  8:24                                             ` Mi, Dapeng
2024-04-23  8:51                                               ` maobibo
2024-04-23 16:50                                               ` Mingwei Zhang
2024-04-23 12:12                                       ` maobibo
2024-04-23 17:02                                         ` Mingwei Zhang
2024-04-24  1:07                                           ` maobibo
2024-04-24  8:18                                           ` Mi, Dapeng
2024-04-24 15:00                                             ` Sean Christopherson
2024-04-25  3:55                                               ` Mi, Dapeng
2024-04-25  4:24                                                 ` Mingwei Zhang
2024-04-25 16:13                                                   ` Liang, Kan
2024-04-25 20:16                                                     ` Mingwei Zhang
2024-04-25 20:43                                                       ` Liang, Kan
2024-04-25 21:46                                                         ` Sean Christopherson
2024-04-26  1:46                                                           ` Mi, Dapeng
2024-04-26  3:12                                                             ` Mingwei Zhang
2024-04-26  4:02                                                               ` Mi, Dapeng
2024-04-26  4:46                                                                 ` Mingwei Zhang
2024-04-26 14:09                                                               ` Liang, Kan
2024-04-26 18:41                                                                 ` Mingwei Zhang
2024-04-26 19:06                                                                   ` Liang, Kan
2024-04-26 19:46                                                                     ` Sean Christopherson
2024-04-27  3:04                                                                       ` Mingwei Zhang
2024-04-28  0:58                                                                         ` Mi, Dapeng
2024-04-28  6:01                                                                           ` Mingwei Zhang
2024-04-29 17:44                                                                             ` Sean Christopherson
2024-05-01 17:43                                                                               ` Mingwei Zhang
2024-05-01 18:00                                                                                 ` Liang, Kan
2024-05-01 20:36                                                                                 ` Sean Christopherson
2024-04-29 13:08                                                                         ` Liang, Kan
2024-04-26 13:53                                                           ` Liang, Kan
2024-04-26  1:50                                                         ` Mi, Dapeng
2024-04-18 21:21                   ` Mingwei Zhang
2024-04-18 21:41                     ` Mingwei Zhang
2024-04-19  1:02                     ` Mi, Dapeng
2024-01-26  8:54 ` [RFC PATCH 24/41] KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid information leakage Xiong Zhang
2024-04-11 21:36   ` Sean Christopherson
2024-04-11 21:56     ` Jim Mattson
2024-01-26  8:54 ` [RFC PATCH 25/41] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 26/41] KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU capability Xiong Zhang
2024-04-11 21:49   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 27/41] KVM: x86/pmu: Clear PERF_METRICS MSR for guest Xiong Zhang
2024-04-11 21:50   ` Sean Christopherson
2024-04-13  3:29     ` Mi, Dapeng
2024-01-26  8:54 ` [RFC PATCH 28/41] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Xiong Zhang
2024-04-11 21:54   ` Sean Christopherson
2024-04-11 22:10     ` Jim Mattson
2024-04-11 22:54       ` Sean Christopherson
2024-04-11 23:08         ` Jim Mattson
2024-01-26  8:54 ` [RFC PATCH 29/41] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 30/41] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 31/41] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 32/41] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 33/41] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 34/41] KVM: x86/pmu: Intercept EVENT_SELECT MSR Xiong Zhang
2024-04-11 21:55   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 35/41] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 36/41] KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR Xiong Zhang
2024-04-11 21:56   ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 37/41] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Xiong Zhang
2024-04-11 22:03   ` Sean Christopherson
2024-04-13  4:12     ` Mi, Dapeng
2024-01-26  8:54 ` [RFC PATCH 38/41] KVM: x86/pmu: Introduce PMU helper to increment counter Xiong Zhang
2024-01-26  8:54 ` [RFC PATCH 39/41] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Xiong Zhang
2024-04-11 23:12   ` Sean Christopherson
2024-04-11 23:17     ` Sean Christopherson
2024-01-26  8:54 ` [RFC PATCH 40/41] KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from non-passthrough vPMU Xiong Zhang
2024-04-11 23:18   ` Sean Christopherson
2024-04-18 21:54     ` Mingwei Zhang
2024-01-26  8:54 ` [RFC PATCH 41/41] KVM: nVMX: Add nested virtualization support for passthrough PMU Xiong Zhang
2024-04-11 23:21   ` Sean Christopherson
2024-04-11 17:03 ` [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM Sean Christopherson
2024-04-12  2:19   ` Zhang, Xiong Y
2024-04-12 18:32     ` Sean Christopherson
2024-04-15  1:06       ` Zhang, Xiong Y
2024-04-15 15:05         ` Sean Christopherson
2024-04-16  5:11           ` Zhang, Xiong Y
2024-04-18 20:46   ` Mingwei Zhang
2024-04-18 21:52     ` Mingwei Zhang
2024-04-19 19:14     ` Sean Christopherson
2024-04-19 22:02       ` Mingwei Zhang
2024-04-11 23:25 ` Sean Christopherson
2024-04-11 23:56   ` Mingwei Zhang
