Re: [PATCH v2] KVM: Mark a vCPU as preempted/ready iff it's scheduled out while running

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: David Matlack <dmatlack@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org
Subject: Re: [PATCH v2] KVM: Mark a vCPU as preempted/ready iff it's scheduled out while running
Date: Fri, 26 Apr 2024 14:01:15 -0700	[thread overview]
Message-ID: <ZiwWG4iHQYREwFP2@google.com> (raw)
In-Reply-To: <20240307163541.92138-1-dmatlack@google.com>

On Thu, Mar 07, 2024, David Matlack wrote:
> Mark a vCPU as preempted/ready if-and-only-if it's scheduled out while
> running. i.e. Do not mark a vCPU preempted/ready if it's scheduled out
> during a non-KVM_RUN ioctl() or when userspace is doing KVM_RUN with
> immediate_exit.
> 
> Commit 54aa83c90198 ("KVM: x86: do not set st->preempted when going back
> to user space") stopped marking a vCPU as preempted when returning to
> userspace, but if userspace then invokes a KVM vCPU ioctl() that gets
> preempted, the vCPU will be marked preempted/ready. This is arguably
> incorrect behavior since the vCPU was not actually preempted while the
> guest was running, it was preempted while doing something on behalf of
> userspace.
> 
> This commit also avoids KVM dirtying guest memory after userspace has
> paused vCPUs, e.g. for Live Migration, which allows userspace to collect
> the final dirty bitmap before or in parallel with saving vCPU state
> without having to worry about saving vCPU state triggering writes to
> guest memory.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
> v2:
>  - Drop Google-specific "PRODKERNEL: " shortlog prefix
> 
> v1: https://lore.kernel.org/kvm/20231218185850.1659570-1-dmatlack@google.com/
> 
>  include/linux/kvm_host.h | 1 +
>  virt/kvm/kvm_main.c      | 5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..5b2300614d22 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -378,6 +378,7 @@ struct kvm_vcpu {
>  		bool dy_eligible;
>  	} spin_loop;
>  #endif
> +	bool wants_to_run;
>  	bool preempted;
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff588677beb7..3da1b2e3785d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4438,7 +4438,10 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  				synchronize_rcu();
>  			put_pid(oldpid);
>  		}
> +		vcpu->wants_to_run = !vcpu->run->immediate_exit;

>  		r = kvm_arch_vcpu_ioctl_run(vcpu);
> +		vcpu->wants_to_run = false;
> +
>  		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>  		break;
>  	}
> @@ -6312,7 +6315,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>  {
>  	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>  
> -	if (current->on_rq) {
> +	if (current->on_rq && vcpu->wants_to_run) {
>  		WRITE_ONCE(vcpu->preempted, true);
>  		WRITE_ONCE(vcpu->ready, true);
>  	}
> 
> base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3

Long story short, I was playing around with wants_to_run for a few hairbrained
ideas, and realized that there's a TOCTOU bug here.  Userspace can toggle
run->immediate_exit at will, e.g. can clear it after the kernel loads it to
compute vcpu->wants_to_run.

That's not fatal for this use case, since userspace would only be shooting itself
in the foot, but it leaves a very dangerous landmine, e.g. if something else in
KVM keys off of vcpu->wants_to_run to detect that KVM is in its run loop, i.e.
relies on wants_to_run being set if KVM is in its core run loop.

To address that, I think we should have all architectures check wants_to_run, not
immediate_exit.  And loading immediate_exit needs to be done with READ_ONCE().

E.g. for x86 (every other arch has similar code)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e9ef1fa4b90b..1a2f6bf14fb2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11396,7 +11396,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
        kvm_vcpu_srcu_read_lock(vcpu);
        if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
-               if (kvm_run->immediate_exit) {
+               if (!vcpu->wants_to_run) {
                        r = -EINTR;
                        goto out;
                }
@@ -11474,7 +11474,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
                WARN_ON_ONCE(vcpu->mmio_needed);
        }
 
-       if (kvm_run->immediate_exit) {
+       if (!vcpu->wants_to_run) {
                r = -EINTR;
                goto out;
        }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f9b9ce0c3cd9..0c0aae224000 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1497,9 +1497,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
                                        struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
 
-void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
-
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9501fbd5dfd2..4384bbdba65c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4410,7 +4410,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
                                synchronize_rcu();
                        put_pid(oldpid);
                }
-               vcpu->wants_to_run = !vcpu->run->immediate_exit;
+               vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);
                r = kvm_arch_vcpu_ioctl_run(vcpu);
                vcpu->wants_to_run = false;


---

Hmm, and we should probably go a step further and actively prevent using
immediate_exit from the kernel, e.g. rename it to something scary like:

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..9c5fe1dae744 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -196,7 +196,11 @@ struct kvm_xen_exit {
 struct kvm_run {
        /* in */
        __u8 request_interrupt_window;
+#ifndef __KERNEL__
        __u8 immediate_exit;
+#else
+       __u8 hidden_do_not_touch;
+#endif
        __u8 padding1[6];
 
        /* out */

next prev parent reply	other threads:[~2024-04-26 21:01 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-07 16:35 [PATCH v2] KVM: Mark a vCPU as preempted/ready iff it's scheduled out while running David Matlack
2024-04-02 16:41 ` David Matlack
2024-04-26 21:01 ` Sean Christopherson [this message]
2024-04-29 17:22   ` David Matlack
2024-04-29 18:05     ` Sean Christopherson

find likely ancestor, descendant, or conflicting patches for this message:
dfblob:e9ef1fa4b90 dfblob:f9b9ce0c3cd dfblob:9501fbd5dfd
dfblob:2190adbe300 dfblob:1a2f6bf14fb dfblob:0c0aae22400
dfblob:4384bbdba65 dfblob:9c5fe1dae74
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZiwWG4iHQYREwFP2@google.com \
    --to=seanjc@google.com \
    --cc=dmatlack@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.