KVM: arm64: Mark the VM as dead for failed initializations

Message ID	20241025221220.2985227-1-rananta@google.com (mailing list archive)
State	New
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> Date: Fri, 25 Oct 2024 22:12:20 +0000 Mime-Version: 1.0 Message-ID: <20241025221220.2985227-1-rananta@google.com> Subject: [PATCH] KVM: arm64: Mark the VM as dead for failed initializations From: Raghavendra Rao Ananta <rananta@google.com> To: Oliver Upton <oliver.upton@linux.dev>, Marc Zyngier <maz@kernel.org> Cc: Raghavendra Rao Anata <rananta@google.com>, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, stable@vger.kernel.org, syzbot <syzkaller@googlegroups.com> Content-Type: text/plain; charset="UTF-8" Precedence: list Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	KVM: arm64: Mark the VM as dead for failed initializations \| expand KVM: arm64: Mark the VM as dead for failed initializations

Raghavendra Rao Ananta Oct. 25, 2024, 10:12 p.m. UTC

Syzbot hit the following WARN_ON() in kvm_timer_update_irq():

WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
  kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
  kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
  kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
  kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
  kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
  kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
  kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
  __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
  invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
  el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
  do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
  el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
  el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
  el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598

The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is
invoked after a failed first KVM_RUN. In a general sense though, since
kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past
initiatializations, it's possible that the VM's state could be left
half-baked. Any upcoming ioctls could behave erroneously because of
this.

Since these late vCPU initializations is past the point of attributing
the failures to any ioctl, instead of tearing down each of the previous
setups, simply mark the VM as dead, gving an opportunity for the
userspace to close and try again.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
---
 arch/arm64/kvm/arm.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Oliver Upton Oct. 26, 2024, 5:34 a.m. UTC | #1

Hi Raghu,

Thanks for posting this fix.

On Fri, Oct 25, 2024 at 10:12:20PM +0000, Raghavendra Rao Ananta wrote:
> Syzbot hit the following WARN_ON() in kvm_timer_update_irq():
> 
> WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
> kvm_timer_update_irq+0x21c/0x394
> Call trace:
>   kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
>   kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
>   kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
>   kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
>   kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
>   kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
>   kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
>   vfs_ioctl fs/ioctl.c:51 [inline]
>   __do_sys_ioctl fs/ioctl.c:907 [inline]
>   __se_sys_ioctl fs/ioctl.c:893 [inline]
>   __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
>   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>   invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
>   el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
>   do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
>   el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
>   el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
>   el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
> 
> The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is
> invoked after a failed first KVM_RUN. In a general sense though, since
> kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past
> initiatializations, it's possible that the VM's state could be left

typo: initializations

> half-baked. Any upcoming ioctls could behave erroneously because of
> this.

You may want to highlight a bit more strongly that, despite the name,
we do a lot of late *VM* state initialization in kvm_arch_vcpu_run_pid_change().

When that goes sideways we're left with few choices besides bugging the
VM or gracefully tearing down state, potentially w/ concurrent users.

> Since these late vCPU initializations is past the point of attributing
> the failures to any ioctl, instead of tearing down each of the previous
> setups, simply mark the VM as dead, gving an opportunity for the
> userspace to close and try again.
> 
> Cc: <stable@vger.kernel.org>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Suggested-by: Oliver Upton <oliver.upton@linux.dev>

I definitely recommended this to you, so blame *me* for imposing some
toil on you with the following.

> @@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
>  
>  	ret = kvm_timer_enable(vcpu);
>  	if (ret)
> -		return ret;
> +		goto out_err;
>  
>  	ret = kvm_arm_pmu_v3_enable(vcpu);
>  	if (ret)
> -		return ret;
> +		goto out_err;
>  
>  	if (is_protected_kvm_enabled()) {
>  		ret = pkvm_create_hyp_vm(kvm);
>  		if (ret)
> -			return ret;
> +			goto out_err;
>  	}
>  
>  	if (!irqchip_in_kernel(kvm)) {
> @@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
>  	mutex_unlock(&kvm->arch.config_lock);
>  
>  	return ret;
> +
> +out_err:
> +	kvm_vm_dead(kvm);
> +	return ret;
>  }

After rereading, I think we could benefit from a more distinct separation
of late VM vs. vCPU state initialization.

Bugging the VM is a big hammer, we should probably only resort to that
when the VM state is screwed up badly.

Otherwise, for screwed up vCPU state we could uninitialize the vCPU and
let userspace try again. An example of this is how we deal with VMs that
run 32 bit userspace when KVM tries to hide the feature.

WDYT?

Marc Zyngier Oct. 26, 2024, 7:43 a.m. UTC | #2

Hi both,

On Sat, 26 Oct 2024 06:34:23 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Hi Raghu,
> 
> Thanks for posting this fix.
> 
> On Fri, Oct 25, 2024 at 10:12:20PM +0000, Raghavendra Rao Ananta wrote:
> > Syzbot hit the following WARN_ON() in kvm_timer_update_irq():
> > 
> > WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
> > kvm_timer_update_irq+0x21c/0x394
> > Call trace:
> >   kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
> >   kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
> >   kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
> >   kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
> >   kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
> >   kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
> >   kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
> >   vfs_ioctl fs/ioctl.c:51 [inline]
> >   __do_sys_ioctl fs/ioctl.c:907 [inline]
> >   __se_sys_ioctl fs/ioctl.c:893 [inline]
> >   __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
> >   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
> >   invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
> >   el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
> >   do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
> >   el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
> >   el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
> >   el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
> > 
> > The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is
> > invoked after a failed first KVM_RUN. In a general sense though, since
> > kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past
> > initiatializations, it's possible that the VM's state could be left
> 
> typo: initializations
> 
> > half-baked. Any upcoming ioctls could behave erroneously because of
> > this.
> 
> You may want to highlight a bit more strongly that, despite the name,
> we do a lot of late *VM* state initialization in kvm_arch_vcpu_run_pid_change().
> 
> When that goes sideways we're left with few choices besides bugging the
> VM or gracefully tearing down state, potentially w/ concurrent users.
> 
> > Since these late vCPU initializations is past the point of attributing
> > the failures to any ioctl, instead of tearing down each of the previous
> > setups, simply mark the VM as dead, gving an opportunity for the
> > userspace to close and try again.
> > 
> > Cc: <stable@vger.kernel.org>
> > Reported-by: syzbot <syzkaller@googlegroups.com>
> > Suggested-by: Oliver Upton <oliver.upton@linux.dev>
> 
> I definitely recommended this to you, so blame *me* for imposing some
> toil on you with the following.
> 
> > @@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
> >  
> >  	ret = kvm_timer_enable(vcpu);
> >  	if (ret)
> > -		return ret;
> > +		goto out_err;
> >  
> >  	ret = kvm_arm_pmu_v3_enable(vcpu);
> >  	if (ret)
> > -		return ret;
> > +		goto out_err;
> >  
> >  	if (is_protected_kvm_enabled()) {
> >  		ret = pkvm_create_hyp_vm(kvm);
> >  		if (ret)
> > -			return ret;
> > +			goto out_err;
> >  	}
> >  
> >  	if (!irqchip_in_kernel(kvm)) {
> > @@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
> >  	mutex_unlock(&kvm->arch.config_lock);
> >  
> >  	return ret;
> > +
> > +out_err:
> > +	kvm_vm_dead(kvm);
> > +	return ret;
> >  }
> 
> After rereading, I think we could benefit from a more distinct separation
> of late VM vs. vCPU state initialization.
> 
> Bugging the VM is a big hammer, we should probably only resort to that
> when the VM state is screwed up badly.
>
> Otherwise, for screwed up vCPU state we could uninitialize the vCPU and
> let userspace try again. An example of this is how we deal with VMs that
> run 32 bit userspace when KVM tries to hide the feature.

I tend to agree. We shouldn't make a misconfiguration fatal unless we
know for sure that the state is not recoverable (such as a screwed up
memory map).

When it comes to this particular issue, I wonder why the (maybe overly
simplistic) hack below isn't enough. The userspace_irqchip_in_use
static key brings more problems than it is worth it, and the feature
is mostly nonsense anyway.

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bf64fed9820e..c315bc1a4e9a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void);
 static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; };
 #endif
 
-DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
-
 extern unsigned int __ro_after_init kvm_sve_max_vl;
 extern unsigned int __ro_after_init kvm_host_sve_max_vl;
 int __init kvm_arm_init_sve(void);
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 879982b1cc73..1215df590418 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map)
 
 static inline bool userspace_irqchip(struct kvm *kvm)
 {
-	return static_branch_unlikely(&userspace_irqchip_in_use) &&
-		unlikely(!irqchip_in_kernel(kvm));
+	return unlikely(!irqchip_in_kernel(kvm));
 }
 
 static void soft_timer_start(struct hrtimer *hrt, u64 ns)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 48cafb65d6ac..70ff9a20ef3a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
 static bool vgic_present, kvm_arm_initialised;
 
 static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
-DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
 bool is_kvm_arm_initialised(void)
 {
@@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
-	if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm)))
-		static_branch_dec(&userspace_irqchip_in_use);
-
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
 	kvm_timer_vcpu_terminate(vcpu);
 	kvm_pmu_vcpu_destroy(vcpu);
@@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
-	if (!irqchip_in_kernel(kvm)) {
-		/*
-		 * Tell the rest of the code that there are userspace irqchip
-		 * VMs in the wild.
-		 */
-		static_branch_inc(&userspace_irqchip_in_use);
-	}
-
 	/*
 	 * Initialize traps for protected VMs.
 	 * NOTE: Move to run in EL2 directly, rather than via a hypercall, once
@@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
 	 * state gets updated in kvm_timer_update_run and
 	 * kvm_pmu_update_run below).
 	 */
-	if (static_branch_unlikely(&userspace_irqchip_in_use)) {
+	if (unlikely(!irqchip_in_kernel(vcpu->kvm))) {
 		if (kvm_timer_should_notify_user(vcpu) ||
 		    kvm_pmu_should_notify_user(vcpu)) {
 			*ret = -EINTR;
@@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 			vcpu->mode = OUTSIDE_GUEST_MODE;
 			isb(); /* Ensure work in x_flush_hwstate is committed */
 			kvm_pmu_sync_hwstate(vcpu);
-			if (static_branch_unlikely(&userspace_irqchip_in_use))
+			if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
 				kvm_timer_sync_user(vcpu);
 			kvm_vgic_sync_hwstate(vcpu);
 			local_irq_enable();
@@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * we don't want vtimer interrupts to race with syncing the
 		 * timer virtual interrupt state.
 		 */
-		if (static_branch_unlikely(&userspace_irqchip_in_use))
+		if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
 			kvm_timer_sync_user(vcpu);
 
 		kvm_arch_vcpu_ctxsync_fp(vcpu);

I think this would fix the problem you're seeing without changing the
userspace view of an erroneous configuration. It would also pave the
way for the complete removal of the interrupt notification to
userspace, which I claim has no user and is just a shit idea.

Thanks,

	M.

Oliver Upton Oct. 26, 2024, 2:53 p.m. UTC | #3

On Sat, Oct 26, 2024 at 08:43:21AM +0100, Marc Zyngier wrote:
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index bf64fed9820e..c315bc1a4e9a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void);
>  static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; };
>  #endif
>  
> -DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
> -
>  extern unsigned int __ro_after_init kvm_sve_max_vl;
>  extern unsigned int __ro_after_init kvm_host_sve_max_vl;
>  int __init kvm_arm_init_sve(void);
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 879982b1cc73..1215df590418 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map)
>  
>  static inline bool userspace_irqchip(struct kvm *kvm)
>  {
> -	return static_branch_unlikely(&userspace_irqchip_in_use) &&
> -		unlikely(!irqchip_in_kernel(kvm));
> +	return unlikely(!irqchip_in_kernel(kvm));
>  }
>  
>  static void soft_timer_start(struct hrtimer *hrt, u64 ns)
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 48cafb65d6ac..70ff9a20ef3a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
>  static bool vgic_present, kvm_arm_initialised;
>  
>  static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
> -DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>  
>  bool is_kvm_arm_initialised(void)
>  {
> @@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>  
>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm)))
> -		static_branch_dec(&userspace_irqchip_in_use);
> -
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
>  	kvm_timer_vcpu_terminate(vcpu);
>  	kvm_pmu_vcpu_destroy(vcpu);
> @@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
>  			return ret;
>  	}
>  
> -	if (!irqchip_in_kernel(kvm)) {
> -		/*
> -		 * Tell the rest of the code that there are userspace irqchip
> -		 * VMs in the wild.
> -		 */
> -		static_branch_inc(&userspace_irqchip_in_use);
> -	}
> -
>  	/*
>  	 * Initialize traps for protected VMs.
>  	 * NOTE: Move to run in EL2 directly, rather than via a hypercall, once
> @@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
>  	 * state gets updated in kvm_timer_update_run and
>  	 * kvm_pmu_update_run below).
>  	 */
> -	if (static_branch_unlikely(&userspace_irqchip_in_use)) {
> +	if (unlikely(!irqchip_in_kernel(vcpu->kvm))) {
>  		if (kvm_timer_should_notify_user(vcpu) ||
>  		    kvm_pmu_should_notify_user(vcpu)) {
>  			*ret = -EINTR;
> @@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  			vcpu->mode = OUTSIDE_GUEST_MODE;
>  			isb(); /* Ensure work in x_flush_hwstate is committed */
>  			kvm_pmu_sync_hwstate(vcpu);
> -			if (static_branch_unlikely(&userspace_irqchip_in_use))
> +			if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
>  				kvm_timer_sync_user(vcpu);
>  			kvm_vgic_sync_hwstate(vcpu);
>  			local_irq_enable();
> @@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		 * we don't want vtimer interrupts to race with syncing the
>  		 * timer virtual interrupt state.
>  		 */
> -		if (static_branch_unlikely(&userspace_irqchip_in_use))
> +		if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
>  			kvm_timer_sync_user(vcpu);
>  
>  		kvm_arch_vcpu_ctxsync_fp(vcpu);
> 
> I think this would fix the problem you're seeing without changing the
> userspace view of an erroneous configuration. It would also pave the
> way for the complete removal of the interrupt notification to
> userspace, which I claim has no user and is just a shit idea.

Yeah, looks like this ought to get it done.

Even with a fix for this particular issue I do wonder if we should
categorically harden against late initialization failures and un-init
the vCPU (or bug VM, where necessary) to avoid dealing with half-baked
vCPUs/VMs across our UAPI surfaces.

A sane userspace will probably crash when KVM_RUN returns EINVAL anyway.

Raghavendra Rao Ananta Oct. 28, 2024, 4:43 p.m. UTC | #4

Hi Oliver,


On Sat, Oct 26, 2024 at 7:53 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Sat, Oct 26, 2024 at 08:43:21AM +0100, Marc Zyngier wrote:
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index bf64fed9820e..c315bc1a4e9a 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void);
> >  static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; };
> >  #endif
> >
> > -DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
> > -
> >  extern unsigned int __ro_after_init kvm_sve_max_vl;
> >  extern unsigned int __ro_after_init kvm_host_sve_max_vl;
> >  int __init kvm_arm_init_sve(void);
> > diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> > index 879982b1cc73..1215df590418 100644
> > --- a/arch/arm64/kvm/arch_timer.c
> > +++ b/arch/arm64/kvm/arch_timer.c
> > @@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map)
> >
> >  static inline bool userspace_irqchip(struct kvm *kvm)
> >  {
> > -     return static_branch_unlikely(&userspace_irqchip_in_use) &&
> > -             unlikely(!irqchip_in_kernel(kvm));
> > +     return unlikely(!irqchip_in_kernel(kvm));
> >  }
> >
> >  static void soft_timer_start(struct hrtimer *hrt, u64 ns)
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 48cafb65d6ac..70ff9a20ef3a 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
> >  static bool vgic_present, kvm_arm_initialised;
> >
> >  static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
> > -DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
> >
> >  bool is_kvm_arm_initialised(void)
> >  {
> > @@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
> >
> >  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> >  {
> > -     if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm)))
> > -             static_branch_dec(&userspace_irqchip_in_use);
> > -
> >       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
> >       kvm_timer_vcpu_terminate(vcpu);
> >       kvm_pmu_vcpu_destroy(vcpu);
> > @@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
> >                       return ret;
> >       }
> >
> > -     if (!irqchip_in_kernel(kvm)) {
> > -             /*
> > -              * Tell the rest of the code that there are userspace irqchip
> > -              * VMs in the wild.
> > -              */
> > -             static_branch_inc(&userspace_irqchip_in_use);
> > -     }
> > -
> >       /*
> >        * Initialize traps for protected VMs.
> >        * NOTE: Move to run in EL2 directly, rather than via a hypercall, once
> > @@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
> >        * state gets updated in kvm_timer_update_run and
> >        * kvm_pmu_update_run below).
> >        */
> > -     if (static_branch_unlikely(&userspace_irqchip_in_use)) {
> > +     if (unlikely(!irqchip_in_kernel(vcpu->kvm))) {
> >               if (kvm_timer_should_notify_user(vcpu) ||
> >                   kvm_pmu_should_notify_user(vcpu)) {
> >                       *ret = -EINTR;
> > @@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >                       vcpu->mode = OUTSIDE_GUEST_MODE;
> >                       isb(); /* Ensure work in x_flush_hwstate is committed */
> >                       kvm_pmu_sync_hwstate(vcpu);
> > -                     if (static_branch_unlikely(&userspace_irqchip_in_use))
> > +                     if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
> >                               kvm_timer_sync_user(vcpu);
> >                       kvm_vgic_sync_hwstate(vcpu);
> >                       local_irq_enable();
> > @@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >                * we don't want vtimer interrupts to race with syncing the
> >                * timer virtual interrupt state.
> >                */
> > -             if (static_branch_unlikely(&userspace_irqchip_in_use))
> > +             if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
> >                       kvm_timer_sync_user(vcpu);
> >
> >               kvm_arch_vcpu_ctxsync_fp(vcpu);
> >
> > I think this would fix the problem you're seeing without changing the
> > userspace view of an erroneous configuration. It would also pave the
> > way for the complete removal of the interrupt notification to
> > userspace, which I claim has no user and is just a shit idea.
>
> Yeah, looks like this ought to get it done.
>
> Even with a fix for this particular issue I do wonder if we should
> categorically harden against late initialization failures and un-init
> the vCPU (or bug VM, where necessary) to avoid dealing with half-baked
> vCPUs/VMs across our UAPI surfaces.
>
> A sane userspace will probably crash when KVM_RUN returns EINVAL anyway.

Thanks for the suggestion. Sure, I'll take another look at the
possible things that we can uninitialize and try to re-spin the patch.

Marc,

If you feel userspace_irqchip_in_use is not necessary anymore, and as
a quick fix to this issue, we can get rid of that independent of the
un-init effort.

Thank you.
Raghavendra

Oliver Upton Oct. 28, 2024, 5:12 p.m. UTC | #5

On Mon, Oct 28, 2024 at 09:43:45AM -0700, Raghavendra Rao Ananta wrote:
> On Sat, Oct 26, 2024 at 7:53 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > On Sat, Oct 26, 2024 at 08:43:21AM +0100, Marc Zyngier wrote:
> > > I think this would fix the problem you're seeing without changing the
> > > userspace view of an erroneous configuration. It would also pave the
> > > way for the complete removal of the interrupt notification to
> > > userspace, which I claim has no user and is just a shit idea.
> >
> > Yeah, looks like this ought to get it done.
> >
> > Even with a fix for this particular issue I do wonder if we should
> > categorically harden against late initialization failures and un-init
> > the vCPU (or bug VM, where necessary) to avoid dealing with half-baked
> > vCPUs/VMs across our UAPI surfaces.
> >
> > A sane userspace will probably crash when KVM_RUN returns EINVAL anyway.
> 
> Thanks for the suggestion. Sure, I'll take another look at the
> possible things that we can uninitialize and try to re-spin the patch.
> 
> Marc,
> 
> If you feel userspace_irqchip_in_use is not necessary anymore, and as
> a quick fix to this issue, we can get rid of that independent of the
> un-init effort.

It's a good cleanup to begin with, even better that it fixes a genuine
bug.

Raghu, could you please test Marc's diff and send it as a patch (w/
correct attribution) if it works? I'm willing to bet that we have more
init/uninit bugs lurking, so we can still follow up w/ robustness
improvements once we're happy w/ the shape of them.

KVM: arm64: Mark the VM as dead for failed initializations

Commit Message

Comments

Patch