Message ID | 20241025221220.2985227-1-rananta@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: arm64: Mark the VM as dead for failed initializations | expand |
Hi Raghu, Thanks for posting this fix. On Fri, Oct 25, 2024 at 10:12:20PM +0000, Raghavendra Rao Ananta wrote: > Syzbot hit the following WARN_ON() in kvm_timer_update_irq(): > > WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459 > kvm_timer_update_irq+0x21c/0x394 > Call trace: > kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459 > kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968 > kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264 > kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline] > kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline] > kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695 > kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:907 [inline] > __se_sys_ioctl fs/ioctl.c:893 [inline] > __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893 > __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] > invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49 > el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132 > do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151 > el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712 > el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 > el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 > > The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is > invoked after a failed first KVM_RUN. In a general sense though, since > kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past > initiatializations, it's possible that the VM's state could be left typo: initializations > half-baked. Any upcoming ioctls could behave erroneously because of > this. You may want to highlight a bit more strongly that, despite the name, we do a lot of late *VM* state initialization in kvm_arch_vcpu_run_pid_change(). When that goes sideways we're left with few choices besides bugging the VM or gracefully tearing down state, potentially w/ concurrent users. > Since these late vCPU initializations is past the point of attributing > the failures to any ioctl, instead of tearing down each of the previous > setups, simply mark the VM as dead, gving an opportunity for the > userspace to close and try again. > > Cc: <stable@vger.kernel.org> > Reported-by: syzbot <syzkaller@googlegroups.com> > Suggested-by: Oliver Upton <oliver.upton@linux.dev> I definitely recommended this to you, so blame *me* for imposing some toil on you with the following. > @@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) > > ret = kvm_timer_enable(vcpu); > if (ret) > - return ret; > + goto out_err; > > ret = kvm_arm_pmu_v3_enable(vcpu); > if (ret) > - return ret; > + goto out_err; > > if (is_protected_kvm_enabled()) { > ret = pkvm_create_hyp_vm(kvm); > if (ret) > - return ret; > + goto out_err; > } > > if (!irqchip_in_kernel(kvm)) { > @@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) > mutex_unlock(&kvm->arch.config_lock); > > return ret; > + > +out_err: > + kvm_vm_dead(kvm); > + return ret; > } After rereading, I think we could benefit from a more distinct separation of late VM vs. vCPU state initialization. Bugging the VM is a big hammer, we should probably only resort to that when the VM state is screwed up badly. Otherwise, for screwed up vCPU state we could uninitialize the vCPU and let userspace try again. An example of this is how we deal with VMs that run 32 bit userspace when KVM tries to hide the feature. WDYT?
Hi both, On Sat, 26 Oct 2024 06:34:23 +0100, Oliver Upton <oliver.upton@linux.dev> wrote: > > Hi Raghu, > > Thanks for posting this fix. > > On Fri, Oct 25, 2024 at 10:12:20PM +0000, Raghavendra Rao Ananta wrote: > > Syzbot hit the following WARN_ON() in kvm_timer_update_irq(): > > > > WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459 > > kvm_timer_update_irq+0x21c/0x394 > > Call trace: > > kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459 > > kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968 > > kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264 > > kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline] > > kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline] > > kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695 > > kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658 > > vfs_ioctl fs/ioctl.c:51 [inline] > > __do_sys_ioctl fs/ioctl.c:907 [inline] > > __se_sys_ioctl fs/ioctl.c:893 [inline] > > __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893 > > __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] > > invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49 > > el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132 > > do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151 > > el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712 > > el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 > > el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 > > > > The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is > > invoked after a failed first KVM_RUN. In a general sense though, since > > kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past > > initiatializations, it's possible that the VM's state could be left > > typo: initializations > > > half-baked. Any upcoming ioctls could behave erroneously because of > > this. > > You may want to highlight a bit more strongly that, despite the name, > we do a lot of late *VM* state initialization in kvm_arch_vcpu_run_pid_change(). > > When that goes sideways we're left with few choices besides bugging the > VM or gracefully tearing down state, potentially w/ concurrent users. > > > Since these late vCPU initializations is past the point of attributing > > the failures to any ioctl, instead of tearing down each of the previous > > setups, simply mark the VM as dead, gving an opportunity for the > > userspace to close and try again. > > > > Cc: <stable@vger.kernel.org> > > Reported-by: syzbot <syzkaller@googlegroups.com> > > Suggested-by: Oliver Upton <oliver.upton@linux.dev> > > I definitely recommended this to you, so blame *me* for imposing some > toil on you with the following. > > > @@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) > > > > ret = kvm_timer_enable(vcpu); > > if (ret) > > - return ret; > > + goto out_err; > > > > ret = kvm_arm_pmu_v3_enable(vcpu); > > if (ret) > > - return ret; > > + goto out_err; > > > > if (is_protected_kvm_enabled()) { > > ret = pkvm_create_hyp_vm(kvm); > > if (ret) > > - return ret; > > + goto out_err; > > } > > > > if (!irqchip_in_kernel(kvm)) { > > @@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) > > mutex_unlock(&kvm->arch.config_lock); > > > > return ret; > > + > > +out_err: > > + kvm_vm_dead(kvm); > > + return ret; > > } > > After rereading, I think we could benefit from a more distinct separation > of late VM vs. vCPU state initialization. > > Bugging the VM is a big hammer, we should probably only resort to that > when the VM state is screwed up badly. > > Otherwise, for screwed up vCPU state we could uninitialize the vCPU and > let userspace try again. An example of this is how we deal with VMs that > run 32 bit userspace when KVM tries to hide the feature. I tend to agree. We shouldn't make a misconfiguration fatal unless we know for sure that the state is not recoverable (such as a screwed up memory map). When it comes to this particular issue, I wonder why the (maybe overly simplistic) hack below isn't enough. The userspace_irqchip_in_use static key brings more problems than it is worth it, and the feature is mostly nonsense anyway. diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index bf64fed9820e..c315bc1a4e9a 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void); static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; }; #endif -DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use); - extern unsigned int __ro_after_init kvm_sve_max_vl; extern unsigned int __ro_after_init kvm_host_sve_max_vl; int __init kvm_arm_init_sve(void); diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 879982b1cc73..1215df590418 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map) static inline bool userspace_irqchip(struct kvm *kvm) { - return static_branch_unlikely(&userspace_irqchip_in_use) && - unlikely(!irqchip_in_kernel(kvm)); + return unlikely(!irqchip_in_kernel(kvm)); } static void soft_timer_start(struct hrtimer *hrt, u64 ns) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 48cafb65d6ac..70ff9a20ef3a 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt); static bool vgic_present, kvm_arm_initialised; static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized); -DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use); bool is_kvm_arm_initialised(void) { @@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) { - if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm))) - static_branch_dec(&userspace_irqchip_in_use); - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache); kvm_timer_vcpu_terminate(vcpu); kvm_pmu_vcpu_destroy(vcpu); @@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) return ret; } - if (!irqchip_in_kernel(kvm)) { - /* - * Tell the rest of the code that there are userspace irqchip - * VMs in the wild. - */ - static_branch_inc(&userspace_irqchip_in_use); - } - /* * Initialize traps for protected VMs. * NOTE: Move to run in EL2 directly, rather than via a hypercall, once @@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret) * state gets updated in kvm_timer_update_run and * kvm_pmu_update_run below). */ - if (static_branch_unlikely(&userspace_irqchip_in_use)) { + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) { if (kvm_timer_should_notify_user(vcpu) || kvm_pmu_should_notify_user(vcpu)) { *ret = -EINTR; @@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) vcpu->mode = OUTSIDE_GUEST_MODE; isb(); /* Ensure work in x_flush_hwstate is committed */ kvm_pmu_sync_hwstate(vcpu); - if (static_branch_unlikely(&userspace_irqchip_in_use)) + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) kvm_timer_sync_user(vcpu); kvm_vgic_sync_hwstate(vcpu); local_irq_enable(); @@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) * we don't want vtimer interrupts to race with syncing the * timer virtual interrupt state. */ - if (static_branch_unlikely(&userspace_irqchip_in_use)) + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) kvm_timer_sync_user(vcpu); kvm_arch_vcpu_ctxsync_fp(vcpu); I think this would fix the problem you're seeing without changing the userspace view of an erroneous configuration. It would also pave the way for the complete removal of the interrupt notification to userspace, which I claim has no user and is just a shit idea. Thanks, M.
On Sat, Oct 26, 2024 at 08:43:21AM +0100, Marc Zyngier wrote: > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index bf64fed9820e..c315bc1a4e9a 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void); > static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; }; > #endif > > -DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use); > - > extern unsigned int __ro_after_init kvm_sve_max_vl; > extern unsigned int __ro_after_init kvm_host_sve_max_vl; > int __init kvm_arm_init_sve(void); > diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c > index 879982b1cc73..1215df590418 100644 > --- a/arch/arm64/kvm/arch_timer.c > +++ b/arch/arm64/kvm/arch_timer.c > @@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map) > > static inline bool userspace_irqchip(struct kvm *kvm) > { > - return static_branch_unlikely(&userspace_irqchip_in_use) && > - unlikely(!irqchip_in_kernel(kvm)); > + return unlikely(!irqchip_in_kernel(kvm)); > } > > static void soft_timer_start(struct hrtimer *hrt, u64 ns) > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index 48cafb65d6ac..70ff9a20ef3a 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt); > static bool vgic_present, kvm_arm_initialised; > > static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized); > -DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use); > > bool is_kvm_arm_initialised(void) > { > @@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) > > void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) > { > - if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm))) > - static_branch_dec(&userspace_irqchip_in_use); > - > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache); > kvm_timer_vcpu_terminate(vcpu); > kvm_pmu_vcpu_destroy(vcpu); > @@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) > return ret; > } > > - if (!irqchip_in_kernel(kvm)) { > - /* > - * Tell the rest of the code that there are userspace irqchip > - * VMs in the wild. > - */ > - static_branch_inc(&userspace_irqchip_in_use); > - } > - > /* > * Initialize traps for protected VMs. > * NOTE: Move to run in EL2 directly, rather than via a hypercall, once > @@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret) > * state gets updated in kvm_timer_update_run and > * kvm_pmu_update_run below). > */ > - if (static_branch_unlikely(&userspace_irqchip_in_use)) { > + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) { > if (kvm_timer_should_notify_user(vcpu) || > kvm_pmu_should_notify_user(vcpu)) { > *ret = -EINTR; > @@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) > vcpu->mode = OUTSIDE_GUEST_MODE; > isb(); /* Ensure work in x_flush_hwstate is committed */ > kvm_pmu_sync_hwstate(vcpu); > - if (static_branch_unlikely(&userspace_irqchip_in_use)) > + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) > kvm_timer_sync_user(vcpu); > kvm_vgic_sync_hwstate(vcpu); > local_irq_enable(); > @@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) > * we don't want vtimer interrupts to race with syncing the > * timer virtual interrupt state. > */ > - if (static_branch_unlikely(&userspace_irqchip_in_use)) > + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) > kvm_timer_sync_user(vcpu); > > kvm_arch_vcpu_ctxsync_fp(vcpu); > > I think this would fix the problem you're seeing without changing the > userspace view of an erroneous configuration. It would also pave the > way for the complete removal of the interrupt notification to > userspace, which I claim has no user and is just a shit idea. Yeah, looks like this ought to get it done. Even with a fix for this particular issue I do wonder if we should categorically harden against late initialization failures and un-init the vCPU (or bug VM, where necessary) to avoid dealing with half-baked vCPUs/VMs across our UAPI surfaces. A sane userspace will probably crash when KVM_RUN returns EINVAL anyway.
Hi Oliver, On Sat, Oct 26, 2024 at 7:53 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Sat, Oct 26, 2024 at 08:43:21AM +0100, Marc Zyngier wrote: > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > index bf64fed9820e..c315bc1a4e9a 100644 > > --- a/arch/arm64/include/asm/kvm_host.h > > +++ b/arch/arm64/include/asm/kvm_host.h > > @@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void); > > static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; }; > > #endif > > > > -DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use); > > - > > extern unsigned int __ro_after_init kvm_sve_max_vl; > > extern unsigned int __ro_after_init kvm_host_sve_max_vl; > > int __init kvm_arm_init_sve(void); > > diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c > > index 879982b1cc73..1215df590418 100644 > > --- a/arch/arm64/kvm/arch_timer.c > > +++ b/arch/arm64/kvm/arch_timer.c > > @@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map) > > > > static inline bool userspace_irqchip(struct kvm *kvm) > > { > > - return static_branch_unlikely(&userspace_irqchip_in_use) && > > - unlikely(!irqchip_in_kernel(kvm)); > > + return unlikely(!irqchip_in_kernel(kvm)); > > } > > > > static void soft_timer_start(struct hrtimer *hrt, u64 ns) > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > index 48cafb65d6ac..70ff9a20ef3a 100644 > > --- a/arch/arm64/kvm/arm.c > > +++ b/arch/arm64/kvm/arm.c > > @@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt); > > static bool vgic_present, kvm_arm_initialised; > > > > static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized); > > -DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use); > > > > bool is_kvm_arm_initialised(void) > > { > > @@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) > > > > void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) > > { > > - if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm))) > > - static_branch_dec(&userspace_irqchip_in_use); > > - > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache); > > kvm_timer_vcpu_terminate(vcpu); > > kvm_pmu_vcpu_destroy(vcpu); > > @@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) > > return ret; > > } > > > > - if (!irqchip_in_kernel(kvm)) { > > - /* > > - * Tell the rest of the code that there are userspace irqchip > > - * VMs in the wild. > > - */ > > - static_branch_inc(&userspace_irqchip_in_use); > > - } > > - > > /* > > * Initialize traps for protected VMs. > > * NOTE: Move to run in EL2 directly, rather than via a hypercall, once > > @@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret) > > * state gets updated in kvm_timer_update_run and > > * kvm_pmu_update_run below). > > */ > > - if (static_branch_unlikely(&userspace_irqchip_in_use)) { > > + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) { > > if (kvm_timer_should_notify_user(vcpu) || > > kvm_pmu_should_notify_user(vcpu)) { > > *ret = -EINTR; > > @@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) > > vcpu->mode = OUTSIDE_GUEST_MODE; > > isb(); /* Ensure work in x_flush_hwstate is committed */ > > kvm_pmu_sync_hwstate(vcpu); > > - if (static_branch_unlikely(&userspace_irqchip_in_use)) > > + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) > > kvm_timer_sync_user(vcpu); > > kvm_vgic_sync_hwstate(vcpu); > > local_irq_enable(); > > @@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) > > * we don't want vtimer interrupts to race with syncing the > > * timer virtual interrupt state. > > */ > > - if (static_branch_unlikely(&userspace_irqchip_in_use)) > > + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) > > kvm_timer_sync_user(vcpu); > > > > kvm_arch_vcpu_ctxsync_fp(vcpu); > > > > I think this would fix the problem you're seeing without changing the > > userspace view of an erroneous configuration. It would also pave the > > way for the complete removal of the interrupt notification to > > userspace, which I claim has no user and is just a shit idea. > > Yeah, looks like this ought to get it done. > > Even with a fix for this particular issue I do wonder if we should > categorically harden against late initialization failures and un-init > the vCPU (or bug VM, where necessary) to avoid dealing with half-baked > vCPUs/VMs across our UAPI surfaces. > > A sane userspace will probably crash when KVM_RUN returns EINVAL anyway. Thanks for the suggestion. Sure, I'll take another look at the possible things that we can uninitialize and try to re-spin the patch. Marc, If you feel userspace_irqchip_in_use is not necessary anymore, and as a quick fix to this issue, we can get rid of that independent of the un-init effort. Thank you. Raghavendra
On Mon, Oct 28, 2024 at 09:43:45AM -0700, Raghavendra Rao Ananta wrote: > On Sat, Oct 26, 2024 at 7:53 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Sat, Oct 26, 2024 at 08:43:21AM +0100, Marc Zyngier wrote: > > > I think this would fix the problem you're seeing without changing the > > > userspace view of an erroneous configuration. It would also pave the > > > way for the complete removal of the interrupt notification to > > > userspace, which I claim has no user and is just a shit idea. > > > > Yeah, looks like this ought to get it done. > > > > Even with a fix for this particular issue I do wonder if we should > > categorically harden against late initialization failures and un-init > > the vCPU (or bug VM, where necessary) to avoid dealing with half-baked > > vCPUs/VMs across our UAPI surfaces. > > > > A sane userspace will probably crash when KVM_RUN returns EINVAL anyway. > > Thanks for the suggestion. Sure, I'll take another look at the > possible things that we can uninitialize and try to re-spin the patch. > > Marc, > > If you feel userspace_irqchip_in_use is not necessary anymore, and as > a quick fix to this issue, we can get rid of that independent of the > un-init effort. It's a good cleanup to begin with, even better that it fixes a genuine bug. Raghu, could you please test Marc's diff and send it as a patch (w/ correct attribution) if it works? I'm willing to bet that we have more init/uninit bugs lurking, so we can still follow up w/ robustness improvements once we're happy w/ the shape of them.
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index a0d01c46e4084..ae3551bc98aeb 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -821,12 +821,12 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) */ ret = kvm_vgic_map_resources(kvm); if (ret) - return ret; + goto out_err; } ret = kvm_finalize_sys_regs(vcpu); if (ret) - return ret; + goto out_err; /* * This needs to happen after any restriction has been applied @@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) ret = kvm_timer_enable(vcpu); if (ret) - return ret; + goto out_err; ret = kvm_arm_pmu_v3_enable(vcpu); if (ret) - return ret; + goto out_err; if (is_protected_kvm_enabled()) { ret = pkvm_create_hyp_vm(kvm); if (ret) - return ret; + goto out_err; } if (!irqchip_in_kernel(kvm)) { @@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) mutex_unlock(&kvm->arch.config_lock); return ret; + +out_err: + kvm_vm_dead(kvm); + return ret; } bool kvm_arch_intc_initialized(struct kvm *kvm)
Syzbot hit the following WARN_ON() in kvm_timer_update_irq(): WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459 kvm_timer_update_irq+0x21c/0x394 Call trace: kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459 kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968 kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264 kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline] kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline] kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695 kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl fs/ioctl.c:893 [inline] __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132 do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151 el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is invoked after a failed first KVM_RUN. In a general sense though, since kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past initiatializations, it's possible that the VM's state could be left half-baked. Any upcoming ioctls could behave erroneously because of this. Since these late vCPU initializations is past the point of attributing the failures to any ioctl, instead of tearing down each of the previous setups, simply mark the VM as dead, gving an opportunity for the userspace to close and try again. Cc: <stable@vger.kernel.org> Reported-by: syzbot <syzkaller@googlegroups.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> --- arch/arm64/kvm/arm.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)