Message ID | 20241009183603.3221824-1-maz@kernel.org (mailing list archive)
---|---
State | New, archived
Series | KVM: arm64: Don't eagerly teardown the vgic on init error
On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
>
> This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.

Huh, I'm definitely missing something. Could you remind me where we open up this race between KVM_RUN && kvm_vgic_create()?

I'd thought the fact that the latter takes all the vCPU mutexes and checks if any vCPU in the VM has run would be enough to guard against such a race, but clearly not...

> Similarly, a thread running vcpu-1 can enter the kernel, and try to init the GIC that was previously created. Since this GIC isn't properly configured (no memory map), it fails to correctly initialise.
>
> And that's the point where we decide to tear down the GIC, freeing all its resources. Behind vcpu-0's back. Things stop pretty abruptly, with a variety of symptoms. Clearly, this isn't good, we should be a bit more careful about this.
>
> It is obvious that this guest is not viable, as it is missing some important part of its configuration. So instead of trying to tear bits of it down, let's just mark it as *dead*. It means that any further interaction from userspace will result in -EIO. The memory will be released on the "normal" path, when userspace gives up.
>
> Cc: stable@vger.kernel.org
> Reported-by: Alexander Potapenko <glider@google.com>
> Signed-off-by: Marc Zyngier <maz@kernel.org>

Anyway, regardless of *how* we got here, it is pretty clear that tearing things down on the error path is a bad idea. So:

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>

> ---
>  arch/arm64/kvm/arm.c            | 3 +++
>  arch/arm64/kvm/vgic/vgic-init.c | 6 +++---
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a0d01c46e4084..b97ada19f06a7 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -997,6 +997,9 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
>  static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  {
>  	if (kvm_request_pending(vcpu)) {
> +		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> +			return -EIO;
> +
>  		if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
>  			kvm_vcpu_sleep(vcpu);
>
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index e7c53e8af3d16..c4cbf798e71a4 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -536,10 +536,10 @@ int kvm_vgic_map_resources(struct kvm *kvm)
>  out:
>  	mutex_unlock(&kvm->arch.config_lock);
>  out_slots:
> -	mutex_unlock(&kvm->slots_lock);
> -
>  	if (ret)
> -		kvm_vgic_destroy(kvm);
> +		kvm_vm_dead(kvm);
> +
> +	mutex_unlock(&kvm->slots_lock);
>
>  	return ret;
>  }
> --
> 2.39.2
On Wed, Oct 09, 2024, Oliver Upton wrote:
> On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> > As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
> >
> > This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.
>
> Huh, I'm definitely missing something. Could you remind me where we open up this race between KVM_RUN && kvm_vgic_create()?
>
> I'd thought the fact that the latter takes all the vCPU mutexes and checks if any vCPU in the VM has run would be enough to guard against such a race, but clearly not...

Any chance that fixing bugs where vCPU0 can be accessed (and run!) before it's fully online helps? E.g. if that closes the vCPU0 hole, maybe the vCPU1 case can be handled a bit more gracefully?

[*] https://lore.kernel.org/all/20241009150455.1057573-1-seanjc@google.com

> > Similarly, a thread running vcpu-1 can enter the kernel, and try to init the GIC that was previously created. Since this GIC isn't properly configured (no memory map), it fails to correctly initialise.
> >
> > And that's the point where we decide to tear down the GIC, freeing all its resources. Behind vcpu-0's back. Things stop pretty abruptly, with a variety of symptoms. Clearly, this isn't good, we should be a bit more careful about this.
> >
> > It is obvious that this guest is not viable, as it is missing some important part of its configuration. So instead of trying to tear bits of it down, let's just mark it as *dead*. It means that any further interaction from userspace will result in -EIO. The memory will be released on the "normal" path, when userspace gives up.
On Wed, Oct 09, 2024 at 12:36:32PM -0700, Sean Christopherson wrote:
> On Wed, Oct 09, 2024, Oliver Upton wrote:
> > On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> > > As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
> > >
> > > This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.
> >
> > Huh, I'm definitely missing something. Could you remind me where we open up this race between KVM_RUN && kvm_vgic_create()?

Ah, duh, I see it now. kvm_arch_vcpu_run_pid_change() doesn't serialize on a VM lock, and kvm_vgic_map_resources() has an early return for vgic_ready() letting it blow straight past the config_lock.

Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.

If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.

Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.

> > I'd thought the fact that the latter takes all the vCPU mutexes and checks if any vCPU in the VM has run would be enough to guard against such a race, but clearly not...
>
> Any chance that fixing bugs where vCPU0 can be accessed (and run!) before it's fully online helps?

That's an equally gross bug, but kvm_vgic_create() should still be safe w.r.t. vCPU creation since both hold the kvm->lock in the right spot. That is, since kvm_vgic_create() is called under the lock any vCPUs visible to userspace should exist in the vCPU xarray.

The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.

> E.g. if that closes the vCPU0 hole, maybe the vCPU1 case can be handled a bit more gracefully?

I think this is about as graceful as we can be. The sorts of screw-ups that precipitate this error handling may involve stupidity across several KVM ioctls, meaning it is highly unlikely to be attributable / recoverable.
On Wed, Oct 09, 2024 at 11:27:52PM +0000, Oliver Upton wrote:
> On Wed, Oct 09, 2024 at 12:36:32PM -0700, Sean Christopherson wrote:
> > On Wed, Oct 09, 2024, Oliver Upton wrote:
> > > On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> > > > As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
> > > >
> > > > This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.
> > >
> > > Huh, I'm definitely missing something. Could you remind me where we open up this race between KVM_RUN && kvm_vgic_create()?
>
> Ah, duh, I see it now. kvm_arch_vcpu_run_pid_change() doesn't serialize on a VM lock, and kvm_vgic_map_resources() has an early return for vgic_ready() letting it blow straight past the config_lock.
>
> Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the

a theoretical bug, that is. In practice the window to race against likely isn't big enough to get the in-guest vCPU to the point of poking the halfway-initialized distributor.

> distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.
>
> If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
>
> Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
>
> > > I'd thought the fact that the latter takes all the vCPU mutexes and checks if any vCPU in the VM has run would be enough to guard against such a race, but clearly not...
> >
> > Any chance that fixing bugs where vCPU0 can be accessed (and run!) before it's fully online helps?
>
> That's an equally gross bug, but kvm_vgic_create() should still be safe w.r.t. vCPU creation since both hold the kvm->lock in the right spot. That is, since kvm_vgic_create() is called under the lock any vCPUs visible to userspace should exist in the vCPU xarray.
>
> The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
>
> > E.g. if that closes the vCPU0 hole, maybe the vCPU1 case can be handled a bit more gracefully?
>
> I think this is about as graceful as we can be. The sorts of screw-ups that precipitate this error handling may involve stupidity across several KVM ioctls, meaning it is highly unlikely to be attributable / recoverable.
>
> --
> Thanks,
> Oliver
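For readers trying to picture the window being described: the shape of the pre-fix code, reconstructed from this discussion and from the hunk in the patch itself, is roughly the following. This is a paraphrase for illustration, not a verbatim quote of arch/arm64/kvm/vgic/vgic-init.c, and the elided body (plus the gotos targeting the labels) is only summarised in comments.

/* Rough sketch of the pre-fix flow (paraphrased, not verbatim kernel code). */
int kvm_vgic_map_resources(struct kvm *kvm)
{
	int ret = 0;

	/*
	 * Lockless fast path: a vCPU can observe "ready" without any
	 * serialisation against another thread that is still initialising
	 * the vgic; this is the early return mentioned above.
	 */
	if (vgic_ready(kvm))
		return 0;

	mutex_lock(&kvm->slots_lock);
	mutex_lock(&kvm->arch.config_lock);

	/*
	 * ... initialise the vgic and register the distributor MMIO region,
	 * setting ret and jumping to the labels below on failure ...
	 */

out:
	mutex_unlock(&kvm->arch.config_lock);
out_slots:
	mutex_unlock(&kvm->slots_lock);

	if (ret)
		kvm_vgic_destroy(kvm);	/* pre-fix: tears the vgic down behind vcpu-0's back */

	return ret;
}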
On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Wed, Oct 09, 2024 at 12:36:32PM -0700, Sean Christopherson wrote:
> > On Wed, Oct 09, 2024, Oliver Upton wrote:
> > > On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> > > > As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
> > > >
> > > > This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.
> > >
> > > Huh, I'm definitely missing something. Could you remind me where we open up this race between KVM_RUN && kvm_vgic_create()?

Sorry, I sent the patch bombs away and decided to get my life back for the evening... Doesn't help that the commit message isn't very clear (if not wrong in some respects).

> Ah, duh, I see it now. kvm_arch_vcpu_run_pid_change() doesn't serialize on a VM lock, and kvm_vgic_map_resources() has an early return for vgic_ready() letting it blow straight past the config_lock.

That. The problem is not so much with the vgic creation (which doesn't do much) but with the vgic_init() part followed by the map_resources horror.

> Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.

I don't think that one is that bad. Userspace got us here, and they now see an MMIO exit for something that it is not prepared to handle. Suck it up and die (on a black size M t-shirt, please).

> If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.

Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.

> Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.

One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.

> > > I'd thought the fact that the latter takes all the vCPU mutexes and checks if any vCPU in the VM has run would be enough to guard against such a race, but clearly not...
> >
> > Any chance that fixing bugs where vCPU0 can be accessed (and run!) before it's fully online helps?
>
> That's an equally gross bug, but kvm_vgic_create() should still be safe w.r.t. vCPU creation since both hold the kvm->lock in the right spot. That is, since kvm_vgic_create() is called under the lock any vCPUs visible to userspace should exist in the vCPU xarray.
>
> The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.

I think this is symptomatic of a more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.

I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.

> > E.g. if that closes the vCPU0 hole, maybe the vCPU1 case can be handled a bit more gracefully?
>
> I think this is about as graceful as we can be. The sorts of screw-ups that precipitate this error handling may involve stupidity across several KVM ioctls, meaning it is highly unlikely to be attributable / recoverable.

That's my take as well. We're faced with luserspace that's out to get us, and by the time we're in the context of a vcpu, it is too late. I don't see how to fix this without mandating a UABI change.

Thanks,

	M.
On Thu, Oct 10, 2024 at 08:54:43AM +0100, Marc Zyngier wrote:
> On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
> > Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.
>
> I don't think that one is that bad. Userspace got us here, and they now see an MMIO exit for something that it is not prepared to handle. Suck it up and die (on a black size M t-shirt, please).

LOL, I'll remember that.

The situation I have in mind is a bit harder to blame on userspace, though. Supposing that the whole VM was set up correctly, multiple vCPUs entering KVM_RUN concurrently could cause this race and have 'unexpected' MMIO exits go out to userspace.

  vcpu-0                              vcpu-1
  ======                              ======
  kvm_vgic_map_resources()
    dist->ready = true
    mutex_unlock(config_lock)
                                      kvm_vgic_map_resources()
                                        if (vgic_ready())
                                          return 0

                                      < enter guest >
                                      typer = writel(0, GICD_CTLR)
                                      < data abort >
                                      kvm_io_bus_write(...) <= No GICD, out to userspace

  vgic_register_dist_iodev()

A small but stupid window to race with.

> > If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
>
> Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.
>
> > Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
>
> One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.

Pushing more of the initialization complexity into userspace feels like the right thing. Since we clearly have no idea what we're doing :)

> > The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
>
> I think this is symptomatic of a more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.
>
> I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.

Throwing it back at userspace would be nice, but unfortunately for ABI I think we need to block/spin vCPUs in the kernel til the VM is in fully working condition. A fragile userspace could explode for a 'spurious' EAGAIN/EINTR where there wasn't one before.
On Thu, 10 Oct 2024 09:47:04 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Thu, Oct 10, 2024 at 08:54:43AM +0100, Marc Zyngier wrote:
> > On Thu, 10 Oct 2024 00:27:46 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
> > > Then if we can't register the MMIO region for the distributor everything comes crashing down and a vCPU has made it into the KVM_RUN loop w/ the VGIC-shaped rug pulled out from under it. There's definitely another functional bug here where a vCPU's attempts to poke the distributor wind up reaching userspace as MMIO exits. But we can worry about that another day.
> >
> > I don't think that one is that bad. Userspace got us here, and they now see an MMIO exit for something that it is not prepared to handle. Suck it up and die (on a black size M t-shirt, please).
>
> LOL, I'll remember that.
>
> The situation I have in mind is a bit harder to blame on userspace, though. Supposing that the whole VM was set up correctly, multiple vCPUs entering KVM_RUN concurrently could cause this race and have 'unexpected' MMIO exits go out to userspace.
>
>   vcpu-0                              vcpu-1
>   ======                              ======
>   kvm_vgic_map_resources()
>     dist->ready = true
>     mutex_unlock(config_lock)
>                                       kvm_vgic_map_resources()
>                                         if (vgic_ready())
>                                           return 0
>
>                                       < enter guest >
>                                       typer = writel(0, GICD_CTLR)
>                                       < data abort >
>                                       kvm_io_bus_write(...) <= No GICD, out to userspace
>
>   vgic_register_dist_iodev()
>
> A small but stupid window to race with.

Ah, gotcha. I guess getting rid of the early-out in kvm_vgic_map_resources() would plug that one. Want to post a fix for that?

> > > If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
> >
> > Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.
> >
> > > Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
> >
> > One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.
>
> Pushing more of the initialization complexity into userspace feels like the right thing. Since we clearly have no idea what we're doing :)

KVM APIv2?

> > > The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
> >
> > I think this is symptomatic of a more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.
> >
> > I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.
>
> Throwing it back at userspace would be nice, but unfortunately for ABI I think we need to block/spin vCPUs in the kernel til the VM is in fully working condition. A fragile userspace could explode for a 'spurious' EAGAIN/EINTR where there wasn't one before.

EINTR needs to be handled already, as this is how you report preemption by a signal.

But yeah, overall, I'm not enthralled with much so far...

	M.
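One possible shape of the "get rid of the early-out" idea being discussed above is to only check vgic_ready() once the locks are held, so nobody can observe the distributor as ready before its iodev is actually registered. This is a sketch of the idea only, under that assumption, and not the fix that eventually landed:

/* Sketch only: re-check readiness under the locks instead of lock-free. */
int kvm_vgic_map_resources(struct kvm *kvm)
{
	int ret = 0;

	mutex_lock(&kvm->slots_lock);
	mutex_lock(&kvm->arch.config_lock);

	if (vgic_ready(kvm))
		goto out;

	/*
	 * ... initialise the vgic and register the distributor MMIO region,
	 * and only mark the vgic ready once registration has succeeded ...
	 */

out:
	mutex_unlock(&kvm->arch.config_lock);
	mutex_unlock(&kvm->slots_lock);

	return ret;
}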
On Thu, Oct 10, 2024 at 01:47:05PM +0100, Marc Zyngier wrote:
> On Thu, 10 Oct 2024 09:47:04 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
> > A small but stupid window to race with.
>
> Ah, gotcha. I guess getting rid of the early-out in kvm_vgic_map_resources() would plug that one. Want to post a fix for that?

Yep, will do.

> > > > If memory serves, kvm_vgic_map_resources() used to do all of this behind the config_lock to cure the race, but that wound up inverting lock ordering on srcu.
> > >
> > > Probably something like that. We also used to hold the kvm lock, which made everything much simpler, but awfully wrong.
> > >
> > > > Note to self: Impose strict ordering on GIC initialization v. vCPU creation if/when we get a new flavor of irqchip.
> > >
> > > One of the things we should have done when introducing GICv3 is to impose that at KVM_DEV_ARM_VGIC_CTRL_INIT, the GIC memory map is final. I remember some push-back on the QEMU side of things, as they like to decouple things, but this has proved to be a nightmare.
> >
> > Pushing more of the initialization complexity into userspace feels like the right thing. Since we clearly have no idea what we're doing :)
>
> KVM APIv2?

Even better, we can just go straight to v3 and skip all the mistakes we would've made in v2.

> > > > The crappy assumption here is kvm_arch_vcpu_run_pid_change() and its callees are allowed to destroy VM-scoped structures in error handling.
> > >
> > > I think this is symptomatic of a more general issue: we perform VM-wide configuration in the context of a vcpu. We have tons of this stuff to paper over the lack of a "this VM is fully configured" barrier.
> > >
> > > I wonder whether we could sidestep things by punting the finalisation of the VM to a different context (workqueue?) and simply return -EAGAIN or -EINTR to userspace while we're processing it. That doesn't solve the "I'm missing parts of the address map and I'm going to die" part though.
> >
> > Throwing it back at userspace would be nice, but unfortunately for ABI I think we need to block/spin vCPUs in the kernel til the VM is in fully working condition. A fragile userspace could explode for a 'spurious' EAGAIN/EINTR where there wasn't one before.
>
> EINTR needs to be handled already, as this is how you report preemption by a signal.

Of course, I'm just assuming userspace is mean and will complain if no signal actually arrives.
On Wed, 09 Oct 2024 19:36:03 +0100, Marc Zyngier wrote:
> As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
>
> This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.
>
> [...]

Applied to fixes, thanks!

[1/1] KVM: arm64: Don't eagerly teardown the vgic on init error
      commit: df5fd75ee305cb5927e0b1a0b46cc988ad8db2b1

Cheers,

	M.
On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.

The vgic_init selftest has started failing in mainline on multiple platforms, with a bisect pointing at this patch which is present there as commit df5fd75ee305cb5. The test reports:

# selftests: kvm: vgic_init
# Random seed: 0x6b8b4567
# Running GIC_v3 tests.
# ==== Test Assertion Failure ====
#   lib/kvm_util.c:724: false
#   pid=1947 tid=1947 errno=5 - Input/output error
#      1	0x0000000000404edb: __vm_mem_region_delete at kvm_util.c:724 (discriminator 5)
#      2	0x0000000000405d0b: kvm_vm_free at kvm_util.c:762 (discriminator 12)
#      3	0x0000000000402d5f: vm_gic_destroy at vgic_init.c:101
#      4	 (inlined by) test_vcpus_then_vgic at vgic_init.c:368
#      5	 (inlined by) run_tests at vgic_init.c:720
#      6	0x0000000000401a6f: main at vgic_init.c:748
#      7	0x0000ffffa7b37543: ?? ??:0
#      8	0x0000ffffa7b37617: ?? ??:0
#      9	0x0000000000401b6f: _start at ??:?
# KVM killed/bugged the VM, check the kernel log for clues
not ok 10 selftests: kvm: vgic_init # exit=254

which does rather look like a test bug rather than a problem in the change itself. Full log from one run on synquacer at:

   https://validation.linaro.org/scheduler/job/4108424#L1846

bisect log (using a current mainline build of kselftests rather than one matching the tested kernel to avoid the build issues):

git bisect start
# status: waiting for both good and bad commits
# bad: [c2ee9f594da826bea183ed14f2cc029c719bf4da] KVM: selftests: Fix build on on non-x86 architectures
git bisect bad c2ee9f594da826bea183ed14f2cc029c719bf4da
# status: waiting for good commit(s), bad commit known
# good: [9852d85ec9d492ebef56dc5f229416c925758edc] Linux 6.12-rc1
git bisect good 9852d85ec9d492ebef56dc5f229416c925758edc
# good: [a1029768f3931b31aa52790f1dde0c7d6a6552eb] Merge tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
git bisect good a1029768f3931b31aa52790f1dde0c7d6a6552eb
# good: [b1b46751671be5a426982f037a47ae05f37ff80b] mm: fix follow_pfnmap API lockdep assert
git bisect good b1b46751671be5a426982f037a47ae05f37ff80b
# good: [531643fcd98c8d045d72a05cb0aaf49e5a4bdf5c] Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good 531643fcd98c8d045d72a05cb0aaf49e5a4bdf5c
# good: [c55228220dd33e7627ad9736b6fce4df5e7eac98] Merge tag 'char-misc-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect good c55228220dd33e7627ad9736b6fce4df5e7eac98
# good: [c1bc09d7bfcbe90c6df3a630ec1fb0fcd4799236] Merge tag 'probes-fixes-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect good c1bc09d7bfcbe90c6df3a630ec1fb0fcd4799236
# bad: [5978d4ec7e82ffc472ac2645601dd10b09e61b0f] KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
git bisect bad 5978d4ec7e82ffc472ac2645601dd10b09e61b0f
# good: [6ded46b5a4fd7fc9c6104b770627043aaf996abf] KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out
git bisect good 6ded46b5a4fd7fc9c6104b770627043aaf996abf
# good: [d4a89e5aee23eaebdc45f63cb3d6d5917ff6acf4] KVM: arm64: Expose S1PIE to guests
git bisect good d4a89e5aee23eaebdc45f63cb3d6d5917ff6acf4
# bad: [afa9b48f327c9ef36bfba4c643a29385a633252b] KVM: arm64: Shave a few bytes from the EL2 idmap code
git bisect bad afa9b48f327c9ef36bfba4c643a29385a633252b
# bad: [df5fd75ee305cb5927e0b1a0b46cc988ad8db2b1] KVM: arm64: Don't eagerly teardown the vgic on init error
git bisect bad df5fd75ee305cb5927e0b1a0b46cc988ad8db2b1
# first bad commit: [df5fd75ee305cb5927e0b1a0b46cc988ad8db2b1] KVM: arm64: Don't eagerly teardown the vgic on init error
On Thu, 24 Oct 2024 17:12:22 +0100, Mark Brown <broonie@kernel.org> wrote:
>
> On Wed, Oct 09, 2024 at 07:36:03PM +0100, Marc Zyngier wrote:
> > As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.
>
> The vgic_init selftest has started failing in mainline on multiple platforms, with a bisect pointing at this patch which is present there as commit df5fd75ee305cb5. The test reports:
>
> # selftests: kvm: vgic_init
> # Random seed: 0x6b8b4567
> # Running GIC_v3 tests.
> # ==== Test Assertion Failure ====
> #   lib/kvm_util.c:724: false
> #   pid=1947 tid=1947 errno=5 - Input/output error
> #      1	0x0000000000404edb: __vm_mem_region_delete at kvm_util.c:724 (discriminator 5)
> #      2	0x0000000000405d0b: kvm_vm_free at kvm_util.c:762 (discriminator 12)
> #      3	0x0000000000402d5f: vm_gic_destroy at vgic_init.c:101
> #      4	 (inlined by) test_vcpus_then_vgic at vgic_init.c:368
> #      5	 (inlined by) run_tests at vgic_init.c:720
> #      6	0x0000000000401a6f: main at vgic_init.c:748
> #      7	0x0000ffffa7b37543: ?? ??:0
> #      8	0x0000ffffa7b37617: ?? ??:0
> #      9	0x0000000000401b6f: _start at ??:?
> # KVM killed/bugged the VM, check the kernel log for clues
> not ok 10 selftests: kvm: vgic_init # exit=254
>
> which does rather look like a test bug rather than a problem in the change itself.

Well, the test tries to do braindead things, and then the test infrastructure seems surprised that KVM tells it to bugger off...

I can paper over it with this (see below), but frankly, someone who actually cares about this crap should take a look (and ownership).

	M.

diff --git a/tools/testing/selftests/kvm/aarch64/vgic_init.c b/tools/testing/selftests/kvm/aarch64/vgic_init.c
index b3b5fb0ff0a9..60b076ae85ea 100644
--- a/tools/testing/selftests/kvm/aarch64/vgic_init.c
+++ b/tools/testing/selftests/kvm/aarch64/vgic_init.c
@@ -742,6 +742,8 @@ int main(int ac, char **av)
 	pa_bits = vm_guest_mode_params[VM_MODE_DEFAULT].pa_bits;
 	max_phys_size = 1ULL << pa_bits;
 
+	allow_ioctl_returning_eio = true;
+
 	ret = test_kvm_device(KVM_DEV_TYPE_ARM_VGIC_V3);
 	if (!ret) {
 		pr_info("Running GIC_v3 tests.\n");
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index bc7c242480d6..988c68a77ce3 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -298,6 +298,8 @@ static __always_inline void static_assert_is_vm(struct kvm_vm *vm) { }
 	kvm_do_ioctl((vm)->fd, cmd, arg);				\
 })
 
+extern bool allow_ioctl_returning_eio;
+
 /*
  * Assert that a VM or vCPU ioctl() succeeded, with extra magic to detect if
  * the ioctl() failed because KVM killed/bugged the VM. To detect a dead VM,
@@ -313,6 +315,8 @@ do {									\
 	static_assert_is_vm(vm);					\
 									\
 	if (cond)							\
+		break;							\
+	if (errno == EIO && allow_ioctl_returning_eio)			\
 		break;							\
 									\
 	if (errno == EIO &&						\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a2b7df5f1d39..d9834421ab12 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -25,6 +25,8 @@ static uint32_t last_guest_seed;
 
 static int vcpu_mmap_sz(void);
 
+bool allow_ioctl_returning_eio = false;
+
 int open_path_or_exit(const char *path, int flags)
 {
 	int fd;
On Thu, Oct 24, 2024 at 07:05:10PM +0100, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > # ==== Test Assertion Failure ====
> > #   lib/kvm_util.c:724: false
> > #   pid=1947 tid=1947 errno=5 - Input/output error
> > #      1	0x0000000000404edb: __vm_mem_region_delete at kvm_util.c:724 (discriminator 5)
> > #      2	0x0000000000405d0b: kvm_vm_free at kvm_util.c:762 (discriminator 12)
> > #      3	0x0000000000402d5f: vm_gic_destroy at vgic_init.c:101
> > #      4	 (inlined by) test_vcpus_then_vgic at vgic_init.c:368
> > #      5	 (inlined by) run_tests at vgic_init.c:720
> > #      6	0x0000000000401a6f: main at vgic_init.c:748
> > #      7	0x0000ffffa7b37543: ?? ??:0
> > #      8	0x0000ffffa7b37617: ?? ??:0
> > #      9	0x0000000000401b6f: _start at ??:?
> > # KVM killed/bugged the VM, check the kernel log for clues
> > not ok 10 selftests: kvm: vgic_init # exit=254

> > which does rather look like a test bug rather than a problem in the change itself.

> Well, the test tries to do braindead things, and then the test infrastructure seems surprised that KVM tells it to bugger off...

> I can paper over it with this (see below), but frankly, someone who actually cares about this crap should take a look (and ownership).

I'm not even sure that's a terrible fix, looking at the changelog I get the impression the test is deliberately looking to do problematic things with the goal of making sure that the kernel handles them appropriately. That's not interacting well with the KVM selftest framework's general assert early assert often approach but it's a reasonable thing to want to test so relaxing the asserts like this is one way of squaring the circle.
Hi Mark, Marc,

On 10/25/24 12:54, Mark Brown wrote:
> On Thu, Oct 24, 2024 at 07:05:10PM +0100, Marc Zyngier wrote:
>> Mark Brown <broonie@kernel.org> wrote:
>
>>> # ==== Test Assertion Failure ====
>>> #   lib/kvm_util.c:724: false
>>> #   pid=1947 tid=1947 errno=5 - Input/output error
>>> #      1	0x0000000000404edb: __vm_mem_region_delete at kvm_util.c:724 (discriminator 5)
>>> #      2	0x0000000000405d0b: kvm_vm_free at kvm_util.c:762 (discriminator 12)
>>> #      3	0x0000000000402d5f: vm_gic_destroy at vgic_init.c:101
>>> #      4	 (inlined by) test_vcpus_then_vgic at vgic_init.c:368
>>> #      5	 (inlined by) run_tests at vgic_init.c:720
>>> #      6	0x0000000000401a6f: main at vgic_init.c:748
>>> #      7	0x0000ffffa7b37543: ?? ??:0
>>> #      8	0x0000ffffa7b37617: ?? ??:0
>>> #      9	0x0000000000401b6f: _start at ??:?
>>> # KVM killed/bugged the VM, check the kernel log for clues
>>> not ok 10 selftests: kvm: vgic_init # exit=254
>
>>> which does rather look like a test bug rather than a problem in the change itself.
>
>> Well, the test tries to do braindead things, and then the test infrastructure seems surprised that KVM tells it to bugger off...
>
>> I can paper over it with this (see below), but frankly, someone who actually cares about this crap should take a look (and ownership).

As I am the original contributor of the crap I can definitely have a look at it and take ownership. Those tests were originally written because the init sequence was different between kvmtool and qemu and we had regular regressions when touching the init sequence at some point. Now this may be not valid anymore ...

>
> I'm not even sure that's a terrible fix, looking at the changelog I get the impression the test is deliberately looking to do problematic things with the goal of making sure that the kernel handles them appropriately. That's not interacting well with the KVM selftest framework's general assert early assert often approach but it's a reasonable thing to want

Can you elaborate on the "assert early assert often approach". What shall this test rather do according to you?

I am OoO next week but I can have a look afterwards. On which machine is it failing?

Thanks

Eric

> to test so relaxing the asserts like this is one way of squaring the circle.
On Fri, Oct 25, 2024 at 02:18:02PM +0200, Eric Auger wrote:
> On 10/25/24 12:54, Mark Brown wrote:

>> I'm not even sure that's a terrible fix, looking at the changelog I get the impression the test is deliberately looking to do problematic things with the goal of making sure that the kernel handles them appropriately. That's not interacting well with the KVM selftest framework's general assert early assert often approach but it's a reasonable thing to want

> Can you elaborate on the "assert early assert often approach". What shall this test rather do according to you?

In general the KVM selftests are filled with asserts which just immediately cause the test to exit with a backtrace. That's certainly an approach that can be taken with testsuites, but it does make things very fragile. This means that if the test is deliberately doing something which is liable to cause errors and put the VM in a bad state then it seems relatively likely that some part of a partial cleanup will run into a spurious error caused by the earlier testing putting the VM in an error state.

> I am OoO next week but I can have a look afterwards. On which machine is it failing?

It was failing on a wide range of arm64 machines, I think every one I test (which would track with the change being very generic).
Hi Mark,

On 10/25/24 14:59, Mark Brown wrote:
> On Fri, Oct 25, 2024 at 02:18:02PM +0200, Eric Auger wrote:
>> On 10/25/24 12:54, Mark Brown wrote:
>
>>> I'm not even sure that's a terrible fix, looking at the changelog I get the impression the test is deliberately looking to do problematic things with the goal of making sure that the kernel handles them appropriately. That's not interacting well with the KVM selftest framework's general assert early assert often approach but it's a reasonable thing to want
>
>> Can you elaborate on the "assert early assert often approach". What shall this test rather do according to you?
>
> In general the KVM selftests are filled with asserts which just immediately cause the test to exit with a backtrace. That's certainly an approach that can be taken with testsuites, but it does make things very fragile. This means that if the test is deliberately doing something which is liable to cause errors and put the VM in a bad state then it seems relatively likely that some part of a partial cleanup will run into a spurious error caused by the earlier testing putting the VM in an error state.

OK I better understand now. Thank you for the clarification.

>
>> I am OoO next week but I can have a look afterwards. On which machine is it failing?
>
> It was failing on a wide range of arm64 machines, I think every one I test (which would track with the change being very generic).

Yes I can reproduce on my end.

Thanks

Eric
On Fri, 25 Oct 2024 11:54:38 +0100, Mark Brown <broonie@kernel.org> wrote:
>
> On Thu, Oct 24, 2024 at 07:05:10PM +0100, Marc Zyngier wrote:
> > Mark Brown <broonie@kernel.org> wrote:
> >
> > > # ==== Test Assertion Failure ====
> > > #   lib/kvm_util.c:724: false
> > > #   pid=1947 tid=1947 errno=5 - Input/output error
> > > #      1	0x0000000000404edb: __vm_mem_region_delete at kvm_util.c:724 (discriminator 5)
> > > #      2	0x0000000000405d0b: kvm_vm_free at kvm_util.c:762 (discriminator 12)
> > > #      3	0x0000000000402d5f: vm_gic_destroy at vgic_init.c:101
> > > #      4	 (inlined by) test_vcpus_then_vgic at vgic_init.c:368
> > > #      5	 (inlined by) run_tests at vgic_init.c:720
> > > #      6	0x0000000000401a6f: main at vgic_init.c:748
> > > #      7	0x0000ffffa7b37543: ?? ??:0
> > > #      8	0x0000ffffa7b37617: ?? ??:0
> > > #      9	0x0000000000401b6f: _start at ??:?
> > > # KVM killed/bugged the VM, check the kernel log for clues
> > > not ok 10 selftests: kvm: vgic_init # exit=254
> >
> > > which does rather look like a test bug rather than a problem in the change itself.
> >
> > Well, the test tries to do braindead things, and then the test infrastructure seems surprised that KVM tells it to bugger off...
> >
> > I can paper over it with this (see below), but frankly, someone who actually cares about this crap should take a look (and ownership).
>
> I'm not even sure that's a terrible fix, looking at the changelog I get the impression the test is deliberately looking to do problematic things with the goal of making sure that the kernel handles them appropriately. That's not interacting well with the KVM selftest framework's general assert early assert often approach but it's a reasonable thing to want to test so relaxing the asserts like this is one way of squaring the circle.

It *is* a terrible fix, since it makes no effort in finding out whether the VM is dead for a good or a bad reason. In a way, the fix is worse than the current error, because it silently hides the crap.

So this test will continue to explode until someone fixes it *properly*. And if that means rewriting the selftest harness, so be it.

	M.
On Fri, Oct 25, 2024 at 02:05:26PM +0100, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > I'm not even sure that's a terrible fix, looking at the changelog I get the impression the test is deliberately looking to do problematic things with the goal of making sure that the kernel handles them appropriately. That's not interacting well with the KVM selftest framework's general assert early assert often approach but it's a reasonable thing to want to test so relaxing the asserts like this is one way of squaring the circle.

> It *is* a terrible fix, since it makes no effort in finding out whether the VM is dead for a good or a bad reason. In a way, the fix is worse than the current error, because it silently hides the crap.

I mean, it's clearly not ideal or finished, especially in that it's just allowing the error code throughout the program rather than during a specific case, but the broad approach of flagging that the VM is expected to already be in a bad state and not support operations any more seems reasonable.
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a0d01c46e4084..b97ada19f06a7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -997,6 +997,9 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
 static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 {
 	if (kvm_request_pending(vcpu)) {
+		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
 		if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
 			kvm_vcpu_sleep(vcpu);
 
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index e7c53e8af3d16..c4cbf798e71a4 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -536,10 +536,10 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 out:
 	mutex_unlock(&kvm->arch.config_lock);
 out_slots:
-	mutex_unlock(&kvm->slots_lock);
-
 	if (ret)
-		kvm_vgic_destroy(kvm);
+		kvm_vm_dead(kvm);
+
+	mutex_unlock(&kvm->slots_lock);
 
 	return ret;
 }
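For context on what kvm_vm_dead() buys us: the dead-VM plumbing lives in generic KVM code, and it looks roughly like the sketch below (paraphrased from memory of include/linux/kvm_host.h and virt/kvm/kvm_main.c rather than quoted verbatim, so worth checking against the tree). The arm64 hunk above is what turns the pending KVM_REQ_VM_DEAD into an -EIO for a vCPU that is already sitting in the KVM_RUN loop.

/* Paraphrased sketch of the generic dead-VM machinery, not a verbatim quote. */

/* include/linux/kvm_host.h */
static inline void kvm_vm_dead(struct kvm *kvm)
{
	kvm->vm_dead = true;
	kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
}

/*
 * virt/kvm/kvm_main.c: the VM/vCPU/device ioctl handlers bail out early
 * once the VM has been marked dead, so any further interaction from
 * userspace gets -EIO, as described in the commit message below.
 */
static long kvm_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
	struct kvm *kvm = filp->private_data;

	if (kvm->mm != current->mm || kvm->vm_dead)
		return -EIO;
	/* ... */
}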
As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time.

This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised.

Similarly, a thread running vcpu-1 can enter the kernel, and try to init the GIC that was previously created. Since this GIC isn't properly configured (no memory map), it fails to correctly initialise.

And that's the point where we decide to tear down the GIC, freeing all its resources. Behind vcpu-0's back. Things stop pretty abruptly, with a variety of symptoms. Clearly, this isn't good, we should be a bit more careful about this.

It is obvious that this guest is not viable, as it is missing some important part of its configuration. So instead of trying to tear bits of it down, let's just mark it as *dead*. It means that any further interaction from userspace will result in -EIO. The memory will be released on the "normal" path, when userspace gives up.

Cc: stable@vger.kernel.org
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/arm.c            | 3 +++
 arch/arm64/kvm/vgic/vgic-init.c | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)