kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform

Message ID 1406230067-926-1-git-send-email-will.deacon@arm.com (mailing list archive)
State New, archived

Commit Message

Will Deacon July 24, 2014, 7:27 p.m. UTC
If the physical address of GICV isn't page-aligned, then we end up
creating a stage-2 mapping of the page containing it, which causes us to
map neighbouring memory locations directly into the guest.

As an example, consider a platform with GICV at physical 0x2c02f000
running a 64k-page host kernel. If qemu maps this into the guest at
0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
physical regions may cause UNPREDICTABLE behaviour, for example, on the
Juno platform this will cause an SError exception to EL3, which brings
down the entire physical CPU resulting in RCU stalls / HYP panics / host
crashing / wasted weeks of debugging.
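
To make the arithmetic in this example concrete, here is a minimal userspace
sketch (not kernel code). The names struct span and exposed_below_gicv() are
made up for illustration; only the addresses come from the example above. It
reports just the part of the containing page that lies below GICV and makes
no claim about anything that might sit above it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a span of addresses [start, end], end inclusive. */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Given the GICV physical base and the host page size, compute the part of
 * the containing host page that lies below GICV. A stage-2 mapping of that
 * whole page hands these addresses to the guest along with GICV itself.
 *
 * Returns 1 and fills *out when such a region exists, or 0 (leaving *out
 * untouched) when GICV is already page-aligned.
 */
static int exposed_below_gicv(uint64_t gicv_phys, uint64_t page_size,
			      struct span *out)
{
	uint64_t page_base;

	/* The mask trick below only works for power-of-two page sizes. */
	assert(page_size && !(page_size & (page_size - 1)));

	page_base = gicv_phys & ~(page_size - 1);
	if (page_base == gicv_phys)
		return 0;

	out->start = page_base;
	out->end = gicv_phys - 1;
	return 1;
}
```

With gicv_phys = 0x2c02f000 and a 64k page size this reports
0x2c020000 - 0x2c02efff, matching the host range above. With 4k pages the
base is already aligned and nothing is reported.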

SBSA recommends that systems alias the 4k GICV across the bounding 64k
region, in which case GICV physical could be described as 0x2c020000 in
the above scenario.

This patch fixes the problem by failing the vgic probe if the physical
address of GICV isn't page-aligned. Note that failing the probe at this
point generated a warning in dmesg about freeing enabled IRQs, so I moved
the IRQ enabling later in the probe.

Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joel Schopp <joel.schopp@amd.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---

Paolo, Gleb,

This fixes a *really* nasty bug with 64k-page hosts and KVM. I believe
Marc and Christoffer are both on holiday at the moment (not together),
so could you please take this as an urgent fix? Without it, I can trivially
bring down machines using kvm. I've checked that it applies cleanly against
-next, so you shouldn't see any conflicts during the merge window.

Thanks,

Will

 virt/kvm/arm/vgic.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Peter Maydell July 24, 2014, 7:47 p.m. UTC | #1
On 24 July 2014 20:27, Will Deacon <will.deacon@arm.com> wrote:
> If the physical address of GICV isn't page-aligned, then we end up
> creating a stage-2 mapping of the page containing it, which causes us to
> map neighbouring memory locations directly into the guest.
>
> As an example, consider a platform with GICV at physical 0x2c02f000
> running a 64k-page host kernel. If qemu maps this into the guest at
> 0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
> map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
> physical regions may cause UNPREDICTABLE behaviour, for example, on the
> Juno platform this will cause an SError exception to EL3, which brings
> down the entire physical CPU resulting in RCU stalls / HYP panics / host
> crashing / wasted weeks of debugging.

This seems to me like a specific problem with Juno rather than an
issue with having the GICV at a non-page-aligned start. The
requirement to be able to expose host GICV as the guest GICC
in a 64K pages system is just "nothing else in that 64K page
(or pages, if the GICV runs across two pages) is allowed to be
unsafe for the guest to touch", which remains true whether the
GICV starts at 0K in the 64K page or 60K.

> SBSA recommends that systems alias the 4k GICV across the bounding 64k
> region, in which case GICV physical could be described as 0x2c020000 in
> the above scenario.

The SBSA "make every 4K region in the 64K page be the same thing"
recommendation is one way of satisfying the requirement that the
whole 64K page is safe for the guest to touch. (Making the rest of
the page RAZ/WI would be another option I guess.) If your system
actually implements the SBSA recommendation then in fact
describing the GICV-phys-base as the 64K-aligned address is wrong,
because then the register at GICV-base + 4K would not be
the first register in the 2nd page of the GICV, it would be another
copy of the 1st page. This happens to work on Linux guests
currently because they don't touch anything in the 2nd page,
but for cases like device passthrough IIRC we might well like
the guest to use some of the 2nd page registers. So the only
correct choice on those systems is to specify the +60K address
as the GICV physaddr in the device tree, and use Marc's patchset
to allow QEMU/kvmtool to determine the page offset within the 64K
page so it can reflect that in the guest's device tree.
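
A rough model of the aliasing described above may help. This is only a
sketch: it assumes the GICV consists of two 4K register pages, each repeated
across its own 64K region, and decode_aliased() and struct gicv_reg are
hypothetical names, not anything from the kernel or the SBSA.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_64K	0x10000ULL

/* Hypothetical result of decoding one access to the aliased GICV. */
struct gicv_reg {
	unsigned int page;	/* which 4K GICV register page: 0 or 1 */
	uint64_t offset;	/* byte offset within that 4K page */
};

/*
 * Assumed layout: every 4K slot in the first 64K region is a copy of
 * register page 0, and every 4K slot in the second 64K region is a copy
 * of register page 1. 'off' is the byte offset from the 64K-aligned
 * start of the first region.
 */
static struct gicv_reg decode_aliased(uint64_t off)
{
	struct gicv_reg r;

	assert(off < 2 * SZ_64K);

	r.page = (unsigned int)(off / SZ_64K);
	r.offset = off % SZ_4K;
	return r;
}
```

Under that assumption, a base at the 64K-aligned address gives page 0 at
both base and base + 4K, whereas a base at +60K gives page 0 followed by
page 1 at base + 4K, which is the contiguous 8K view a guest expects.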

I can't think of any way of determining whether a particular
system gets this right or wrong automatically, which suggests
perhaps we need to allow the device tree to specify that the
GICV is 64k-page-safe...

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Will Deacon July 24, 2014, 7:55 p.m. UTC | #2
On Thu, Jul 24, 2014 at 08:47:23PM +0100, Peter Maydell wrote:
> On 24 July 2014 20:27, Will Deacon <will.deacon@arm.com> wrote:
> > If the physical address of GICV isn't page-aligned, then we end up
> > creating a stage-2 mapping of the page containing it, which causes us to
> > map neighbouring memory locations directly into the guest.
> >
> > As an example, consider a platform with GICV at physical 0x2c02f000
> > running a 64k-page host kernel. If qemu maps this into the guest at
> > 0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
> > map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
> > physical regions may cause UNPREDICTABLE behaviour, for example, on the
> > Juno platform this will cause an SError exception to EL3, which brings
> > down the entire physical CPU resulting in RCU stalls / HYP panics / host
> > crashing / wasted weeks of debugging.
> 
> This seems to me like a specific problem with Juno rather than an
> issue with having the GICV at a non-page-aligned start. The
> requirement to be able to expose host GICV as the guest GICC
> in a 64K pages system is just "nothing else in that 64K page
> (or pages, if the GICV runs across two pages) is allowed to be
> unsafe for the guest to touch", which remains true whether the
> GICV starts at 0K in the 64K page or 60K.

I agree, and for that we would need a new ioctl so we can query the
page-offset of the GICV on systems where it is safe. Given that such an
ioctl doesn't exist today, I would like to plug the hole in mainline kernels
with this patch. We can relax the check in the future if systems appear where
it would be safe to map the entire 64k region.

> > SBSA recommends that systems alias the 4k GICV across the bounding 64k
> > region, in which case GICV physical could be described as 0x2c020000 in
> > the above scenario.
> 
> The SBSA "make every 4K region in the 64K page be the same thing"
> recommendation is one way of satisfying the requirement that the
> whole 64K page is safe for the guest to touch. (Making the rest of
> the page RAZ/WI would be another option I guess.) If your system
> actually implements the SBSA recommendation then in fact
> describing the GICV-phys-base as the 64K-aligned address is wrong,
> because then the register at GICV-base + 4K would not be
> the first register in the 2nd page of the GICV, it would be another
> copy of the 1st page. This happens to work on Linux guests
> currently because they don't touch anything in the 2nd page,
> but for cases like device passthrough IIRC we might well like
> the guest to use some of the 2nd page registers. So the only
> correct choice on those systems is to specify the +60K address
> as the GICV physaddr in the device tree, and use Marc's patchset
> to allow QEMU/kvmtool to determine the page offset within the 64K
> page so it can reflect that in the guest's device tree.

Again, that can be solved by introducing Marc's attr for determining the
GICV offset within the 64k page. I don't think that's -stable material.

> I can't think of any way of determining whether a particular
> system gets this right or wrong automatically, which suggests
> perhaps we need to allow the device tree to specify that the
> GICV is 64k-page-safe...

When we support such systems, I also think we'll need a device-tree change.
My main concern right now is stopping the ability to hose the entire machine
by trying to instantiate a virtual GIC.

Will
Joel Schopp July 24, 2014, 7:59 p.m. UTC | #3
On 07/24/2014 02:47 PM, Peter Maydell wrote:
> On 24 July 2014 20:27, Will Deacon <will.deacon@arm.com> wrote:
>> If the physical address of GICV isn't page-aligned, then we end up
>> creating a stage-2 mapping of the page containing it, which causes us to
>> map neighbouring memory locations directly into the guest.
>>
>> As an example, consider a platform with GICV at physical 0x2c02f000
>> running a 64k-page host kernel. If qemu maps this into the guest at
>> 0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
>> map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
>> physical regions may cause UNPREDICTABLE behaviour, for example, on the
>> Juno platform this will cause an SError exception to EL3, which brings
>> down the entire physical CPU resulting in RCU stalls / HYP panics / host
>> crashing / wasted weeks of debugging.
> This seems to me like a specific problem with Juno rather than an
> issue with having the GICV at a non-page-aligned start. The
> requirement to be able to expose host GICV as the guest GICC
> in a 64K pages system is just "nothing else in that 64K page
> (or pages, if the GICV runs across two pages) is allowed to be
> unsafe for the guest to touch", which remains true whether the
> GICV starts at 0K in the 64K page or 60K.
>
>> SBSA recommends that systems alias the 4k GICV across the bounding 64k
>> region, in which case GICV physical could be described as 0x2c020000 in
>> the above scenario.
> The SBSA "make every 4K region in the 64K page be the same thing"
> recommendation is one way of satisfying the requirement that the
> whole 64K page is safe for the guest to touch. (Making the rest of
> the page RAZ/WI would be another option I guess.) If your system
> actually implements the SBSA recommendation then in fact
> describing the GICV-phys-base as the 64K-aligned address is wrong,
> because then the register at GICV-base + 4K would not be
> the first register in the 2nd page of the GICV, it would be another
> copy of the 1st page. This happens to work on Linux guests
> currently because they don't touch anything in the 2nd page,
> but for cases like device passthrough IIRC we might well like
> the guest to use some of the 2nd page registers. So the only
> correct choice on those systems is to specify the +60K address
> as the GICV physaddr in the device tree, and use Marc's patchset
> to allow QEMU/kvmtool to determine the page offset within the 64K
> page so it can reflect that in the guest's device tree.
I have one of those systems specifying +60K address as the GICV physaddr
and it works well for me with 64K pages and kvm with both QEMU and kvmtool.

>
> I can't think of any way of determining whether a particular
> system gets this right or wrong automatically, which suggests
> perhaps we need to allow the device tree to specify that the
> GICV is 64k-page-safe...
I don't have a better solution, despite my lack of enthusiasm for yet
another device tree property.
Joel Schopp July 24, 2014, 8:01 p.m. UTC | #4
On 07/24/2014 02:55 PM, Will Deacon wrote:
> On Thu, Jul 24, 2014 at 08:47:23PM +0100, Peter Maydell wrote:
>> On 24 July 2014 20:27, Will Deacon <will.deacon@arm.com> wrote:
>>> If the physical address of GICV isn't page-aligned, then we end up
>>> creating a stage-2 mapping of the page containing it, which causes us to
>>> map neighbouring memory locations directly into the guest.
>>>
>>> As an example, consider a platform with GICV at physical 0x2c02f000
>>> running a 64k-page host kernel. If qemu maps this into the guest at
>>> 0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
>>> map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
>>> physical regions may cause UNPREDICTABLE behaviour, for example, on the
>>> Juno platform this will cause an SError exception to EL3, which brings
>>> down the entire physical CPU resulting in RCU stalls / HYP panics / host
>>> crashing / wasted weeks of debugging.
>> This seems to me like a specific problem with Juno rather than an
>> issue with having the GICV at a non-page-aligned start. The
>> requirement to be able to expose host GICV as the guest GICC
>> in a 64K pages system is just "nothing else in that 64K page
>> (or pages, if the GICV runs across two pages) is allowed to be
>> unsafe for the guest to touch", which remains true whether the
>> GICV starts at 0K in the 64K page or 60K.
> I agree, and for that we would need a new ioctl so we can query the
> page-offset of the GICV on systems where it is safe. Given that such an
> ioctl doesn't exist today, I would like to plug the hole in mainline kernels
> with this patch. We can relax the check in the future if systems appear where
> it would be safe to map the entire 64k region.
I have such a system. 
Peter Maydell July 24, 2014, 8:05 p.m. UTC | #5
On 24 July 2014 20:55, Will Deacon <will.deacon@arm.com> wrote:
> Again, that can be solved by introducing Marc's attr for determining the
> GICV offset within the 64k page. I don't think that's -stable material.

Agreed that we don't want to put Marc's patchset in -stable
(and that without it systems with GICV in their host devicetree
at pagebase+60K are unusable, so we're not actually regressing
anything if we put this into stable). But...

>> I can't think of any way of determining whether a particular
>> system gets this right or wrong automatically, which suggests
>> perhaps we need to allow the device tree to specify that the
>> GICV is 64k-page-safe...
>
> When we support such systems, I also think we'll need a device-tree change.
> My main concern right now is stopping the ability to hose the entire machine
> by trying to instantiate a virtual GIC.

...I don't see how your patch prevents instantiating a VGIC
and hosing the machine on a system where the 64K page
with the GICV registers in it looks like this:

 [GICV registers] [machine blows up if you read this]
 0K               8K                                64K

Whether the 64K page contains Bad Stuff is completely
orthogonal to whether the device tree offset the host has
for the GICV is 0K, 60K or anything in between. What you
should be checking for is "is this system design broken?",
which is probably a device tree attribute.

thanks
-- PMM

Patch

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 56ff9bebb577..fa9a95b3ed19 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1526,17 +1526,25 @@  int kvm_vgic_hyp_init(void)
 		goto out_unmap;
 	}
 
-	kvm_info("%s@%llx IRQ%d\n", vgic_node->name,
-		 vctrl_res.start, vgic_maint_irq);
-	on_each_cpu(vgic_init_maintenance_interrupt, NULL, 1);
-
 	if (of_address_to_resource(vgic_node, 3, &vcpu_res)) {
 		kvm_err("Cannot obtain VCPU resource\n");
 		ret = -ENXIO;
 		goto out_unmap;
 	}
+
+	if (!PAGE_ALIGNED(vcpu_res.start)) {
+		kvm_err("GICV physical address 0x%llx not page aligned\n",
+			(unsigned long long)vcpu_res.start);
+		ret = -ENXIO;
+		goto out_unmap;
+	}
+
 	vgic_vcpu_base = vcpu_res.start;
 
+	kvm_info("%s@%llx IRQ%d\n", vgic_node->name,
+		 vctrl_res.start, vgic_maint_irq);
+	on_each_cpu(vgic_init_maintenance_interrupt, NULL, 1);
+
 	goto out;
 
 out_unmap:
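
For reference, the effect of the new check can be modelled outside the
kernel. This is a sketch under stated assumptions: it takes PAGE_ALIGNED()
to mean that the address bits below the page size are all zero, and
page_aligned() and check_gicv() are stand-in names that do not exist in
vgic.c. The page size is a parameter here so that 4k and 64k hosts can be
compared side by side.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace stand-in for PAGE_ALIGNED(), with an explicit page size. */
static int page_aligned(uint64_t addr, uint64_t page_size)
{
	/* Only power-of-two page sizes make sense for the mask below. */
	assert(page_size && !(page_size & (page_size - 1)));

	return (addr & (page_size - 1)) == 0;
}

/*
 * Mirrors the decision the patch adds to the probe: refuse with -ENXIO
 * when the GICV physical base is not aligned to the host page size.
 */
static int check_gicv(uint64_t gicv_phys, uint64_t page_size)
{
	return page_aligned(gicv_phys, page_size) ? 0 : -ENXIO;
}
```

A GICV at 0x2c02f000 passes on a 4k-page host and is rejected on a 64k-page
host, while a GICV described at 0x2c020000 passes on both.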