diff mbox series

[RFC] KVM: arm64: allow ID_MMFR4_EL1 to be writable

Message ID E1s2DPv-00AhaA-Ra@rmk-PC.armlinux.org.uk (mailing list archive)
State New, archived
Headers show
Series [RFC] KVM: arm64: allow ID_MMFR4_EL1 to be writable | expand

Commit Message

Russell King (Oracle) May 1, 2024, 5:06 p.m. UTC
Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
changed their value on the same Neoverse N1 r3p1 hardware which makes
migrating between these kernels on the host problematical.

We already permit changing HPDS in AA64MMFR1_EL1 and CnP in
AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1,
so this patch includes support for that even though it isn't required.

Discussing with Marc Zygnier, AC2 should also be fine to be writable,
even though we don't inject an UNDEF if the guest accesses those
registers.

The only questionable one is XNX, execute-never control distinction,
which is also in AA64MMFR1_EL1 but we don't allow to be changed there.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/kvm/sys_regs.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Comments

Oliver Upton May 1, 2024, 5:57 p.m. UTC | #1
Hi Russell,

On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> changed their value on the same Neoverse N1 r3p1 hardware which makes
> migrating between these kernels on the host problematical.

It'd be helpful to expand a bit more on how these fields changed, better
yet if we can blame it back to a commit. I'm guessing the only direction
of migration you care about is old -> new then?

> We already permit changing HPDS in AA64MMFR1_EL1 and CnP in
> AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1,
> so this patch includes support for that even though it isn't required.
> 
> Discussing with Marc Zygnier, AC2 should also be fine to be writable,

typo: Zyngier

> even though we don't inject an UNDEF if the guest accesses those
> registers.
> 
> The only questionable one is XNX, execute-never control distinction,
> which is also in AA64MMFR1_EL1 but we don't allow to be changed there.

It is quite odd for us to expose this field to non-nested VMs in the
first place, though I suppose we will apply an additional set of
restrictions for nested VMs when they come along.
Russell King (Oracle) May 1, 2024, 6:08 p.m. UTC | #2
On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote:
> Hi Russell,
> 
> On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> > changed their value on the same Neoverse N1 r3p1 hardware which makes
> > migrating between these kernels on the host problematical.
> 
> It'd be helpful to expand a bit more on how these fields changed, better
> yet if we can blame it back to a commit. I'm guessing the only direction
> of migration you care about is old -> new then?

Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110
with our 5.15 kernel. I've been looking at tracking down which commit
is responsible but I've come up with nothing that fits.

The only change I can see is the FTR definition for MMFR4, but this
always included 4:7 (AC2) which changed 0 -> 1. So... no idea what
commit caused the change.

There are a load of other registers that we need sorting, but this
is just a test forray into attempting to solve this.

> 
> > We already permit changing HPDS in AA64MMFR1_EL1 and CnP in
> > AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1,
> > so this patch includes support for that even though it isn't required.
> > 
> > Discussing with Marc Zygnier, AC2 should also be fine to be writable,
> 
> typo: Zyngier
> 
> > even though we don't inject an UNDEF if the guest accesses those
> > registers.
> > 
> > The only questionable one is XNX, execute-never control distinction,
> > which is also in AA64MMFR1_EL1 but we don't allow to be changed there.
> 
> It is quite odd for us to expose this field to non-nested VMs in the
> first place, though I suppose we will apply an additional set of
> restrictions for nested VMs when they come along.

Yes, it did strike me as odd, since the description seems to imply that
XNX affects EL2, which the VM wouldn't have access to. So I'm not sure
why we don't just force it to zero.
Oliver Upton May 1, 2024, 6:59 p.m. UTC | #3
On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote:
> > Hi Russell,
> > 
> > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> > > changed their value on the same Neoverse N1 r3p1 hardware which makes
> > > migrating between these kernels on the host problematical.
> > 
> > It'd be helpful to expand a bit more on how these fields changed, better
> > yet if we can blame it back to a commit. I'm guessing the only direction
> > of migration you care about is old -> new then?
> 
> Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110
> with our 5.15 kernel. I've been looking at tracking down which commit
> is responsible but I've come up with nothing that fits.
> 
> The only change I can see is the FTR definition for MMFR4, but this
> always included 4:7 (AC2) which changed 0 -> 1. So... no idea what
> commit caused the change.
> 
> There are a load of other registers that we need sorting, but this
> is just a test forray into attempting to solve this.

Got it, let me see if I can find it then. Do share that list of
problematic registers when you have it, hopefully this isn't the tip of
the iceberg...

> > 
> > > We already permit changing HPDS in AA64MMFR1_EL1 and CnP in
> > > AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1,
> > > so this patch includes support for that even though it isn't required.
> > > 
> > > Discussing with Marc Zygnier, AC2 should also be fine to be writable,
> > 
> > typo: Zyngier
> > 
> > > even though we don't inject an UNDEF if the guest accesses those
> > > registers.
> > > 
> > > The only questionable one is XNX, execute-never control distinction,
> > > which is also in AA64MMFR1_EL1 but we don't allow to be changed there.
> > 
> > It is quite odd for us to expose this field to non-nested VMs in the
> > first place, though I suppose we will apply an additional set of
> > restrictions for nested VMs when they come along.
> 
> Yes, it did strike me as odd, since the description seems to imply that
> XNX affects EL2, which the VM wouldn't have access to. So I'm not sure
> why we don't just force it to zero.

Probably because we failed to catch it in the first place and setting to
0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate
issues with the patch, especially since it is fixing a genuine UAPI
breakage in KVM.
Russell King (Oracle) May 1, 2024, 7:51 p.m. UTC | #4
On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
> On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote:
> > > Hi Russell,
> > > 
> > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> > > > changed their value on the same Neoverse N1 r3p1 hardware which makes
> > > > migrating between these kernels on the host problematical.
> > > 
> > > It'd be helpful to expand a bit more on how these fields changed, better
> > > yet if we can blame it back to a commit. I'm guessing the only direction
> > > of migration you care about is old -> new then?
> > 
> > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110
> > with our 5.15 kernel. I've been looking at tracking down which commit
> > is responsible but I've come up with nothing that fits.
> > 
> > The only change I can see is the FTR definition for MMFR4, but this
> > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what
> > commit caused the change.
> > 
> > There are a load of other registers that we need sorting, but this
> > is just a test forray into attempting to solve this.
> 
> Got it, let me see if I can find it then. Do share that list of
> problematic registers when you have it, hopefully this isn't the tip of
> the iceberg...

There unfortunately is an iceberg, but hopefully it isn't big enough to
sink a ship!

Besides ID_MMFR4_EL1, here are the other differences we've identified.
Note that these are Oracle's UEK kernels, so based on stable kernel
branches.

Register		Field		5.4.x	5.15.x
ID_PFR0_EL1		CSV2		0	1
ID_ISAR6_EL1		DP		0	1
ID_PFR2_EL1		SSBS		0	1
			CSV3		0	1
ID_AA64DFR0_EL1		PMSVer		1	0
			DebugVer	8	6
ID_AA64MMFR1_EL1	XNX		0	1
ID_AA64MMFR2_EL1	EVT		0	1
KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2
					0x12	0

I think some of these differences are due to Marc's removal of the
WA2 code in commit 29e8910a566a ("KVM: arm64: Simplify handling of
ARCH_WORKAROUND_2"). The WA2 register for example has changed from
avail/enabled to not_avail.
Russell King (Oracle) May 2, 2024, 10:50 a.m. UTC | #5
On Wed, May 01, 2024 at 08:51:15PM +0100, Russell King (Oracle) wrote:
> On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
> > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote:
> > > > Hi Russell,
> > > > 
> > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes
> > > > > migrating between these kernels on the host problematical.
> > > > 
> > > > It'd be helpful to expand a bit more on how these fields changed, better
> > > > yet if we can blame it back to a commit. I'm guessing the only direction
> > > > of migration you care about is old -> new then?
> > > 
> > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110
> > > with our 5.15 kernel. I've been looking at tracking down which commit
> > > is responsible but I've come up with nothing that fits.
> > > 
> > > The only change I can see is the FTR definition for MMFR4, but this
> > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what
> > > commit caused the change.
> > > 
> > > There are a load of other registers that we need sorting, but this
> > > is just a test forray into attempting to solve this.
> > 
> > Got it, let me see if I can find it then. Do share that list of
> > problematic registers when you have it, hopefully this isn't the tip of
> > the iceberg...
> 
> There unfortunately is an iceberg, but hopefully it isn't big enough to
> sink a ship!
> 
> Besides ID_MMFR4_EL1, here are the other differences we've identified.
> Note that these are Oracle's UEK kernels, so based on stable kernel
> branches.
> 
> Register		Field		5.4.x	5.15.x
> ID_PFR0_EL1		CSV2		0	1
> ID_ISAR6_EL1		DP		0	1
> ID_PFR2_EL1		SSBS		0	1
> 			CSV3		0	1
> ID_AA64DFR0_EL1		PMSVer		1	0
> 			DebugVer	8	6
> ID_AA64MMFR1_EL1	XNX		0	1
> ID_AA64MMFR2_EL1	EVT		0	1
> KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2
> 					0x12	0

I'm finding sys_regs.c very unintuitive for working out what we allow
to be written, because it's all coded in negative-logic. By that I mean
the mask values are all ~(what-we-don't-allow) rather than a positive
this-is-what-we-allow. So I've ended up creating a table, looking up
the registers and working out what's read-only and what's read-write.

From that, I can see (for example) that from the ISAR6_EL1 register,
the field names appear in the AA64ISAR0_EL1 and AA64ISAR1_EL1
registers, and all non-res0 fields are writable. It is therefore my
intention to submit a patch doing this:

-       AA32_ID_SANITISED(ID_ISAR6_EL1),
+       AA32_ID_WRITABLE(ID_ISAR6_EL1, ID_ISAR6_EL1_I8MM |
+                                      ID_ISAR6_EL1_BF16 |
+                                      ID_ISAR6_EL1_SPECRES |
+                                      ID_ISAR6_EL1_SB |
+                                      ID_ISAR6_EL1_FHM |
+                                      ID_ISAR6_EL1_DP |
+                                      ID_ISAR6_EL1_JSCVT),

which, like the MMFR4 patch, uses positive logic for what we allow
to be changed, even though this is equivalent to ~ID_ISAR6_EL1_RES0
which tells us nothing without either looking up in the spec, or
looking at the generated sysreg-defs.h to figure it out.
Russell King (Oracle) May 2, 2024, 2:40 p.m. UTC | #6
On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
> On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> > Yes, it did strike me as odd, since the description seems to imply that
> > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure
> > why we don't just force it to zero.
> 
> Probably because we failed to catch it in the first place and setting to
> 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate
> issues with the patch, especially since it is fixing a genuine UAPI
> breakage in KVM.

I think the only two ways around this would be to:

1) teach QEMU about the contents of these registers, with which fields
   in these registers can be ignored when reloading a VMs context.

2) allow userspace to write to the XNX field such that it can be set
   to values seen with previous kernels (thus allowing at least one-
   way migration.)

(1) has the advantage that reloading a VM state on older vs newer
kernels can work in either direction, whereas (2) would only work
for state saved on an older kernel loaded onto a newer kernel.

I've been bitten before with KVM differences between kernel versions
in the past - where the number of registers that userspace sees has
changed despite being on the same hardware. I'm now wondering what
testing goes on to validate this part of the UAPI across different
kernel versions on the same hardware.

Surely, given the impact of these changing value (a failure of VMs),
we should be aware whenever any of these registers are different from
one kernel to another on the same hardware?

It's fine if all one cares about is starting fresh VMs on each host
kernel reboot, but any use case beyond that seems to be fragile.

I did knock up a test program that dumped the list of registers so
that one could trivially diff the output between various kernels.
Maybe I need to extend that to dump the register values themselves,
and then maybe we need to find a way to get some kind of automated
testing setup to highlight differences. (something maybe kernelci
could add?)
Marc Zyngier May 2, 2024, 3:23 p.m. UTC | #7
On Thu, 02 May 2024 11:50:10 +0100,
"Russell King (Oracle)" <linux@armlinux.org.uk> wrote:
> 
> On Wed, May 01, 2024 at 08:51:15PM +0100, Russell King (Oracle) wrote:
> > On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
> > > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> > > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote:
> > > > > Hi Russell,
> > > > > 
> > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> > > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> > > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes
> > > > > > migrating between these kernels on the host problematical.
> > > > > 
> > > > > It'd be helpful to expand a bit more on how these fields changed, better
> > > > > yet if we can blame it back to a commit. I'm guessing the only direction
> > > > > of migration you care about is old -> new then?
> > > > 
> > > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110
> > > > with our 5.15 kernel. I've been looking at tracking down which commit
> > > > is responsible but I've come up with nothing that fits.
> > > > 
> > > > The only change I can see is the FTR definition for MMFR4, but this
> > > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what
> > > > commit caused the change.
> > > > 
> > > > There are a load of other registers that we need sorting, but this
> > > > is just a test forray into attempting to solve this.
> > > 
> > > Got it, let me see if I can find it then. Do share that list of
> > > problematic registers when you have it, hopefully this isn't the tip of
> > > the iceberg...
> > 
> > There unfortunately is an iceberg, but hopefully it isn't big enough to
> > sink a ship!
> > 
> > Besides ID_MMFR4_EL1, here are the other differences we've identified.
> > Note that these are Oracle's UEK kernels, so based on stable kernel
> > branches.
> > 
> > Register		Field		5.4.x	5.15.x
> > ID_PFR0_EL1		CSV2		0	1
> > ID_ISAR6_EL1		DP		0	1
> > ID_PFR2_EL1		SSBS		0	1
> > 			CSV3		0	1
> > ID_AA64DFR0_EL1		PMSVer		1	0
> > 			DebugVer	8	6
> > ID_AA64MMFR1_EL1	XNX		0	1
> > ID_AA64MMFR2_EL1	EVT		0	1
> > KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2
> > 					0x12	0
> 
> I'm finding sys_regs.c very unintuitive for working out what we allow
> to be written, because it's all coded in negative-logic. By that I mean
> the mask values are all ~(what-we-don't-allow) rather than a positive
> this-is-what-we-allow. So I've ended up creating a table, looking up
> the registers and working out what's read-only and what's read-write.

[...]

Using positive or negative logic doesn't really have any impact on the
result. It often is a matter of concisely expressing what is allowed
or not.

There is also the fact that a lot of the KVM code now checks for
"feature downgrade" and enforces it. Construct such as:

	if (!kvm_has_feat(kvm, ID_AA64ISAR0_EL1, TLB, OS))
		kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_TLBIRVAALE1OS|
						HFGITR_EL2_TLBIRVALE1OS	|
						HFGITR_EL2_TLBIRVAAE1OS	|
						HFGITR_EL2_TLBIRVAE1OS	|
						HFGITR_EL2_TLBIVAALE1OS	|
						HFGITR_EL2_TLBIVALE1OS	|
						HFGITR_EL2_TLBIVAAE1OS	|
						HFGITR_EL2_TLBIASIDE1OS	|
						HFGITR_EL2_TLBIVAE1OS	|
						HFGITR_EL2_TLBIVMALLE1OS);

use negative logic by expressing what we don't want to enable.

In the end, consistency matters.

	M.
Oliver Upton May 2, 2024, 4:45 p.m. UTC | #8
On Thu, May 02, 2024 at 03:40:38PM +0100, Russell King (Oracle) wrote:
> On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
> > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> > > Yes, it did strike me as odd, since the description seems to imply that
> > > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure
> > > why we don't just force it to zero.
> > 
> > Probably because we failed to catch it in the first place and setting to
> > 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate
> > issues with the patch, especially since it is fixing a genuine UAPI
> > breakage in KVM.
> 
> I think the only two ways around this would be to:
> 
> 1) teach QEMU about the contents of these registers, with which fields
>    in these registers can be ignored when reloading a VMs context.
> 
> 2) allow userspace to write to the XNX field such that it can be set
>    to values seen with previous kernels (thus allowing at least one-
>    way migration.)
> 
> (1) has the advantage that reloading a VM state on older vs newer
> kernels can work in either direction, whereas (2) would only work
> for state saved on an older kernel loaded onto a newer kernel.

Yeah, so this is something that has affected my employer as well.

I think (1) should only be expected of VMMs that want rollback safety,
i.e. the ability to migrate state back to an older kernel. Our userspace
initializes vCPUs from a fixed set of feature ID register values that
prevents VMs on new kernels from picking up new CPU features.

It is quite tedious, but necessary as rollback safety is very much a
non-goal of the KVM UAPI.

OTOH, in cases where KVM screws up and breaks UAPI, the kernel needs to
do something special to accept the previously-advertised state even if
it were nonsensical.

For example, there was a bug where KVM advertised an IMP DEF PMU to VMs
even though the only thing KVM virtualizes is PMUv3. We fixed it in
commit f90f9360c3d7 ("KVM: arm64: Rewrite IMPDEF PMU version as NI") by
accepting the old value in the ioctl and changing the field to NI
internally.

I dislike these sort of hacks, but when we're caught between upholding
UAPI and the architecture it seems to be the best option. I wonder if an
approach similar to this would be sufficient to address the SPE change
that you noticed.

> I've been bitten before with KVM differences between kernel versions
> in the past - where the number of registers that userspace sees has
> changed despite being on the same hardware.

This is intended behavior, as VMs are initialized to the maximum feature
set KVM is able to support. Forward-compatibility for the set of exposed
registers is tested, see the get-reg-list selftest.

> I'm now wondering what
> testing goes on to validate this part of the UAPI across different
> kernel versions on the same hardware.

We may've been a bit more relaxed in the past with this, but in recent
history we've been careful about preserving UAPI. On top of that, we now
have some generalized infrastructure for dealing with these things by
way of the 'writable' / mutable ID register work.

Although it isn't precisely what you're looking for, the set_id_regs
selftest ensures we at least accept features that are valid for the
underlying HW platform and explicitly tests downgrades.

> I did knock up a test program that dumped the list of registers so
> that one could trivially diff the output between various kernels.
> Maybe I need to extend that to dump the register values themselves,
> and then maybe we need to find a way to get some kind of automated
> testing setup to highlight differences. (something maybe kernelci
> could add?)

Given the absolutely massive test matrix of implementations and kernel
versions I question our ability to support this. Additionally the only
thing we'd care about upstream is the unsafe removal of a feature.

Nevertheless, a starting point could be to drop some pr_info() into
set_id_regs.c to print the starting value of the ID registers which
could be diffed.
Russell King (Oracle) May 7, 2024, 9:27 a.m. UTC | #9
On Thu, May 02, 2024 at 04:23:10PM +0100, Marc Zyngier wrote:
> On Thu, 02 May 2024 11:50:10 +0100,
> "Russell King (Oracle)" <linux@armlinux.org.uk> wrote:
> > 
> > On Wed, May 01, 2024 at 08:51:15PM +0100, Russell King (Oracle) wrote:
> > > On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
> > > > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
> > > > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote:
> > > > > > Hi Russell,
> > > > > > 
> > > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote:
> > > > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2
> > > > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes
> > > > > > > migrating between these kernels on the host problematical.
> > > > > > 
> > > > > > It'd be helpful to expand a bit more on how these fields changed, better
> > > > > > yet if we can blame it back to a commit. I'm guessing the only direction
> > > > > > of migration you care about is old -> new then?
> > > > > 
> > > > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110
> > > > > with our 5.15 kernel. I've been looking at tracking down which commit
> > > > > is responsible but I've come up with nothing that fits.
> > > > > 
> > > > > The only change I can see is the FTR definition for MMFR4, but this
> > > > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what
> > > > > commit caused the change.
> > > > > 
> > > > > There are a load of other registers that we need sorting, but this
> > > > > is just a test forray into attempting to solve this.
> > > > 
> > > > Got it, let me see if I can find it then. Do share that list of
> > > > problematic registers when you have it, hopefully this isn't the tip of
> > > > the iceberg...
> > > 
> > > There unfortunately is an iceberg, but hopefully it isn't big enough to
> > > sink a ship!
> > > 
> > > Besides ID_MMFR4_EL1, here are the other differences we've identified.
> > > Note that these are Oracle's UEK kernels, so based on stable kernel
> > > branches.
> > > 
> > > Register		Field		5.4.x	5.15.x
> > > ID_PFR0_EL1		CSV2		0	1
> > > ID_ISAR6_EL1		DP		0	1
> > > ID_PFR2_EL1		SSBS		0	1
> > > 			CSV3		0	1
> > > ID_AA64DFR0_EL1		PMSVer		1	0
> > > 			DebugVer	8	6
> > > ID_AA64MMFR1_EL1	XNX		0	1
> > > ID_AA64MMFR2_EL1	EVT		0	1
> > > KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2
> > > 					0x12	0
> > 
> > I'm finding sys_regs.c very unintuitive for working out what we allow
> > to be written, because it's all coded in negative-logic. By that I mean
> > the mask values are all ~(what-we-don't-allow) rather than a positive
> > this-is-what-we-allow. So I've ended up creating a table, looking up
> > the registers and working out what's read-only and what's read-write.
> 
> [...]
> 
> Using positive or negative logic doesn't really have any impact on the
> result. It often is a matter of concisely expressing what is allowed
> or not.
> 
> There is also the fact that a lot of the KVM code now checks for
> "feature downgrade" and enforces it. Construct such as:
> 
> 	if (!kvm_has_feat(kvm, ID_AA64ISAR0_EL1, TLB, OS))
> 		kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_TLBIRVAALE1OS|
> 						HFGITR_EL2_TLBIRVALE1OS	|
> 						HFGITR_EL2_TLBIRVAAE1OS	|
> 						HFGITR_EL2_TLBIRVAE1OS	|
> 						HFGITR_EL2_TLBIVAALE1OS	|
> 						HFGITR_EL2_TLBIVALE1OS	|
> 						HFGITR_EL2_TLBIVAAE1OS	|
> 						HFGITR_EL2_TLBIASIDE1OS	|
> 						HFGITR_EL2_TLBIVAE1OS	|
> 						HFGITR_EL2_TLBIVMALLE1OS);
> 
> use negative logic by expressing what we don't want to enable.
> 
> In the end, consistency matters.

Is that a request to change my patch?

I'm unclear whether anyone wants changes to it.
Cornelia Huck May 8, 2024, 12:06 p.m. UTC | #10
On Thu, May 02 2024, Oliver Upton <oliver.upton@linux.dev> wrote:

> On Thu, May 02, 2024 at 03:40:38PM +0100, Russell King (Oracle) wrote:
>> On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote:
>> > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote:
>> > > Yes, it did strike me as odd, since the description seems to imply that
>> > > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure
>> > > why we don't just force it to zero.
>> > 
>> > Probably because we failed to catch it in the first place and setting to
>> > 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate
>> > issues with the patch, especially since it is fixing a genuine UAPI
>> > breakage in KVM.
>> 
>> I think the only two ways around this would be to:
>> 
>> 1) teach QEMU about the contents of these registers, with which fields
>>    in these registers can be ignored when reloading a VMs context.
>> 
>> 2) allow userspace to write to the XNX field such that it can be set
>>    to values seen with previous kernels (thus allowing at least one-
>>    way migration.)
>> 
>> (1) has the advantage that reloading a VM state on older vs newer
>> kernels can work in either direction, whereas (2) would only work
>> for state saved on an older kernel loaded onto a newer kernel.
>
> Yeah, so this is something that has affected my employer as well.
>
> I think (1) should only be expected of VMMs that want rollback safety,
> i.e. the ability to migrate state back to an older kernel. Our userspace
> initializes vCPUs from a fixed set of feature ID register values that
> prevents VMs on new kernels from picking up new CPU features.
>
> It is quite tedious, but necessary as rollback safety is very much a
> non-goal of the KVM UAPI.

Depending on your use case, rollback safety might be quite
important... have we ever stated exactly which guarantees the KVM UAPI
is giving? IOW, can someone implementing a VMM look at a doc and see
"oh, if I want to be able to go backwards, I need to be able to deal
with x, y, and z coming up on the new kernel"?

>
> OTOH, in cases where KVM screws up and breaks UAPI, the kernel needs to
> do something special to accept the previously-advertised state even if
> it were nonsensical.
>
> For example, there was a bug where KVM advertised an IMP DEF PMU to VMs
> even though the only thing KVM virtualizes is PMUv3. We fixed it in
> commit f90f9360c3d7 ("KVM: arm64: Rewrite IMPDEF PMU version as NI") by
> accepting the old value in the ioctl and changing the field to NI
> internally.
>
> I dislike these sort of hacks, but when we're caught between upholding
> UAPI and the architecture it seems to be the best option. I wonder if an
> approach similar to this would be sufficient to address the SPE change
> that you noticed.
>
>> I've been bitten before with KVM differences between kernel versions
>> in the past - where the number of registers that userspace sees has
>> changed despite being on the same hardware.
>
> This is intended behavior, as VMs are initialized to the maximum feature
> set KVM is able to support. Forward-compatibility for the set of exposed
> registers is tested, see the get-reg-list selftest.

I've seen this problem come up as well; if it is clear that this is
something that KVM expects the VMM to handle if needed, that is fine;
should we consider "it's tested in a selftest" as a canonical indicator
of "this is what KVM supports compatibility wise"?
Oliver Upton May 8, 2024, 5:14 p.m. UTC | #11
Hi Cornelia,

On Wed, May 08, 2024 at 02:06:36PM +0200, Cornelia Huck wrote:
> On Thu, May 02 2024, Oliver Upton <oliver.upton@linux.dev> wrote:
> > I think (1) should only be expected of VMMs that want rollback safety,
> > i.e. the ability to migrate state back to an older kernel. Our userspace
> > initializes vCPUs from a fixed set of feature ID register values that
> > prevents VMs on new kernels from picking up new CPU features.
> >
> > It is quite tedious, but necessary as rollback safety is very much a
> > non-goal of the KVM UAPI.
> 
> Depending on your use case, rollback safety might be quite
> important... have we ever stated exactly which guarantees the KVM UAPI
> is giving? IOW, can someone implementing a VMM look at a doc and see
> "oh, if I want to be able to go backwards, I need to be able to deal
> with x, y, and z coming up on the new kernel"?

The behavior of KVM/arm64 has always been that new VMs get the maximum
set of vCPU features supported by KVM / hardware modulo the ones we
require explicit opt-in from userspace (e.g. SVE, vPMU). This behavior
is described in the arm64 vCPU feature documentation [1].

The biggest benefit of this approach is that new vCPU features can be
added without a VMM change, as userspace can just treat the registers in
KVM_GET_REG_LIST as an opaque blob of data that needs to be migrated.

I'm willing to wager that the set of users who want to migrate state
from kernel N -> (N - 1) know the exact CPU definition they want to
expose to the guest, and in that case should be using a static set of
feature register values matching their intent.

Beyond the CPU architecture, KVM presents hypercall features to the VM
which userspace can _opt-out_ of on a per-feature basis using the
feature bitmap registers described in [2]. Like the feature ID
registers, we've preallocated a range of indices to be used for
hypercall bitmaps. So if an unexpected bitmap register appears on a new
kernel, userspace should write 0 to it to prevent new features from
silently creeping in.

Prescribing the exact combination of these UAPIs to achieve a
rollback-safe feature set is beyond the scope of the KVM documentation
and should be determined based on the minimum kernel version that needs
to work.

[1]: https://docs.kernel.org/virt/kvm/arm/vcpu-features.html#the-id-registers
[2]: https://docs.kernel.org/virt/kvm/arm/hypercalls.html

> >> I've been bitten before with KVM differences between kernel versions
> >> in the past - where the number of registers that userspace sees has
> >> changed despite being on the same hardware.
> >
> > This is intended behavior, as VMs are initialized to the maximum feature
> > set KVM is able to support. Forward-compatibility for the set of exposed
> > registers is tested, see the get-reg-list selftest.
> 
> I've seen this problem come up as well; if it is clear that this is
> something that KVM expects the VMM to handle if needed, that is fine;
> should we consider "it's tested in a selftest" as a canonical indicator
> of "this is what KVM supports compatibility wise"?

It certainly is the level of compatibility that gets actively tested :)

The canonical reason for this behavior, though, is that KVM/arm64
defaults to the maximum-possible feature set as discussed above.
Cornelia Huck May 10, 2024, 3:11 p.m. UTC | #12
On Wed, May 08 2024, Oliver Upton <oliver.upton@linux.dev> wrote:

> Hi Cornelia,
>
> On Wed, May 08, 2024 at 02:06:36PM +0200, Cornelia Huck wrote:
>> On Thu, May 02 2024, Oliver Upton <oliver.upton@linux.dev> wrote:
>> > I think (1) should only be expected of VMMs that want rollback safety,
>> > i.e. the ability to migrate state back to an older kernel. Our userspace
>> > initializes vCPUs from a fixed set of feature ID register values that
>> > prevents VMs on new kernels from picking up new CPU features.
>> >
>> > It is quite tedious, but necessary as rollback safety is very much a
>> > non-goal of the KVM UAPI.
>> 
>> Depending on your use case, rollback safety might be quite
>> important... have we ever stated exactly which guarantees the KVM UAPI
>> is giving? IOW, can someone implementing a VMM look at a doc and see
>> "oh, if I want to be able to go backwards, I need to be able to deal
>> with x, y, and z coming up on the new kernel"?
>
> The behavior of KVM/arm64 has always been that new VMs get the maximum
> set of vCPU features supported by KVM / hardware modulo the ones we
> require explicit opt-in from userspace (e.g. SVE, vPMU). This behavior
> is described in the arm64 vCPU feature documentation [1].
>
> The biggest benefit of this approach is that new vCPU features can be
> added without a VMM change, as userspace can just treat the registers in
> KVM_GET_REG_LIST as an opaque blob of data that needs to be migrated.

It also needs to actively turn off everything it does not know how to
handle, if migration between different hardware is supposed to be
supported.

>
> I'm willing to wager that the set of users who want to migrate state
> from kernel N -> (N - 1) know the exact CPU definition they want to
> expose to the guest, and in that case should be using a static set of
> feature register values matching their intent.

I think the trouble starts when we introduce additional ranges of
registers that can be controled via that interface -- old userspace will
be able to figure out that there are more ranges available than what it
is aware of, but will have no idea how to handle those additional ranges
to get into a defined state (what is the actual range, for example?)

>
> Beyond the CPU architecture, KVM presents hypercall features to the VM
> which userspace can _opt-out_ of on a per-feature basis using the
> feature bitmap registers described in [2]. Like the feature ID
> registers, we've preallocated a range of indices to be used for
> hypercall bitmaps. So if an unexpected bitmap register appears on a new
> kernel, userspace should write 0 to it to prevent new features from
> silently creeping in.

Hm, the doc says: "The features for the registers that are untouched,
probably because userspace isn’t aware of them, will be exposed as is to
the guest." Seems to indicate that it is not too hard to get this wrong :)

>
> Prescribing the exact combination of these UAPIs to achieve a
> rollback-safe feature set is beyond the scope of the KVM documentation
> and should be determined based on the minimum kernel version that needs
> to work.
>
> [1]: https://docs.kernel.org/virt/kvm/arm/vcpu-features.html#the-id-registers
> [2]: https://docs.kernel.org/virt/kvm/arm/hypercalls.html
>
>> >> I've been bitten before with KVM differences between kernel versions
>> >> in the past - where the number of registers that userspace sees has
>> >> changed despite being on the same hardware.
>> >
>> > This is intended behavior, as VMs are initialized to the maximum feature
>> > set KVM is able to support. Forward-compatibility for the set of exposed
>> > registers is tested, see the get-reg-list selftest.
>> 
>> I've seen this problem come up as well; if it is clear that this is
>> something that KVM expects the VMM to handle if needed, that is fine;
>> should we consider "it's tested in a selftest" as a canonical indicator
>> of "this is what KVM supports compatibility wise"?
>
> It certainly is the level of compatibility that gets actively tested :)

:)

>
> The canonical reason for this behavior, though, is that KVM/arm64
> defaults to the maximum-possible feature set as discussed above.

/me contemplating "reverse" features, but too tired to think this
through on a Friday afternoon.
Oliver Upton May 13, 2024, 9:26 p.m. UTC | #13
On Fri, May 10, 2024 at 05:11:09PM +0200, Cornelia Huck wrote:
> On Wed, May 08 2024, Oliver Upton <oliver.upton@linux.dev> wrote:
> > I'm willing to wager that the set of users who want to migrate state
> > from kernel N -> (N - 1) know the exact CPU definition they want to
> > expose to the guest, and in that case should be using a static set of
> > feature register values matching their intent.
> 
> I think the trouble starts when we introduce additional ranges of
> registers that can be controled via that interface -- old userspace will
> be able to figure out that there are more ranges available than what it
> is aware of, but will have no idea how to handle those additional ranges
> to get into a defined state (what is the actual range, for example?)

Even though the UAPI was designed around supporting multiple ranges, I
do not see this being an issue for quite some time. We already fully
describe the Feature ID register range from the architecture.

> >
> > Beyond the CPU architecture, KVM presents hypercall features to the VM
> > which userspace can _opt-out_ of on a per-feature basis using the
> > feature bitmap registers described in [2]. Like the feature ID
> > registers, we've preallocated a range of indices to be used for
> > hypercall bitmaps. So if an unexpected bitmap register appears on a new
> > kernel, userspace should write 0 to it to prevent new features from
> > silently creeping in.
> 
> Hm, the doc says: "The features for the registers that are untouched,
> probably because userspace isn’t aware of them, will be exposed as is to
> the guest." Seems to indicate that it is not too hard to get this wrong :)

Sure, but keep in mind the full range of possible Feature ID registers is
already known to userspace. Many of these register encodings are reserved,
RAZ to allow for future expansion of the architecture [*]. New registers
will come from one of these RAZ encodings.

If userspace wants to assert complete control over the CPU feature set
exposed to the VM it should use a pre-baked value for every register in
the range Op0=3, Op1=0, CRn=0, CRm={1-8}, Op2={0-8}.

[*] DDI0487J.a Table D18-2

> >
> > The canonical reason for this behavior, though, is that KVM/arm64
> > defaults to the maximum-possible feature set as discussed above.
> 
> /me contemplating "reverse" features, but too tired to think this
> through on a Friday afternoon.
> 

Reverse as in a negative / removed feature?

These tend to appear as negative ID register fields, to imply that the
feature set is less than what's provided for with a value of 0x0. KVM
probably couldn't start deprecating features in the default ID register
values but userspace could probe for features it can opt-out of using
the writable mask + trying to write -1 to a field.

DDI0487J.a D19.1.3 covers this.
Cornelia Huck May 22, 2024, 10:14 a.m. UTC | #14
On Mon, May 13 2024, Oliver Upton <oliver.upton@linux.dev> wrote:

[apologies for the late reply]

> On Fri, May 10, 2024 at 05:11:09PM +0200, Cornelia Huck wrote:
>> On Wed, May 08 2024, Oliver Upton <oliver.upton@linux.dev> wrote:
>> > I'm willing to wager that the set of users who want to migrate state
>> > from kernel N -> (N - 1) know the exact CPU definition they want to
>> > expose to the guest, and in that case should be using a static set of
>> > feature register values matching their intent.
>> 
>> I think the trouble starts when we introduce additional ranges of
>> registers that can be controled via that interface -- old userspace will
>> be able to figure out that there are more ranges available than what it
>> is aware of, but will have no idea how to handle those additional ranges
>> to get into a defined state (what is the actual range, for example?)
>
> Even though the UAPI was designed around supporting multiple ranges, I
> do not see this being an issue for quite some time. We already fully
> describe the Feature ID register range from the architecture.

Yeah, we're probably fine for the time being, but something to keep in
mind, I guess.

>
>> >
>> > Beyond the CPU architecture, KVM presents hypercall features to the VM
>> > which userspace can _opt-out_ of on a per-feature basis using the
>> > feature bitmap registers described in [2]. Like the feature ID
>> > registers, we've preallocated a range of indices to be used for
>> > hypercall bitmaps. So if an unexpected bitmap register appears on a new
>> > kernel, userspace should write 0 to it to prevent new features from
>> > silently creeping in.
>> 
>> Hm, the doc says: "The features for the registers that are untouched,
>> probably because userspace isn’t aware of them, will be exposed as is to
>> the guest." Seems to indicate that it is not too hard to get this wrong :)
>
> Sure, but keep in mind the full range of possible Feature ID registers is
> already known to userspace. Many of these register encodings are reserved,
> RAZ to allow for future expansion of the architecture [*]. New registers
> will come from one of these RAZ encodings.
>
> If userspace wants to assert complete control over the CPU feature set
> exposed to the VM it should use a pre-baked value for every register in
> the range Op0=3, Op1=0, CRn=0, CRm={1-8}, Op2={0-8}.
>
> [*] DDI0487J.a Table D18-2

Might make sense to spell that out?

From 1d952c816d3192a4c2f3d95339aa96d86d3d0406 Mon Sep 17 00:00:00 2001
From: Cornelia Huck <cohuck@redhat.com>
Date: Wed, 22 May 2024 12:01:24 +0200
Subject: [PATCH] KVM: arm64: provide hint about vCPU feature stability

Give userspace some advice on how to achieve a stable feature set.

Signed-off-by: Cornelia Huck <cohuck@redhat.com>
---
 Documentation/virt/kvm/arm/vcpu-features.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/virt/kvm/arm/vcpu-features.rst b/Documentation/virt/kvm/arm/vcpu-features.rst
index f7cc6d8d8b74..01973bf8815d 100644
--- a/Documentation/virt/kvm/arm/vcpu-features.rst
+++ b/Documentation/virt/kvm/arm/vcpu-features.rst
@@ -40,6 +40,10 @@ outlined by the architecture in DDI0487J.a D19.1.3 'Principles of the ID
 scheme for fields in ID register'. KVM does not allow ID register values that
 exceed the capabilities of the system.
 
+As KVM will enable new features by default, userspace that wants to provide a
+stable feature set regardless of the kernel version is advised to supply a
+complete set of values for all of the ID registers.
+
 .. warning::
    It is **strongly recommended** that userspace modify the ID register values
    before accessing the rest of the vCPU's CPU register state. KVM may use the

>
>> >
>> > The canonical reason for this behavior, though, is that KVM/arm64
>> > defaults to the maximum-possible feature set as discussed above.
>> 
>> /me contemplating "reverse" features, but too tired to think this
>> through on a Friday afternoon.
>> 
>
> Reverse as in a negative / removed feature?
>
> These tend to appear as negative ID register fields, to imply that the
> feature set is less than what's provided for with a value of 0x0. KVM
> probably couldn't start deprecating features in the default ID register
> values but userspace could probe for features it can opt-out of using
> the writable mask + trying to write -1 to a field.
>
> DDI0487J.a D19.1.3 covers this.

Yes, that would make the most sense (explicit opt-out from userspace),
the advice from above would still apply.
diff mbox series

Patch

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 0cc289b17665..f306b38ec341 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2116,6 +2116,16 @@  static unsigned int hidden_user_visibility(const struct kvm_vcpu *vcpu,
 	.val = mask,				\
 }
 
+/* sys_reg_desc initialiser for writable AA32 ID registers */
+#define AA32_ID_WRITABLE(name, mask) {		\
+	ID_DESC(name),				\
+	.set_user = set_id_reg,			\
+	.visibility = aa32_id_visibility,	\
+	.reset = kvm_read_sanitised_id_reg,	\
+	.val = mask,				\
+}
+
+
 /*
  * sys_reg_desc initialiser for architecturally unallocated cpufeature ID
  * register with encoding Op0=3, Op1=0, CRn=0, CRm=crm, Op2=op2
@@ -2270,7 +2280,11 @@  static const struct sys_reg_desc sys_reg_descs[] = {
 	AA32_ID_SANITISED(ID_ISAR3_EL1),
 	AA32_ID_SANITISED(ID_ISAR4_EL1),
 	AA32_ID_SANITISED(ID_ISAR5_EL1),
-	AA32_ID_SANITISED(ID_MMFR4_EL1),
+	AA32_ID_WRITABLE(ID_MMFR4_EL1, ID_MMFR4_EL1_LSM |
+				       ID_MMFR4_EL1_HPDS |
+				       ID_MMFR4_EL1_CnP |
+				       ID_MMFR4_EL1_XNX |
+				       ID_MMFR4_EL1_AC2),
 	AA32_ID_SANITISED(ID_ISAR6_EL1),
 
 	/* CRm=3 */