Message ID | E1s2DPv-00AhaA-Ra@rmk-PC.armlinux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] KVM: arm64: allow ID_MMFR4_EL1 to be writable | expand |
Hi Russell, On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > changed their value on the same Neoverse N1 r3p1 hardware which makes > migrating between these kernels on the host problematical. It'd be helpful to expand a bit more on how these fields changed, better yet if we can blame it back to a commit. I'm guessing the only direction of migration you care about is old -> new then? > We already permit changing HPDS in AA64MMFR1_EL1 and CnP in > AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1, > so this patch includes support for that even though it isn't required. > > Discussing with Marc Zygnier, AC2 should also be fine to be writable, typo: Zyngier > even though we don't inject an UNDEF if the guest accesses those > registers. > > The only questionable one is XNX, execute-never control distinction, > which is also in AA64MMFR1_EL1 but we don't allow to be changed there. It is quite odd for us to expose this field to non-nested VMs in the first place, though I suppose we will apply an additional set of restrictions for nested VMs when they come along.
On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote: > Hi Russell, > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > > changed their value on the same Neoverse N1 r3p1 hardware which makes > > migrating between these kernels on the host problematical. > > It'd be helpful to expand a bit more on how these fields changed, better > yet if we can blame it back to a commit. I'm guessing the only direction > of migration you care about is old -> new then? Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110 with our 5.15 kernel. I've been looking at tracking down which commit is responsible but I've come up with nothing that fits. The only change I can see is the FTR definition for MMFR4, but this always included 4:7 (AC2) which changed 0 -> 1. So... no idea what commit caused the change. There are a load of other registers that we need sorting, but this is just a test forray into attempting to solve this. > > > We already permit changing HPDS in AA64MMFR1_EL1 and CnP in > > AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1, > > so this patch includes support for that even though it isn't required. > > > > Discussing with Marc Zygnier, AC2 should also be fine to be writable, > > typo: Zyngier > > > even though we don't inject an UNDEF if the guest accesses those > > registers. > > > > The only questionable one is XNX, execute-never control distinction, > > which is also in AA64MMFR1_EL1 but we don't allow to be changed there. > > It is quite odd for us to expose this field to non-nested VMs in the > first place, though I suppose we will apply an additional set of > restrictions for nested VMs when they come along. Yes, it did strike me as odd, since the description seems to imply that XNX affects EL2, which the VM wouldn't have access to. So I'm not sure why we don't just force it to zero.
On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote: > > Hi Russell, > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > > > changed their value on the same Neoverse N1 r3p1 hardware which makes > > > migrating between these kernels on the host problematical. > > > > It'd be helpful to expand a bit more on how these fields changed, better > > yet if we can blame it back to a commit. I'm guessing the only direction > > of migration you care about is old -> new then? > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110 > with our 5.15 kernel. I've been looking at tracking down which commit > is responsible but I've come up with nothing that fits. > > The only change I can see is the FTR definition for MMFR4, but this > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what > commit caused the change. > > There are a load of other registers that we need sorting, but this > is just a test forray into attempting to solve this. Got it, let me see if I can find it then. Do share that list of problematic registers when you have it, hopefully this isn't the tip of the iceberg... > > > > > We already permit changing HPDS in AA64MMFR1_EL1 and CnP in > > > AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1, > > > so this patch includes support for that even though it isn't required. > > > > > > Discussing with Marc Zygnier, AC2 should also be fine to be writable, > > > > typo: Zyngier > > > > > even though we don't inject an UNDEF if the guest accesses those > > > registers. > > > > > > The only questionable one is XNX, execute-never control distinction, > > > which is also in AA64MMFR1_EL1 but we don't allow to be changed there. > > > > It is quite odd for us to expose this field to non-nested VMs in the > > first place, though I suppose we will apply an additional set of > > restrictions for nested VMs when they come along. > > Yes, it did strike me as odd, since the description seems to imply that > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure > why we don't just force it to zero. Probably because we failed to catch it in the first place and setting to 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate issues with the patch, especially since it is fixing a genuine UAPI breakage in KVM.
On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote: > > > Hi Russell, > > > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes > > > > migrating between these kernels on the host problematical. > > > > > > It'd be helpful to expand a bit more on how these fields changed, better > > > yet if we can blame it back to a commit. I'm guessing the only direction > > > of migration you care about is old -> new then? > > > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110 > > with our 5.15 kernel. I've been looking at tracking down which commit > > is responsible but I've come up with nothing that fits. > > > > The only change I can see is the FTR definition for MMFR4, but this > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what > > commit caused the change. > > > > There are a load of other registers that we need sorting, but this > > is just a test forray into attempting to solve this. > > Got it, let me see if I can find it then. Do share that list of > problematic registers when you have it, hopefully this isn't the tip of > the iceberg... There unfortunately is an iceberg, but hopefully it isn't big enough to sink a ship! Besides ID_MMFR4_EL1, here are the other differences we've identified. Note that these are Oracle's UEK kernels, so based on stable kernel branches. Register Field 5.4.x 5.15.x ID_PFR0_EL1 CSV2 0 1 ID_ISAR6_EL1 DP 0 1 ID_PFR2_EL1 SSBS 0 1 CSV3 0 1 ID_AA64DFR0_EL1 PMSVer 1 0 DebugVer 8 6 ID_AA64MMFR1_EL1 XNX 0 1 ID_AA64MMFR2_EL1 EVT 0 1 KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 0x12 0 I think some of these differences are due to Marc's removal of the WA2 code in commit 29e8910a566a ("KVM: arm64: Simplify handling of ARCH_WORKAROUND_2"). The WA2 register for example has changed from avail/enabled to not_avail.
On Wed, May 01, 2024 at 08:51:15PM +0100, Russell King (Oracle) wrote: > On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: > > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote: > > > > Hi Russell, > > > > > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes > > > > > migrating between these kernels on the host problematical. > > > > > > > > It'd be helpful to expand a bit more on how these fields changed, better > > > > yet if we can blame it back to a commit. I'm guessing the only direction > > > > of migration you care about is old -> new then? > > > > > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110 > > > with our 5.15 kernel. I've been looking at tracking down which commit > > > is responsible but I've come up with nothing that fits. > > > > > > The only change I can see is the FTR definition for MMFR4, but this > > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what > > > commit caused the change. > > > > > > There are a load of other registers that we need sorting, but this > > > is just a test forray into attempting to solve this. > > > > Got it, let me see if I can find it then. Do share that list of > > problematic registers when you have it, hopefully this isn't the tip of > > the iceberg... > > There unfortunately is an iceberg, but hopefully it isn't big enough to > sink a ship! > > Besides ID_MMFR4_EL1, here are the other differences we've identified. > Note that these are Oracle's UEK kernels, so based on stable kernel > branches. > > Register Field 5.4.x 5.15.x > ID_PFR0_EL1 CSV2 0 1 > ID_ISAR6_EL1 DP 0 1 > ID_PFR2_EL1 SSBS 0 1 > CSV3 0 1 > ID_AA64DFR0_EL1 PMSVer 1 0 > DebugVer 8 6 > ID_AA64MMFR1_EL1 XNX 0 1 > ID_AA64MMFR2_EL1 EVT 0 1 > KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 > 0x12 0 I'm finding sys_regs.c very unintuitive for working out what we allow to be written, because it's all coded in negative-logic. By that I mean the mask values are all ~(what-we-don't-allow) rather than a positive this-is-what-we-allow. So I've ended up creating a table, looking up the registers and working out what's read-only and what's read-write. From that, I can see (for example) that from the ISAR6_EL1 register, the field names appear in the AA64ISAR0_EL1 and AA64ISAR1_EL1 registers, and all non-res0 fields are writable. It is therefore my intention to submit a patch doing this: - AA32_ID_SANITISED(ID_ISAR6_EL1), + AA32_ID_WRITABLE(ID_ISAR6_EL1, ID_ISAR6_EL1_I8MM | + ID_ISAR6_EL1_BF16 | + ID_ISAR6_EL1_SPECRES | + ID_ISAR6_EL1_SB | + ID_ISAR6_EL1_FHM | + ID_ISAR6_EL1_DP | + ID_ISAR6_EL1_JSCVT), which, like the MMFR4 patch, uses positive logic for what we allow to be changed, even though this is equivalent to ~ID_ISAR6_EL1_RES0 which tells us nothing without either looking up in the spec, or looking at the generated sysreg-defs.h to figure it out.
On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > > Yes, it did strike me as odd, since the description seems to imply that > > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure > > why we don't just force it to zero. > > Probably because we failed to catch it in the first place and setting to > 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate > issues with the patch, especially since it is fixing a genuine UAPI > breakage in KVM. I think the only two ways around this would be to: 1) teach QEMU about the contents of these registers, with which fields in these registers can be ignored when reloading a VMs context. 2) allow userspace to write to the XNX field such that it can be set to values seen with previous kernels (thus allowing at least one- way migration.) (1) has the advantage that reloading a VM state on older vs newer kernels can work in either direction, whereas (2) would only work for state saved on an older kernel loaded onto a newer kernel. I've been bitten before with KVM differences between kernel versions in the past - where the number of registers that userspace sees has changed despite being on the same hardware. I'm now wondering what testing goes on to validate this part of the UAPI across different kernel versions on the same hardware. Surely, given the impact of these changing value (a failure of VMs), we should be aware whenever any of these registers are different from one kernel to another on the same hardware? It's fine if all one cares about is starting fresh VMs on each host kernel reboot, but any use case beyond that seems to be fragile. I did knock up a test program that dumped the list of registers so that one could trivially diff the output between various kernels. Maybe I need to extend that to dump the register values themselves, and then maybe we need to find a way to get some kind of automated testing setup to highlight differences. (something maybe kernelci could add?)
On Thu, 02 May 2024 11:50:10 +0100, "Russell King (Oracle)" <linux@armlinux.org.uk> wrote: > > On Wed, May 01, 2024 at 08:51:15PM +0100, Russell King (Oracle) wrote: > > On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: > > > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > > > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote: > > > > > Hi Russell, > > > > > > > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > > > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > > > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes > > > > > > migrating between these kernels on the host problematical. > > > > > > > > > > It'd be helpful to expand a bit more on how these fields changed, better > > > > > yet if we can blame it back to a commit. I'm guessing the only direction > > > > > of migration you care about is old -> new then? > > > > > > > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110 > > > > with our 5.15 kernel. I've been looking at tracking down which commit > > > > is responsible but I've come up with nothing that fits. > > > > > > > > The only change I can see is the FTR definition for MMFR4, but this > > > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what > > > > commit caused the change. > > > > > > > > There are a load of other registers that we need sorting, but this > > > > is just a test forray into attempting to solve this. > > > > > > Got it, let me see if I can find it then. Do share that list of > > > problematic registers when you have it, hopefully this isn't the tip of > > > the iceberg... > > > > There unfortunately is an iceberg, but hopefully it isn't big enough to > > sink a ship! > > > > Besides ID_MMFR4_EL1, here are the other differences we've identified. > > Note that these are Oracle's UEK kernels, so based on stable kernel > > branches. > > > > Register Field 5.4.x 5.15.x > > ID_PFR0_EL1 CSV2 0 1 > > ID_ISAR6_EL1 DP 0 1 > > ID_PFR2_EL1 SSBS 0 1 > > CSV3 0 1 > > ID_AA64DFR0_EL1 PMSVer 1 0 > > DebugVer 8 6 > > ID_AA64MMFR1_EL1 XNX 0 1 > > ID_AA64MMFR2_EL1 EVT 0 1 > > KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 > > 0x12 0 > > I'm finding sys_regs.c very unintuitive for working out what we allow > to be written, because it's all coded in negative-logic. By that I mean > the mask values are all ~(what-we-don't-allow) rather than a positive > this-is-what-we-allow. So I've ended up creating a table, looking up > the registers and working out what's read-only and what's read-write. [...] Using positive or negative logic doesn't really have any impact on the result. It often is a matter of concisely expressing what is allowed or not. There is also the fact that a lot of the KVM code now checks for "feature downgrade" and enforces it. Construct such as: if (!kvm_has_feat(kvm, ID_AA64ISAR0_EL1, TLB, OS)) kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_TLBIRVAALE1OS| HFGITR_EL2_TLBIRVALE1OS | HFGITR_EL2_TLBIRVAAE1OS | HFGITR_EL2_TLBIRVAE1OS | HFGITR_EL2_TLBIVAALE1OS | HFGITR_EL2_TLBIVALE1OS | HFGITR_EL2_TLBIVAAE1OS | HFGITR_EL2_TLBIASIDE1OS | HFGITR_EL2_TLBIVAE1OS | HFGITR_EL2_TLBIVMALLE1OS); use negative logic by expressing what we don't want to enable. In the end, consistency matters. M.
On Thu, May 02, 2024 at 03:40:38PM +0100, Russell King (Oracle) wrote: > On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: > > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > > > Yes, it did strike me as odd, since the description seems to imply that > > > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure > > > why we don't just force it to zero. > > > > Probably because we failed to catch it in the first place and setting to > > 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate > > issues with the patch, especially since it is fixing a genuine UAPI > > breakage in KVM. > > I think the only two ways around this would be to: > > 1) teach QEMU about the contents of these registers, with which fields > in these registers can be ignored when reloading a VMs context. > > 2) allow userspace to write to the XNX field such that it can be set > to values seen with previous kernels (thus allowing at least one- > way migration.) > > (1) has the advantage that reloading a VM state on older vs newer > kernels can work in either direction, whereas (2) would only work > for state saved on an older kernel loaded onto a newer kernel. Yeah, so this is something that has affected my employer as well. I think (1) should only be expected of VMMs that want rollback safety, i.e. the ability to migrate state back to an older kernel. Our userspace initializes vCPUs from a fixed set of feature ID register values that prevents VMs on new kernels from picking up new CPU features. It is quite tedious, but necessary as rollback safety is very much a non-goal of the KVM UAPI. OTOH, in cases where KVM screws up and breaks UAPI, the kernel needs to do something special to accept the previously-advertised state even if it were nonsensical. For example, there was a bug where KVM advertised an IMP DEF PMU to VMs even though the only thing KVM virtualizes is PMUv3. We fixed it in commit f90f9360c3d7 ("KVM: arm64: Rewrite IMPDEF PMU version as NI") by accepting the old value in the ioctl and changing the field to NI internally. I dislike these sort of hacks, but when we're caught between upholding UAPI and the architecture it seems to be the best option. I wonder if an approach similar to this would be sufficient to address the SPE change that you noticed. > I've been bitten before with KVM differences between kernel versions > in the past - where the number of registers that userspace sees has > changed despite being on the same hardware. This is intended behavior, as VMs are initialized to the maximum feature set KVM is able to support. Forward-compatibility for the set of exposed registers is tested, see the get-reg-list selftest. > I'm now wondering what > testing goes on to validate this part of the UAPI across different > kernel versions on the same hardware. We may've been a bit more relaxed in the past with this, but in recent history we've been careful about preserving UAPI. On top of that, we now have some generalized infrastructure for dealing with these things by way of the 'writable' / mutable ID register work. Although it isn't precisely what you're looking for, the set_id_regs selftest ensures we at least accept features that are valid for the underlying HW platform and explicitly tests downgrades. > I did knock up a test program that dumped the list of registers so > that one could trivially diff the output between various kernels. > Maybe I need to extend that to dump the register values themselves, > and then maybe we need to find a way to get some kind of automated > testing setup to highlight differences. (something maybe kernelci > could add?) Given the absolutely massive test matrix of implementations and kernel versions I question our ability to support this. Additionally the only thing we'd care about upstream is the unsafe removal of a feature. Nevertheless, a starting point could be to drop some pr_info() into set_id_regs.c to print the starting value of the ID registers which could be diffed.
On Thu, May 02, 2024 at 04:23:10PM +0100, Marc Zyngier wrote: > On Thu, 02 May 2024 11:50:10 +0100, > "Russell King (Oracle)" <linux@armlinux.org.uk> wrote: > > > > On Wed, May 01, 2024 at 08:51:15PM +0100, Russell King (Oracle) wrote: > > > On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: > > > > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: > > > > > On Wed, May 01, 2024 at 05:57:20PM +0000, Oliver Upton wrote: > > > > > > Hi Russell, > > > > > > > > > > > > On Wed, May 01, 2024 at 06:06:51PM +0100, Russell King (Oracle) wrote: > > > > > > > Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 > > > > > > > changed their value on the same Neoverse N1 r3p1 hardware which makes > > > > > > > migrating between these kernels on the host problematical. > > > > > > > > > > > > It'd be helpful to expand a bit more on how these fields changed, better > > > > > > yet if we can blame it back to a commit. I'm guessing the only direction > > > > > > of migration you care about is old -> new then? > > > > > > > > > > Yes. For MMFR4_EL1, we see 0 with our 5.4 based kernel, and 0x21110 > > > > > with our 5.15 kernel. I've been looking at tracking down which commit > > > > > is responsible but I've come up with nothing that fits. > > > > > > > > > > The only change I can see is the FTR definition for MMFR4, but this > > > > > always included 4:7 (AC2) which changed 0 -> 1. So... no idea what > > > > > commit caused the change. > > > > > > > > > > There are a load of other registers that we need sorting, but this > > > > > is just a test forray into attempting to solve this. > > > > > > > > Got it, let me see if I can find it then. Do share that list of > > > > problematic registers when you have it, hopefully this isn't the tip of > > > > the iceberg... > > > > > > There unfortunately is an iceberg, but hopefully it isn't big enough to > > > sink a ship! > > > > > > Besides ID_MMFR4_EL1, here are the other differences we've identified. > > > Note that these are Oracle's UEK kernels, so based on stable kernel > > > branches. > > > > > > Register Field 5.4.x 5.15.x > > > ID_PFR0_EL1 CSV2 0 1 > > > ID_ISAR6_EL1 DP 0 1 > > > ID_PFR2_EL1 SSBS 0 1 > > > CSV3 0 1 > > > ID_AA64DFR0_EL1 PMSVer 1 0 > > > DebugVer 8 6 > > > ID_AA64MMFR1_EL1 XNX 0 1 > > > ID_AA64MMFR2_EL1 EVT 0 1 > > > KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 > > > 0x12 0 > > > > I'm finding sys_regs.c very unintuitive for working out what we allow > > to be written, because it's all coded in negative-logic. By that I mean > > the mask values are all ~(what-we-don't-allow) rather than a positive > > this-is-what-we-allow. So I've ended up creating a table, looking up > > the registers and working out what's read-only and what's read-write. > > [...] > > Using positive or negative logic doesn't really have any impact on the > result. It often is a matter of concisely expressing what is allowed > or not. > > There is also the fact that a lot of the KVM code now checks for > "feature downgrade" and enforces it. Construct such as: > > if (!kvm_has_feat(kvm, ID_AA64ISAR0_EL1, TLB, OS)) > kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_TLBIRVAALE1OS| > HFGITR_EL2_TLBIRVALE1OS | > HFGITR_EL2_TLBIRVAAE1OS | > HFGITR_EL2_TLBIRVAE1OS | > HFGITR_EL2_TLBIVAALE1OS | > HFGITR_EL2_TLBIVALE1OS | > HFGITR_EL2_TLBIVAAE1OS | > HFGITR_EL2_TLBIASIDE1OS | > HFGITR_EL2_TLBIVAE1OS | > HFGITR_EL2_TLBIVMALLE1OS); > > use negative logic by expressing what we don't want to enable. > > In the end, consistency matters. Is that a request to change my patch? I'm unclear whether anyone wants changes to it.
On Thu, May 02 2024, Oliver Upton <oliver.upton@linux.dev> wrote: > On Thu, May 02, 2024 at 03:40:38PM +0100, Russell King (Oracle) wrote: >> On Wed, May 01, 2024 at 06:59:17PM +0000, Oliver Upton wrote: >> > On Wed, May 01, 2024 at 07:08:05PM +0100, Russell King (Oracle) wrote: >> > > Yes, it did strike me as odd, since the description seems to imply that >> > > XNX affects EL2, which the VM wouldn't have access to. So I'm not sure >> > > why we don't just force it to zero. >> > >> > Probably because we failed to catch it in the first place and setting to >> > 0 now would be even more UAPI breakage. Meh :-/ I don't see any immediate >> > issues with the patch, especially since it is fixing a genuine UAPI >> > breakage in KVM. >> >> I think the only two ways around this would be to: >> >> 1) teach QEMU about the contents of these registers, with which fields >> in these registers can be ignored when reloading a VMs context. >> >> 2) allow userspace to write to the XNX field such that it can be set >> to values seen with previous kernels (thus allowing at least one- >> way migration.) >> >> (1) has the advantage that reloading a VM state on older vs newer >> kernels can work in either direction, whereas (2) would only work >> for state saved on an older kernel loaded onto a newer kernel. > > Yeah, so this is something that has affected my employer as well. > > I think (1) should only be expected of VMMs that want rollback safety, > i.e. the ability to migrate state back to an older kernel. Our userspace > initializes vCPUs from a fixed set of feature ID register values that > prevents VMs on new kernels from picking up new CPU features. > > It is quite tedious, but necessary as rollback safety is very much a > non-goal of the KVM UAPI. Depending on your use case, rollback safety might be quite important... have we ever stated exactly which guarantees the KVM UAPI is giving? IOW, can someone implementing a VMM look at a doc and see "oh, if I want to be able to go backwards, I need to be able to deal with x, y, and z coming up on the new kernel"? > > OTOH, in cases where KVM screws up and breaks UAPI, the kernel needs to > do something special to accept the previously-advertised state even if > it were nonsensical. > > For example, there was a bug where KVM advertised an IMP DEF PMU to VMs > even though the only thing KVM virtualizes is PMUv3. We fixed it in > commit f90f9360c3d7 ("KVM: arm64: Rewrite IMPDEF PMU version as NI") by > accepting the old value in the ioctl and changing the field to NI > internally. > > I dislike these sort of hacks, but when we're caught between upholding > UAPI and the architecture it seems to be the best option. I wonder if an > approach similar to this would be sufficient to address the SPE change > that you noticed. > >> I've been bitten before with KVM differences between kernel versions >> in the past - where the number of registers that userspace sees has >> changed despite being on the same hardware. > > This is intended behavior, as VMs are initialized to the maximum feature > set KVM is able to support. Forward-compatibility for the set of exposed > registers is tested, see the get-reg-list selftest. I've seen this problem come up as well; if it is clear that this is something that KVM expects the VMM to handle if needed, that is fine; should we consider "it's tested in a selftest" as a canonical indicator of "this is what KVM supports compatibility wise"?
Hi Cornelia, On Wed, May 08, 2024 at 02:06:36PM +0200, Cornelia Huck wrote: > On Thu, May 02 2024, Oliver Upton <oliver.upton@linux.dev> wrote: > > I think (1) should only be expected of VMMs that want rollback safety, > > i.e. the ability to migrate state back to an older kernel. Our userspace > > initializes vCPUs from a fixed set of feature ID register values that > > prevents VMs on new kernels from picking up new CPU features. > > > > It is quite tedious, but necessary as rollback safety is very much a > > non-goal of the KVM UAPI. > > Depending on your use case, rollback safety might be quite > important... have we ever stated exactly which guarantees the KVM UAPI > is giving? IOW, can someone implementing a VMM look at a doc and see > "oh, if I want to be able to go backwards, I need to be able to deal > with x, y, and z coming up on the new kernel"? The behavior of KVM/arm64 has always been that new VMs get the maximum set of vCPU features supported by KVM / hardware modulo the ones we require explicit opt-in from userspace (e.g. SVE, vPMU). This behavior is described in the arm64 vCPU feature documentation [1]. The biggest benefit of this approach is that new vCPU features can be added without a VMM change, as userspace can just treat the registers in KVM_GET_REG_LIST as an opaque blob of data that needs to be migrated. I'm willing to wager that the set of users who want to migrate state from kernel N -> (N - 1) know the exact CPU definition they want to expose to the guest, and in that case should be using a static set of feature register values matching their intent. Beyond the CPU architecture, KVM presents hypercall features to the VM which userspace can _opt-out_ of on a per-feature basis using the feature bitmap registers described in [2]. Like the feature ID registers, we've preallocated a range of indices to be used for hypercall bitmaps. So if an unexpected bitmap register appears on a new kernel, userspace should write 0 to it to prevent new features from silently creeping in. Prescribing the exact combination of these UAPIs to achieve a rollback-safe feature set is beyond the scope of the KVM documentation and should be determined based on the minimum kernel version that needs to work. [1]: https://docs.kernel.org/virt/kvm/arm/vcpu-features.html#the-id-registers [2]: https://docs.kernel.org/virt/kvm/arm/hypercalls.html > >> I've been bitten before with KVM differences between kernel versions > >> in the past - where the number of registers that userspace sees has > >> changed despite being on the same hardware. > > > > This is intended behavior, as VMs are initialized to the maximum feature > > set KVM is able to support. Forward-compatibility for the set of exposed > > registers is tested, see the get-reg-list selftest. > > I've seen this problem come up as well; if it is clear that this is > something that KVM expects the VMM to handle if needed, that is fine; > should we consider "it's tested in a selftest" as a canonical indicator > of "this is what KVM supports compatibility wise"? It certainly is the level of compatibility that gets actively tested :) The canonical reason for this behavior, though, is that KVM/arm64 defaults to the maximum-possible feature set as discussed above.
On Wed, May 08 2024, Oliver Upton <oliver.upton@linux.dev> wrote: > Hi Cornelia, > > On Wed, May 08, 2024 at 02:06:36PM +0200, Cornelia Huck wrote: >> On Thu, May 02 2024, Oliver Upton <oliver.upton@linux.dev> wrote: >> > I think (1) should only be expected of VMMs that want rollback safety, >> > i.e. the ability to migrate state back to an older kernel. Our userspace >> > initializes vCPUs from a fixed set of feature ID register values that >> > prevents VMs on new kernels from picking up new CPU features. >> > >> > It is quite tedious, but necessary as rollback safety is very much a >> > non-goal of the KVM UAPI. >> >> Depending on your use case, rollback safety might be quite >> important... have we ever stated exactly which guarantees the KVM UAPI >> is giving? IOW, can someone implementing a VMM look at a doc and see >> "oh, if I want to be able to go backwards, I need to be able to deal >> with x, y, and z coming up on the new kernel"? > > The behavior of KVM/arm64 has always been that new VMs get the maximum > set of vCPU features supported by KVM / hardware modulo the ones we > require explicit opt-in from userspace (e.g. SVE, vPMU). This behavior > is described in the arm64 vCPU feature documentation [1]. > > The biggest benefit of this approach is that new vCPU features can be > added without a VMM change, as userspace can just treat the registers in > KVM_GET_REG_LIST as an opaque blob of data that needs to be migrated. It also needs to actively turn off everything it does not know how to handle, if migration between different hardware is supposed to be supported. > > I'm willing to wager that the set of users who want to migrate state > from kernel N -> (N - 1) know the exact CPU definition they want to > expose to the guest, and in that case should be using a static set of > feature register values matching their intent. I think the trouble starts when we introduce additional ranges of registers that can be controled via that interface -- old userspace will be able to figure out that there are more ranges available than what it is aware of, but will have no idea how to handle those additional ranges to get into a defined state (what is the actual range, for example?) > > Beyond the CPU architecture, KVM presents hypercall features to the VM > which userspace can _opt-out_ of on a per-feature basis using the > feature bitmap registers described in [2]. Like the feature ID > registers, we've preallocated a range of indices to be used for > hypercall bitmaps. So if an unexpected bitmap register appears on a new > kernel, userspace should write 0 to it to prevent new features from > silently creeping in. Hm, the doc says: "The features for the registers that are untouched, probably because userspace isn’t aware of them, will be exposed as is to the guest." Seems to indicate that it is not too hard to get this wrong :) > > Prescribing the exact combination of these UAPIs to achieve a > rollback-safe feature set is beyond the scope of the KVM documentation > and should be determined based on the minimum kernel version that needs > to work. > > [1]: https://docs.kernel.org/virt/kvm/arm/vcpu-features.html#the-id-registers > [2]: https://docs.kernel.org/virt/kvm/arm/hypercalls.html > >> >> I've been bitten before with KVM differences between kernel versions >> >> in the past - where the number of registers that userspace sees has >> >> changed despite being on the same hardware. >> > >> > This is intended behavior, as VMs are initialized to the maximum feature >> > set KVM is able to support. Forward-compatibility for the set of exposed >> > registers is tested, see the get-reg-list selftest. >> >> I've seen this problem come up as well; if it is clear that this is >> something that KVM expects the VMM to handle if needed, that is fine; >> should we consider "it's tested in a selftest" as a canonical indicator >> of "this is what KVM supports compatibility wise"? > > It certainly is the level of compatibility that gets actively tested :) :) > > The canonical reason for this behavior, though, is that KVM/arm64 > defaults to the maximum-possible feature set as discussed above. /me contemplating "reverse" features, but too tired to think this through on a Friday afternoon.
On Fri, May 10, 2024 at 05:11:09PM +0200, Cornelia Huck wrote: > On Wed, May 08 2024, Oliver Upton <oliver.upton@linux.dev> wrote: > > I'm willing to wager that the set of users who want to migrate state > > from kernel N -> (N - 1) know the exact CPU definition they want to > > expose to the guest, and in that case should be using a static set of > > feature register values matching their intent. > > I think the trouble starts when we introduce additional ranges of > registers that can be controled via that interface -- old userspace will > be able to figure out that there are more ranges available than what it > is aware of, but will have no idea how to handle those additional ranges > to get into a defined state (what is the actual range, for example?) Even though the UAPI was designed around supporting multiple ranges, I do not see this being an issue for quite some time. We already fully describe the Feature ID register range from the architecture. > > > > Beyond the CPU architecture, KVM presents hypercall features to the VM > > which userspace can _opt-out_ of on a per-feature basis using the > > feature bitmap registers described in [2]. Like the feature ID > > registers, we've preallocated a range of indices to be used for > > hypercall bitmaps. So if an unexpected bitmap register appears on a new > > kernel, userspace should write 0 to it to prevent new features from > > silently creeping in. > > Hm, the doc says: "The features for the registers that are untouched, > probably because userspace isn’t aware of them, will be exposed as is to > the guest." Seems to indicate that it is not too hard to get this wrong :) Sure, but keep in mind the full range of possible Feature ID registers is already known to userspace. Many of these register encodings are reserved, RAZ to allow for future expansion of the architecture [*]. New registers will come from one of these RAZ encodings. If userspace wants to assert complete control over the CPU feature set exposed to the VM it should use a pre-baked value for every register in the range Op0=3, Op1=0, CRn=0, CRm={1-8}, Op2={0-8}. [*] DDI0487J.a Table D18-2 > > > > The canonical reason for this behavior, though, is that KVM/arm64 > > defaults to the maximum-possible feature set as discussed above. > > /me contemplating "reverse" features, but too tired to think this > through on a Friday afternoon. > Reverse as in a negative / removed feature? These tend to appear as negative ID register fields, to imply that the feature set is less than what's provided for with a value of 0x0. KVM probably couldn't start deprecating features in the default ID register values but userspace could probe for features it can opt-out of using the writable mask + trying to write -1 to a field. DDI0487J.a D19.1.3 covers this.
On Mon, May 13 2024, Oliver Upton <oliver.upton@linux.dev> wrote: [apologies for the late reply] > On Fri, May 10, 2024 at 05:11:09PM +0200, Cornelia Huck wrote: >> On Wed, May 08 2024, Oliver Upton <oliver.upton@linux.dev> wrote: >> > I'm willing to wager that the set of users who want to migrate state >> > from kernel N -> (N - 1) know the exact CPU definition they want to >> > expose to the guest, and in that case should be using a static set of >> > feature register values matching their intent. >> >> I think the trouble starts when we introduce additional ranges of >> registers that can be controled via that interface -- old userspace will >> be able to figure out that there are more ranges available than what it >> is aware of, but will have no idea how to handle those additional ranges >> to get into a defined state (what is the actual range, for example?) > > Even though the UAPI was designed around supporting multiple ranges, I > do not see this being an issue for quite some time. We already fully > describe the Feature ID register range from the architecture. Yeah, we're probably fine for the time being, but something to keep in mind, I guess. > >> > >> > Beyond the CPU architecture, KVM presents hypercall features to the VM >> > which userspace can _opt-out_ of on a per-feature basis using the >> > feature bitmap registers described in [2]. Like the feature ID >> > registers, we've preallocated a range of indices to be used for >> > hypercall bitmaps. So if an unexpected bitmap register appears on a new >> > kernel, userspace should write 0 to it to prevent new features from >> > silently creeping in. >> >> Hm, the doc says: "The features for the registers that are untouched, >> probably because userspace isn’t aware of them, will be exposed as is to >> the guest." Seems to indicate that it is not too hard to get this wrong :) > > Sure, but keep in mind the full range of possible Feature ID registers is > already known to userspace. Many of these register encodings are reserved, > RAZ to allow for future expansion of the architecture [*]. New registers > will come from one of these RAZ encodings. > > If userspace wants to assert complete control over the CPU feature set > exposed to the VM it should use a pre-baked value for every register in > the range Op0=3, Op1=0, CRn=0, CRm={1-8}, Op2={0-8}. > > [*] DDI0487J.a Table D18-2 Might make sense to spell that out? From 1d952c816d3192a4c2f3d95339aa96d86d3d0406 Mon Sep 17 00:00:00 2001 From: Cornelia Huck <cohuck@redhat.com> Date: Wed, 22 May 2024 12:01:24 +0200 Subject: [PATCH] KVM: arm64: provide hint about vCPU feature stability Give userspace some advice on how to achieve a stable feature set. Signed-off-by: Cornelia Huck <cohuck@redhat.com> --- Documentation/virt/kvm/arm/vcpu-features.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/virt/kvm/arm/vcpu-features.rst b/Documentation/virt/kvm/arm/vcpu-features.rst index f7cc6d8d8b74..01973bf8815d 100644 --- a/Documentation/virt/kvm/arm/vcpu-features.rst +++ b/Documentation/virt/kvm/arm/vcpu-features.rst @@ -40,6 +40,10 @@ outlined by the architecture in DDI0487J.a D19.1.3 'Principles of the ID scheme for fields in ID register'. KVM does not allow ID register values that exceed the capabilities of the system. +As KVM will enable new features by default, userspace that wants to provide a +stable feature set regardless of the kernel version is advised to supply a +complete set of values for all of the ID registers. + .. warning:: It is **strongly recommended** that userspace modify the ID register values before accessing the rest of the vCPU's CPU register state. KVM may use the > >> > >> > The canonical reason for this behavior, though, is that KVM/arm64 >> > defaults to the maximum-possible feature set as discussed above. >> >> /me contemplating "reverse" features, but too tired to think this >> through on a Friday afternoon. >> > > Reverse as in a negative / removed feature? > > These tend to appear as negative ID register fields, to imply that the > feature set is less than what's provided for with a value of 0x0. KVM > probably couldn't start deprecating features in the default ID register > values but userspace could probe for features it can opt-out of using > the writable mask + trying to write -1 to a field. > > DDI0487J.a D19.1.3 covers this. Yes, that would make the most sense (explicit opt-out from userspace), the advice from above would still apply.
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 0cc289b17665..f306b38ec341 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -2116,6 +2116,16 @@ static unsigned int hidden_user_visibility(const struct kvm_vcpu *vcpu, .val = mask, \ } +/* sys_reg_desc initialiser for writable AA32 ID registers */ +#define AA32_ID_WRITABLE(name, mask) { \ + ID_DESC(name), \ + .set_user = set_id_reg, \ + .visibility = aa32_id_visibility, \ + .reset = kvm_read_sanitised_id_reg, \ + .val = mask, \ +} + + /* * sys_reg_desc initialiser for architecturally unallocated cpufeature ID * register with encoding Op0=3, Op1=0, CRn=0, CRm=crm, Op2=op2 @@ -2270,7 +2280,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { AA32_ID_SANITISED(ID_ISAR3_EL1), AA32_ID_SANITISED(ID_ISAR4_EL1), AA32_ID_SANITISED(ID_ISAR5_EL1), - AA32_ID_SANITISED(ID_MMFR4_EL1), + AA32_ID_WRITABLE(ID_MMFR4_EL1, ID_MMFR4_EL1_LSM | + ID_MMFR4_EL1_HPDS | + ID_MMFR4_EL1_CnP | + ID_MMFR4_EL1_XNX | + ID_MMFR4_EL1_AC2), AA32_ID_SANITISED(ID_ISAR6_EL1), /* CRm=3 */
Between 5.4 and 5.15, the guests view of HPDS, CnP, XNX and AC2 changed their value on the same Neoverse N1 r3p1 hardware which makes migrating between these kernels on the host problematical. We already permit changing HPDS in AA64MMFR1_EL1 and CnP in AA64MMFR2_EL1. We also allow LSM as we allow that in AA64MMFR2_EL1, so this patch includes support for that even though it isn't required. Discussing with Marc Zygnier, AC2 should also be fine to be writable, even though we don't inject an UNDEF if the guest accesses those registers. The only questionable one is XNX, execute-never control distinction, which is also in AA64MMFR1_EL1 but we don't allow to be changed there. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> --- arch/arm64/kvm/sys_regs.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)