Message ID | 20190416091948.4420-1-kai.huang@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | kvm: x86: Fix several SPTE mask calculation errors caused by MKTME | expand |
On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote: > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced > due to several high bits of physical address are repurposed for memory > encryption. To honor such behavior those repurposed bits are reduced from > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus > boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported > by CPUID anymore. This neglects to mention the most relevant tidbit of information in terms of justification for this patch: the number of bits stolen for MKTME is programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of kernel support. > KVM uses boot_cpu_data.x86_phys_bits to calculate several SPTE masks > based on assumption that: 1) boot_cpu_data.x86_phys_bits equals to > physical address bits reported by CPUID -- this is used to check CPU has > reserved bits when KVM calculates shadow_mmio_{value|mask}; and whether > shadow_nonpresent_or_rsvd_mask should be setup (KVM assumes L1TF is not > present if CPU has 52 physical address bits); 2) if it is smaller than > 52, bits [x86_phys_bits, 51] are reserved bits. > > With Intel MKTME or AMD SME/SEV above assumption is not valid any more, > especially when calculating reserved bits with Intel MKTME, since Intel > MKTME treats the reduced bits as 'keyID', thus they are not reduced > bits, therefore boot_cpu_data.x86_phys_bits cannot be used to calcualte > reserved bits anymore, although we can still use it for AMD SME/SEV > since SME/SEV treats the reduced bits differently -- they are treated as > reserved bits, the same as other reserved bits in page table entity [1]. > > Fix by introducing a new 'shadow_first_rsvd_bit' variable in kvm x86 MMU > code to store the actual position of reserved bits -- for Intel MKTME, > it equals to physical address reported by CPUID, and for AMD SME/SEV, it > is boot_cpu_data.x86_phys_bits. And in reserved bits related calculation > it is used instead of boot_cpu_data.x86_phys_bits. Some other code > changes too to make code more reasonable, ie, kvm_set_mmio_spte_mask is > moved to x86/kvm/mmu.c from x86/kvm/x86.c to use shadow_first_rsvd_bit; > shadow_nonpresent_or_rsvd_mask calculation logic is slightly changed, > based on the fact that only Intel CPU is impacted by L1TF, so that KVM > can use shadow_first_rsvd_bit to check whether KVM should set > shadow_nonpresent_or_rsvd_mask or not. > > Note that for the physical address bits reported to guest should remain > unchanged -- KVM should report physical address reported by CPUID to > guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME, > there's no harm if guest sets up 'keyID' bits in guest page table (since > MKTME only works at physical address level), and KVM doesn't even expose > MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it > should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore > KVM should still reports physcial address reported by CPUID to guest. > > [1] Section 7.10.1 Determining Support for Secure Memory Encryption, > AMD Architecture Programmer's Manual Volume 2: System Programming). > Signed-off-by: Kai Huang <kai.huang@linux.intel.com> > --- > arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++---------- > arch/x86/kvm/mmu.h | 1 + > arch/x86/kvm/x86.c | 29 ---------- > 3 files changed, 125 insertions(+), 57 deletions(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index e10962dfc203..3add7283980e 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5; > */ > static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; > > +/* > + * The position of first reserved bit. If it equals to 52, then CPU doesn't > + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are > + * reserved bits. > + * > + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of > + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce > + * physical address bits, and the reduced bits are taken away from > + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD > + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them > + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved > + * bits, thus physical address bits reported by CPUID cannot be used to > + * determine reserved bits position either. > + */ No need to rehash the justification for the variable, it's sufficient to state *what* the variable tracks. And the whole first paragraph can be dropped if the variable is renamed. > +static u64 __read_mostly shadow_first_rsvd_bit; Hmm, 'first' is technically incorrect since EPT has reserved attribute bits on non-leaf entries. And 'first' is vague, e.g. it could be interpreted as the MSB or LSB. In the end, we're still tracking the number of physical address bits, so maybe something like 'shadow_phys_bits'? E.g. putting it together: /* * The number of non-reserved physical address bits irrespective of features * that repurpose software-accessible bits, e.g. MKTME. */ static u64 __read_mostly shadow_phys_bits; > + > > static void mmu_spte_set(u64 *sptep, u64 spte); > static union kvm_mmu_page_role > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value) > } > EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask); > > +void kvm_set_mmio_spte_mask(void) Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is silly. The current call site is immediately after kvm_mmu_module_init(), I don't see any reason not to just move the call there. > +{ > + u64 mask; > + > + /* > + * Set the reserved bits and the present bit of an paging-structure > + * entry to generate page fault with PFER.RSV = 1. > + */ > + > + /* > + * Mask the uppermost physical address bit, which would be reserved as > + * long as the supported physical address width is less than 52. > + */ > + mask = 1ull << 51; > + > + /* Set the present bit. */ > + mask |= 1ull; > + > + /* > + * If reserved bit is not supported, clear the present bit to disable > + * mmio page fault. > + */ > + if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52) > + mask &= ~1ull; > + > + kvm_mmu_set_mmio_spte_mask(mask, mask); > +} > + > static inline bool sp_ad_disabled(struct kvm_mmu_page *sp) > { > return sp->role.ad_disabled; > @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, > u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK; > u64 mask = generation_mmio_spte_mask(gen); > u64 gpa = gfn << PAGE_SHIFT; > + u64 high_gpa_offset; > > access &= ACC_WRITE_MASK | ACC_USER_MASK; > mask |= shadow_mmio_value | access; > mask |= gpa | shadow_nonpresent_or_rsvd_mask; > + /* > + * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to > + * shadow_first_rsvd_bit - 1 are actually keyID bits but not > + * reserved bits. We need to put high GPA bits to actual reserved > + * bits to mitigate L1TF attack. > + */ > + high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + > + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; > mask |= (gpa & shadow_nonpresent_or_rsvd_mask) > - << shadow_nonpresent_or_rsvd_mask_len; > + << high_gpa_offset; > > page_header(__pa(sptep))->mmio_cached = true; > > @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte) > static gfn_t get_mmio_spte_gfn(u64 spte) > { > u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask; > + /* See comments in mark_mmio_spte */ > + u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + > + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; The exact shift needed is constant after init, there should be no need to dynamically calculate it for every MMIO SPTE. > - gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len) > + gpa |= (spte >> high_gpa_offset) > & shadow_nonpresent_or_rsvd_mask; > > return gpa >> PAGE_SHIFT; > @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, > } > EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes); > > +static u8 kvm_get_cpuid_phys_bits(void) > +{ > + u32 eax, ebx, ecx, edx; > + > + if (boot_cpu_data.extended_cpuid_level < 0x80000008) > + return boot_cpu_data.x86_phys_bits; > + > + cpuid(0x80000008, &eax, &ebx, &ecx, &edx); > + > + return eax & 0xff; cpuid_eax() will do most of the work for you. > +} > + > static void kvm_mmu_reset_all_pte_masks(void) > { > u8 low_phys_bits; > + bool need_l1tf; > > shadow_user_mask = 0; > shadow_accessed_mask = 0; > @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void) > shadow_acc_track_mask = 0; > > /* > - * If the CPU has 46 or less physical address bits, then set an > - * appropriate mask to guard against L1TF attacks. Otherwise, it is > + * Calcualte the first reserved bit position. Although both Intel > + * MKTME and AMD SME/SEV reduce physical address bits for memory > + * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect > + * such fact), they treat those reduced bits differently -- Intel > + * MKTME treats those as 'keyID' thus not reserved bits, but AMD > + * SME/SEV still treats those bits as reserved bits, so for AMD > + * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for > + * Intel (and other x86 vendors that don't support memory encryption > + * at all), shadow_first_rsvd_bit is physical address bits reported > + * by CPUID. > + */ > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) Checking for a specific vendor should only be done as a last resort, e.g. I believe this will break Hygon CPUs w/ SME/SEV. E.g. the exisiting "is AMD" check in mmu.c looks at shadow_x_mask, not an CPUID vendor. {MK}TME provides a CPUID feature bit and MSRs to query support, and this code only runs at module init so the overhead of a RDMSR is negligible. > + shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits; > + else > + shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits(); If you rename the helper to be less specific and actually check for MKTME support then the MKTME comment doesn't need to be as verbose, e.g.: static u8 kvm_get_shadow_phys_bits(void) { if (!<has MKTME> || WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008)) return boot_cpu_data.x86_phys_bits; /* * MKTME steals physical address bits for key IDs, but the key ID bits * are not treated as reserved. x86_phys_bits is adjusted to account * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports * the number of software-available bits irrespective of MKTME. */ return cpuid_eax(0x80000008) & 0xff; } > + > + /* > + * Only Intel is impacted by L1TF, therefore for AMD and other x86 > + * vendors L1TF mitigation is not needed. > + * > + * For Intel CPU, if it has 46 or less physical address bits, then set > + * an appropriate mask to guard against L1TF attacks. Otherwise, it is > * assumed that the CPU is not vulnerable to L1TF. > */ > + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && Again, checking vendors is bad. Not employing a mitigation technique due to an assumption about what processors are affected by a vulnerability is even worse. The existing code makes the assumption regarding processors with >46 bits of address space because a) no such processor existed before the discovery of L1TF, and it's reasonable to assume hardware vendors won't ship future processors with such an obvious vulnerability, and b) hardcoding the number of reserved bits to set simplifies the code. > + (shadow_first_rsvd_bit < > + 52 - shadow_nonpresent_or_rsvd_mask_len)) > + need_l1tf = true; > + else > + need_l1tf = false; > + > low_phys_bits = boot_cpu_data.x86_phys_bits; > - if (boot_cpu_data.x86_phys_bits < > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > + shadow_nonpresent_or_rsvd_mask = 0; > + if (need_l1tf) { > shadow_nonpresent_or_rsvd_mask = > rsvd_bits(boot_cpu_data.x86_phys_bits - > shadow_nonpresent_or_rsvd_mask_len, This is broken, the reserved bits mask is being calculated with the wrong number of physical bits. I think fixing this would eliminate the need for the high_gpa_offset shenanigans. > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, > static void > __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > struct rsvd_bits_validate *rsvd_check, > - int maxphyaddr, int level, bool nx, bool gbpages, > + int first_rsvd_bit, int level, bool nx, bool gbpages, > bool pse, bool amd) Similar to the earlier comment regarding 'first', it's probably less confusing overall to just leave this as 'maxphyaddr'. > { > u64 exb_bit_rsvd = 0; > @@ -4369,14 +4465,14 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > break; > case PT32E_ROOT_LEVEL: > rsvd_check->rsvd_bits_mask[0][2] = > - rsvd_bits(maxphyaddr, 63) | > + rsvd_bits(first_rsvd_bit, 63) | > rsvd_bits(5, 8) | rsvd_bits(1, 2); /* PDPTE */ > rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd | > - rsvd_bits(maxphyaddr, 62); /* PDE */ > + rsvd_bits(first_rsvd_bit, 62); /* PDE */ > rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd | > - rsvd_bits(maxphyaddr, 62); /* PTE */ > + rsvd_bits(first_rsvd_bit, 62); /* PTE */ > rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd | > - rsvd_bits(maxphyaddr, 62) | > + rsvd_bits(first_rsvd_bit, 62) | > rsvd_bits(13, 20); /* large page */ > rsvd_check->rsvd_bits_mask[1][0] = > rsvd_check->rsvd_bits_mask[0][0]; > @@ -4384,28 +4480,28 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > case PT64_ROOT_5LEVEL: > rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd | > nonleaf_bit8_rsvd | rsvd_bits(7, 7) | > - rsvd_bits(maxphyaddr, 51); > + rsvd_bits(first_rsvd_bit, 51); > rsvd_check->rsvd_bits_mask[1][4] = > rsvd_check->rsvd_bits_mask[0][4]; > /* fall through */ > case PT64_ROOT_4LEVEL: > rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd | > nonleaf_bit8_rsvd | rsvd_bits(7, 7) | > - rsvd_bits(maxphyaddr, 51); > + rsvd_bits(first_rsvd_bit, 51); > rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd | > nonleaf_bit8_rsvd | gbpages_bit_rsvd | > - rsvd_bits(maxphyaddr, 51); > + rsvd_bits(first_rsvd_bit, 51); > rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd | > - rsvd_bits(maxphyaddr, 51); > + rsvd_bits(first_rsvd_bit, 51); > rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd | > - rsvd_bits(maxphyaddr, 51); > + rsvd_bits(first_rsvd_bit, 51); > rsvd_check->rsvd_bits_mask[1][3] = > rsvd_check->rsvd_bits_mask[0][3]; > rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd | > - gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) | > + gbpages_bit_rsvd | rsvd_bits(first_rsvd_bit, 51) | > rsvd_bits(13, 29); > rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd | > - rsvd_bits(maxphyaddr, 51) | > + rsvd_bits(first_rsvd_bit, 51) | > rsvd_bits(13, 20); /* large page */ > rsvd_check->rsvd_bits_mask[1][0] = > rsvd_check->rsvd_bits_mask[0][0]; > @@ -4425,27 +4521,27 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > > static void > __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check, > - int maxphyaddr, bool execonly) > + int first_rsvd_bit, bool execonly) > { > u64 bad_mt_xwr; > > rsvd_check->rsvd_bits_mask[0][4] = > - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7); > + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7); > rsvd_check->rsvd_bits_mask[0][3] = > - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7); > + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7); > rsvd_check->rsvd_bits_mask[0][2] = > - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6); > + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6); > rsvd_check->rsvd_bits_mask[0][1] = > - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6); > - rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51); > + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6); > + rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(first_rsvd_bit, 51); > > /* large page */ > rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4]; > rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3]; > rsvd_check->rsvd_bits_mask[1][2] = > - rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29); > + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 29); > rsvd_check->rsvd_bits_mask[1][1] = > - rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20); > + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 20); > rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0]; > > bad_mt_xwr = 0xFFull << (2 * 8); /* bits 3..5 must not be 2 */ > @@ -4523,13 +4619,13 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, > > if (boot_cpu_is_amd()) > __reset_rsvds_bits_mask(vcpu, shadow_zero_check, > - boot_cpu_data.x86_phys_bits, > + shadow_first_rsvd_bit, > context->shadow_root_level, false, > boot_cpu_has(X86_FEATURE_GBPAGES), > true, true); > else > __reset_rsvds_bits_mask_ept(shadow_zero_check, > - boot_cpu_data.x86_phys_bits, > + shadow_first_rsvd_bit, > false); > > if (!shadow_me_mask) > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 54c2a377795b..45e579e112ea 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -51,6 +51,7 @@ static inline u64 rsvd_bits(int s, int e) > return ((1ULL << (e - s + 1)) - 1) << s; > } > > +void kvm_set_mmio_spte_mask(void); > void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value); > > void > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 38440316a806..3a71aab495b8 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6848,35 +6848,6 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = { > .get_guest_ip = kvm_get_guest_ip, > }; > > -static void kvm_set_mmio_spte_mask(void) > -{ > - u64 mask; > - int maxphyaddr = boot_cpu_data.x86_phys_bits; > - > - /* > - * Set the reserved bits and the present bit of an paging-structure > - * entry to generate page fault with PFER.RSV = 1. > - */ > - > - /* > - * Mask the uppermost physical address bit, which would be reserved as > - * long as the supported physical address width is less than 52. > - */ > - mask = 1ull << 51; > - > - /* Set the present bit. */ > - mask |= 1ull; > - > - /* > - * If reserved bit is not supported, clear the present bit to disable > - * mmio page fault. > - */ > - if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52) > - mask &= ~1ull; > - > - kvm_mmu_set_mmio_spte_mask(mask, mask); > -} > - > #ifdef CONFIG_X86_64 > static void pvclock_gtod_update_fn(struct work_struct *work) > { > -- > 2.13.6 >
On Mon, 2019-04-22 at 09:39 -0700, Sean Christopherson wrote: > On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote: > > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced > > due to several high bits of physical address are repurposed for memory > > encryption. To honor such behavior those repurposed bits are reduced from > > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus > > boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported > > by CPUID anymore. > > This neglects to mention the most relevant tidbit of information in terms > of justification for this patch: the number of bits stolen for MKTME is > programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of > kernel support. I can add BIOS part. But the key issue is kernel adjusts boot_cpu_data.x86_phys_bits, isn't it? If kernel doesn't adjust boot_cpu_data.x86_phys_bits then this patch theoretically is not needed? > > > KVM uses boot_cpu_data.x86_phys_bits to calculate several SPTE masks > > based on assumption that: 1) boot_cpu_data.x86_phys_bits equals to > > physical address bits reported by CPUID -- this is used to check CPU has > > reserved bits when KVM calculates shadow_mmio_{value|mask}; and whether > > shadow_nonpresent_or_rsvd_mask should be setup (KVM assumes L1TF is not > > present if CPU has 52 physical address bits); 2) if it is smaller than > > 52, bits [x86_phys_bits, 51] are reserved bits. > > > > With Intel MKTME or AMD SME/SEV above assumption is not valid any more, > > especially when calculating reserved bits with Intel MKTME, since Intel > > MKTME treats the reduced bits as 'keyID', thus they are not reduced > > bits, therefore boot_cpu_data.x86_phys_bits cannot be used to calcualte > > reserved bits anymore, although we can still use it for AMD SME/SEV > > since SME/SEV treats the reduced bits differently -- they are treated as > > reserved bits, the same as other reserved bits in page table entity [1]. > > > > Fix by introducing a new 'shadow_first_rsvd_bit' variable in kvm x86 MMU > > code to store the actual position of reserved bits -- for Intel MKTME, > > it equals to physical address reported by CPUID, and for AMD SME/SEV, it > > is boot_cpu_data.x86_phys_bits. And in reserved bits related calculation > > it is used instead of boot_cpu_data.x86_phys_bits. Some other code > > changes too to make code more reasonable, ie, kvm_set_mmio_spte_mask is > > moved to x86/kvm/mmu.c from x86/kvm/x86.c to use shadow_first_rsvd_bit; > > shadow_nonpresent_or_rsvd_mask calculation logic is slightly changed, > > based on the fact that only Intel CPU is impacted by L1TF, so that KVM > > can use shadow_first_rsvd_bit to check whether KVM should set > > shadow_nonpresent_or_rsvd_mask or not. > > > > Note that for the physical address bits reported to guest should remain > > unchanged -- KVM should report physical address reported by CPUID to > > guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME, > > there's no harm if guest sets up 'keyID' bits in guest page table (since > > MKTME only works at physical address level), and KVM doesn't even expose > > MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it > > should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore > > KVM should still reports physcial address reported by CPUID to guest. > > > > [1] Section 7.10.1 Determining Support for Secure Memory Encryption, > > AMD Architecture Programmer's Manual Volume 2: System Programming). > > Signed-off-by: Kai Huang <kai.huang@linux.intel.com> > > --- > > arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++---------- > > arch/x86/kvm/mmu.h | 1 + > > arch/x86/kvm/x86.c | 29 ---------- > > 3 files changed, 125 insertions(+), 57 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > index e10962dfc203..3add7283980e 100644 > > --- a/arch/x86/kvm/mmu.c > > +++ b/arch/x86/kvm/mmu.c > > @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5; > > */ > > static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; > > > > +/* > > + * The position of first reserved bit. If it equals to 52, then CPU doesn't > > + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are > > + * reserved bits. > > + * > > + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of > > + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce > > + * physical address bits, and the reduced bits are taken away from > > + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD > > + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them > > + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved > > + * bits, thus physical address bits reported by CPUID cannot be used to > > + * determine reserved bits position either. > > + */ > > No need to rehash the justification for the variable, it's sufficient to > state *what* the variable tracks. And the whole first paragraph can be > dropped if the variable is renamed. > > > +static u64 __read_mostly shadow_first_rsvd_bit; > > Hmm, 'first' is technically incorrect since EPT has reserved attribute bits > on non-leaf entries. And 'first' is vague, e.g. it could be interpreted as > the MSB or LSB. > > In the end, we're still tracking the number of physical address bits, so > maybe something like 'shadow_phys_bits'? > > E.g. putting it together: > > /* > * The number of non-reserved physical address bits irrespective of features > * that repurpose software-accessible bits, e.g. MKTME. > */ > static u64 __read_mostly shadow_phys_bits; Fine to me. Will do what you suggested. > > > + > > > > static void mmu_spte_set(u64 *sptep, u64 spte); > > static union kvm_mmu_page_role > > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value) > > } > > EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask); > > > > +void kvm_set_mmio_spte_mask(void) > > Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is > silly. The current call site is immediately after kvm_mmu_module_init(), > I don't see any reason not to just move the call there. I don't know whether calling kvm_set_mmio_spte_mask() from kvm_arch_init() is silly but maybe there's histroy -- KVM calls kvm_mmu_set_mask_ptes() from kvm_arch_init() too, which may also be silly according to your judgement. I have no problem calling kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), but IMHO this logic is irrelevant to this patch, and it's better to have a separate patch for this purpose if necessary? > > +{ > > + u64 mask; > > + > > + /* > > + * Set the reserved bits and the present bit of an paging-structure > > + * entry to generate page fault with PFER.RSV = 1. > > + */ > > + > > + /* > > + * Mask the uppermost physical address bit, which would be reserved as > > + * long as the supported physical address width is less than 52. > > + */ > > + mask = 1ull << 51; > > + > > + /* Set the present bit. */ > > + mask |= 1ull; > > + > > + /* > > + * If reserved bit is not supported, clear the present bit to disable > > + * mmio page fault. > > + */ > > + if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52) > > + mask &= ~1ull; > > + > > + kvm_mmu_set_mmio_spte_mask(mask, mask); > > +} > > + > > static inline bool sp_ad_disabled(struct kvm_mmu_page *sp) > > { > > return sp->role.ad_disabled; > > @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, > > u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK; > > u64 mask = generation_mmio_spte_mask(gen); > > u64 gpa = gfn << PAGE_SHIFT; > > + u64 high_gpa_offset; > > > > access &= ACC_WRITE_MASK | ACC_USER_MASK; > > mask |= shadow_mmio_value | access; > > mask |= gpa | shadow_nonpresent_or_rsvd_mask; > > + /* > > + * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to > > + * shadow_first_rsvd_bit - 1 are actually keyID bits but not > > + * reserved bits. We need to put high GPA bits to actual reserved > > + * bits to mitigate L1TF attack. > > + */ > > + high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + > > + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; > > mask |= (gpa & shadow_nonpresent_or_rsvd_mask) > > - << shadow_nonpresent_or_rsvd_mask_len; > > + << high_gpa_offset; > > > > page_header(__pa(sptep))->mmio_cached = true; > > > > @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte) > > static gfn_t get_mmio_spte_gfn(u64 spte) > > { > > u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask; > > + /* See comments in mark_mmio_spte */ > > + u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + > > + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; > > The exact shift needed is constant after init, there should be no need to > dynamically calculate it for every MMIO SPTE. I can add one more static variable such as 'shadow_high_gfn_offset' and calculate it in kvm_mmu_reset_all_pte_masks (if necessary, as you mentioned below)? Is this good to you? > > > - gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len) > > + gpa |= (spte >> high_gpa_offset) > > & shadow_nonpresent_or_rsvd_mask; > > > > return gpa >> PAGE_SHIFT; > > @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, > > } > > EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes); > > > > +static u8 kvm_get_cpuid_phys_bits(void) > > +{ > > + u32 eax, ebx, ecx, edx; > > + > > + if (boot_cpu_data.extended_cpuid_level < 0x80000008) > > + return boot_cpu_data.x86_phys_bits; > > + > > + cpuid(0x80000008, &eax, &ebx, &ecx, &edx); > > + > > + return eax & 0xff; > > cpuid_eax() will do most of the work for you. Sure will do. thanks. > > > +} > > + > > static void kvm_mmu_reset_all_pte_masks(void) > > { > > u8 low_phys_bits; > > + bool need_l1tf; > > > > shadow_user_mask = 0; > > shadow_accessed_mask = 0; > > @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void) > > shadow_acc_track_mask = 0; > > > > /* > > - * If the CPU has 46 or less physical address bits, then set an > > - * appropriate mask to guard against L1TF attacks. Otherwise, it is > > + * Calcualte the first reserved bit position. Although both Intel > > + * MKTME and AMD SME/SEV reduce physical address bits for memory > > + * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect > > + * such fact), they treat those reduced bits differently -- Intel > > + * MKTME treats those as 'keyID' thus not reserved bits, but AMD > > + * SME/SEV still treats those bits as reserved bits, so for AMD > > + * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for > > + * Intel (and other x86 vendors that don't support memory encryption > > + * at all), shadow_first_rsvd_bit is physical address bits reported > > + * by CPUID. > > + */ > > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) > > Checking for a specific vendor should only be done as a last resort, e.g. > I believe this will break Hygon CPUs w/ SME/SEV. This can be easily adjusted in the future. I can add Hygon check to this patch. > E.g. the exisiting > "is AMD" check in mmu.c looks at shadow_x_mask, not an CPUID vendor. shadow_x_mask hasn't been calculated here. It is calcualted later in kvm_mmu_set_mask_ptes() which is called from kvm_arch_init() after kvm_x86_ops is setup. And does this justify "using CPU vendor check should be as last resort"? There are other kernel code checks CPU vendors too. > {MK}TME provides a CPUID feature bit and MSRs to query support, and this > code only runs at module init so the overhead of a RDMSR is negligible. Yes can check against CPUID feature bit but not sure why it is better than CPU vendor. > > > + shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits; > > + else > > + shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits(); > > If you rename the helper to be less specific and actually check for MKTME > support then the MKTME comment doesn't need to be as verbose, e.g.: > > static u8 kvm_get_shadow_phys_bits(void) > { > if (!<has MKTME> || > WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008)) > return boot_cpu_data.x86_phys_bits; Why do we need WARN_ON_ONCE here? I don't have problem using as you suggested, but I don't get why checking against CPU vendor is last resort? > > /* > * MKTME steals physical address bits for key IDs, but the key ID bits > * are not treated as reserved. x86_phys_bits is adjusted to account > * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports > * the number of software-available bits irrespective of MKTME. > */ > return cpuid_eax(0x80000008) & 0xff; > } > > > + > > + /* > > + * Only Intel is impacted by L1TF, therefore for AMD and other x86 > > + * vendors L1TF mitigation is not needed. > > + * > > + * For Intel CPU, if it has 46 or less physical address bits, then set > > + * an appropriate mask to guard against L1TF attacks. Otherwise, it is > > * assumed that the CPU is not vulnerable to L1TF. > > */ > > + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && > > Again, checking vendors is bad. Not employing a mitigation technique due > to an assumption about what processors are affected by a vulnerability is > even worse. Why, isn't it a fact that L1TF only impacts Intel CPUs? What's the benefit of employing a mitigation to CPUs that don't have L1TF issue? Such mitigation only makes performance worse, even not noticable. For example, why not employing such mitigation regardless to physical address bits at all? > The existing code makes the assumption regarding processors > with >46 bits of address space because a) no such processor existed before > the discovery of L1TF, and it's reasonable to assume hardware vendors > won't ship future processors with such an obvious vulnerability, and b) > hardcoding the number of reserved bits to set simplifies the code. Yes, but we cannot simply use 'shadow_phys_bits' to check against 46 anymore, right? For example, if AMD has 52 phys bits, but it reduces 5 or more bits, then current KVM code would employ l1tf mitigation, but actually it really shouldn't? > > > + (shadow_first_rsvd_bit < > > + 52 - shadow_nonpresent_or_rsvd_mask_len)) > > + need_l1tf = true; > > + else > > + need_l1tf = false; > > + > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > - if (boot_cpu_data.x86_phys_bits < > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > + shadow_nonpresent_or_rsvd_mask = 0; > > + if (need_l1tf) { > > shadow_nonpresent_or_rsvd_mask = > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > shadow_nonpresent_or_rsvd_mask_len, > > This is broken, the reserved bits mask is being calculated with the wrong > number of physical bits. I think fixing this would eliminate the need for > the high_gpa_offset shenanigans. You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think whether high_gpa_offset can be avoided. > > > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, > > static void > > __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > > struct rsvd_bits_validate *rsvd_check, > > - int maxphyaddr, int level, bool nx, bool gbpages, > > + int first_rsvd_bit, int level, bool nx, bool gbpages, > > bool pse, bool amd) > > Similar to the earlier comment regarding 'first', it's probably less > confusing overall to just leave this as 'maxphyaddr'. Both work for me. But maybe 'non_rsvd_maxphyaddr' is better? Thanks, -Kai
On Mon, Apr 22, 2019 at 06:57:01PM -0700, Huang, Kai wrote: > On Mon, 2019-04-22 at 09:39 -0700, Sean Christopherson wrote: > > On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote: > > > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced > > > due to several high bits of physical address are repurposed for memory > > > encryption. To honor such behavior those repurposed bits are reduced from > > > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus > > > boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported > > > by CPUID anymore. > > > > This neglects to mention the most relevant tidbit of information in terms > > of justification for this patch: the number of bits stolen for MKTME is > > programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of > > kernel support. > > I can add BIOS part. But the key issue is kernel adjusts > boot_cpu_data.x86_phys_bits, isn't it? > > If kernel doesn't adjust boot_cpu_data.x86_phys_bits then this patch > theoretically is not needed? True, but the context matters, e.g. readers might wonder why this code doesn't simply check a feature flag to see if MKTME is enabled. Knowing that PA bits can be repurposed regardless of (full) kernel support is just as important as knowing that the kernel adjusts boot_cpu_data.x86_phys_bits. ... > > > arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++---------- > > > arch/x86/kvm/mmu.h | 1 + > > > arch/x86/kvm/x86.c | 29 ---------- > > > 3 files changed, 125 insertions(+), 57 deletions(-) > > > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > > index e10962dfc203..3add7283980e 100644 > > > --- a/arch/x86/kvm/mmu.c > > > +++ b/arch/x86/kvm/mmu.c > > > @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5; > > > */ > > > static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; > > > > > > +/* > > > + * The position of first reserved bit. If it equals to 52, then CPU doesn't > > > + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are > > > + * reserved bits. > > > + * > > > + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of > > > + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce > > > + * physical address bits, and the reduced bits are taken away from > > > + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD > > > + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them > > > + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved > > > + * bits, thus physical address bits reported by CPUID cannot be used to > > > + * determine reserved bits position either. > > > + */ > > > > No need to rehash the justification for the variable, it's sufficient to > > state *what* the variable tracks. And the whole first paragraph can be > > dropped if the variable is renamed. > > > > > +static u64 __read_mostly shadow_first_rsvd_bit; > > > > Hmm, 'first' is technically incorrect since EPT has reserved attribute bits > > on non-leaf entries. And 'first' is vague, e.g. it could be interpreted as > > the MSB or LSB. > > > > In the end, we're still tracking the number of physical address bits, so > > maybe something like 'shadow_phys_bits'? > > > > E.g. putting it together: > > > > /* > > * The number of non-reserved physical address bits irrespective of features > > * that repurpose software-accessible bits, e.g. MKTME. > > */ > > static u64 __read_mostly shadow_phys_bits; > > Fine to me. Will do what you suggested. > > > > > > + > > > > > > static void mmu_spte_set(u64 *sptep, u64 spte); > > > static union kvm_mmu_page_role > > > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value) > > > } > > > EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask); > > > > > > +void kvm_set_mmio_spte_mask(void) > > > > Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is > > silly. The current call site is immediately after kvm_mmu_module_init(), > > I don't see any reason not to just move the call there. > > I don't know whether calling kvm_set_mmio_spte_mask() from kvm_arch_init() is > silly but maybe there's histroy -- KVM calls kvm_mmu_set_mask_ptes() from > kvm_arch_init() too, which may also be silly according to your judgement. I > have no problem calling kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), > but IMHO this logic is irrelevant to this patch, and it's better to have a > separate patch for this purpose if necessary? A separate patch would be fine, but I would do it as a prereq, i.e. move the function first and modify it second. That would help review the functional changes. > > > +{ > > > + u64 mask; > > > + > > > + /* > > > + * Set the reserved bits and the present bit of an paging-structure > > > + * entry to generate page fault with PFER.RSV = 1. > > > + */ > > > + > > > + /* > > > + * Mask the uppermost physical address bit, which would be reserved as > > > + * long as the supported physical address width is less than 52. > > > + */ > > > + mask = 1ull << 51; > > > + > > > + /* Set the present bit. */ > > > + mask |= 1ull; > > > + > > > + /* > > > + * If reserved bit is not supported, clear the present bit to disable > > > + * mmio page fault. > > > + */ > > > + if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52) > > > + mask &= ~1ull; > > > + > > > + kvm_mmu_set_mmio_spte_mask(mask, mask); > > > +} > > > + > > > static inline bool sp_ad_disabled(struct kvm_mmu_page *sp) > > > { > > > return sp->role.ad_disabled; > > > @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, > > > u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK; > > > u64 mask = generation_mmio_spte_mask(gen); > > > u64 gpa = gfn << PAGE_SHIFT; > > > + u64 high_gpa_offset; > > > > > > access &= ACC_WRITE_MASK | ACC_USER_MASK; > > > mask |= shadow_mmio_value | access; > > > mask |= gpa | shadow_nonpresent_or_rsvd_mask; > > > + /* > > > + * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to > > > + * shadow_first_rsvd_bit - 1 are actually keyID bits but not > > > + * reserved bits. We need to put high GPA bits to actual reserved > > > + * bits to mitigate L1TF attack. > > > + */ > > > + high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + > > > + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; > > > mask |= (gpa & shadow_nonpresent_or_rsvd_mask) > > > - << shadow_nonpresent_or_rsvd_mask_len; > > > + << high_gpa_offset; > > > > > > page_header(__pa(sptep))->mmio_cached = true; > > > > > > @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte) > > > static gfn_t get_mmio_spte_gfn(u64 spte) > > > { > > > u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask; > > > + /* See comments in mark_mmio_spte */ > > > + u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + > > > + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; > > > > The exact shift needed is constant after init, there should be no need to > > dynamically calculate it for every MMIO SPTE. > > I can add one more static variable such as 'shadow_high_gfn_offset' and calculate it in > kvm_mmu_reset_all_pte_masks (if necessary, as you mentioned below)? Is this good to you? I'm pretty sure we can use shadow_phys_bits directly once the bug in kvm_mmu_reset_all_pte_masks() is fixed. > > > > > - gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len) > > > + gpa |= (spte >> high_gpa_offset) > > > & shadow_nonpresent_or_rsvd_mask; > > > > > > return gpa >> PAGE_SHIFT; > > > @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, > > > } > > > EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes); > > > > > > +static u8 kvm_get_cpuid_phys_bits(void) > > > +{ > > > + u32 eax, ebx, ecx, edx; > > > + > > > + if (boot_cpu_data.extended_cpuid_level < 0x80000008) > > > + return boot_cpu_data.x86_phys_bits; > > > + > > > + cpuid(0x80000008, &eax, &ebx, &ecx, &edx); > > > + > > > + return eax & 0xff; > > > > cpuid_eax() will do most of the work for you. > > Sure will do. thanks. > > > > > > +} > > > + > > > static void kvm_mmu_reset_all_pte_masks(void) > > > { > > > u8 low_phys_bits; > > > + bool need_l1tf; > > > > > > shadow_user_mask = 0; > > > shadow_accessed_mask = 0; > > > @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void) > > > shadow_acc_track_mask = 0; > > > > > > /* > > > - * If the CPU has 46 or less physical address bits, then set an > > > - * appropriate mask to guard against L1TF attacks. Otherwise, it is > > > + * Calcualte the first reserved bit position. Although both Intel > > > + * MKTME and AMD SME/SEV reduce physical address bits for memory > > > + * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect > > > + * such fact), they treat those reduced bits differently -- Intel > > > + * MKTME treats those as 'keyID' thus not reserved bits, but AMD > > > + * SME/SEV still treats those bits as reserved bits, so for AMD > > > + * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for > > > + * Intel (and other x86 vendors that don't support memory encryption > > > + * at all), shadow_first_rsvd_bit is physical address bits reported > > > + * by CPUID. > > > + */ > > > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) > > > > Checking for a specific vendor should only be done as a last resort, e.g. > > I believe this will break Hygon CPUs w/ SME/SEV. > > This can be easily adjusted in the future. I can add Hygon check to this patch. > > > E.g. the exisiting > > "is AMD" check in mmu.c looks at shadow_x_mask, not an CPUID vendor. > > shadow_x_mask hasn't been calculated here. It is calcualted later in kvm_mmu_set_mask_ptes() which > is called from kvm_arch_init() after kvm_x86_ops is setup. > > And does this justify "using CPU vendor check should be as last resort"? There are other kernel code > checks CPU vendors too. > > > > {MK}TME provides a CPUID feature bit and MSRs to query support, and this > > code only runs at module init so the overhead of a RDMSR is negligible. > > Yes can check against CPUID feature bit but not sure why it is better than CPU vendor. It helps readers understand the code flow, i.e. it's a mental cue that the bit stealing only applies to MKTME. Checking for "Intel" could easiliy be misinterpreted as "x86_phys_bits isn't accurate for Intel CPUs." > > > + shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits; > > > + else > > > + shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits(); > > > > If you rename the helper to be less specific and actually check for MKTME > > support then the MKTME comment doesn't need to be as verbose, e.g.: > > > > static u8 kvm_get_shadow_phys_bits(void) > > { > > if (!<has MKTME> || > > WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008)) > > return boot_cpu_data.x86_phys_bits; > > Why do we need WARN_ON_ONCE here? Because KVM would essentially be consuming known bad data since we've already established that 'x86_phys_bits' is wrong when MKTME is enabled. I.e. we should never encounter a platform with MKTME but not CPUID leaf 0x80000008. > I don't have problem using as you suggested, but I don't get why checking > against CPU vendor is last resort? A few reasons of the top of my head: - CPU vendor is less precise, e.g. by checking for MKTME support KVM can WARN if it's consuming known bad data. - Checking for features makes the code self-documenting to some extent. - Multiple vendors may support a feature, now or in the future. E.g. the Hygon case is a great example. > > > > /* > > * MKTME steals physical address bits for key IDs, but the key ID bits > > * are not treated as reserved. x86_phys_bits is adjusted to account > > * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports > > * the number of software-available bits irrespective of MKTME. > > */ > > return cpuid_eax(0x80000008) & 0xff; > > } > > > > > + > > > + /* > > > + * Only Intel is impacted by L1TF, therefore for AMD and other x86 > > > + * vendors L1TF mitigation is not needed. > > > + * > > > + * For Intel CPU, if it has 46 or less physical address bits, then set > > > + * an appropriate mask to guard against L1TF attacks. Otherwise, it is > > > * assumed that the CPU is not vulnerable to L1TF. > > > */ > > > + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && > > > > Again, checking vendors is bad. Not employing a mitigation technique due > > to an assumption about what processors are affected by a vulnerability is > > even worse. > > Why, isn't it a fact that L1TF only impacts Intel CPUs? What's the benefit of > employing a mitigation to CPUs that don't have L1TF issue? Such mitigation > only makes performance worse, even not noticable. For example, why not > employing such mitigation regardless to physical address bits at all? This particular mitigation has minimal effect on performance, e.g. adds a few SHL/SHR/AND/OR operations, and that minimal overhead is incurred regardless of whether or not reserved bits are set. I.e. KVM is paying the penalty no matter what, so being extra paranoid is "free". > > The existing code makes the assumption regarding processors > > with >46 bits of address space because a) no such processor existed before > > the discovery of L1TF, and it's reasonable to assume hardware vendors > > won't ship future processors with such an obvious vulnerability, and b) > > hardcoding the number of reserved bits to set simplifies the code. > > Yes, but we cannot simply use 'shadow_phys_bits' to check against 46 anymore, > right? > > For example, if AMD has 52 phys bits, but it reduces 5 or more bits, then > current KVM code would employ l1tf mitigation, but actually it really > shouldn't? What do you mean by "shouldn't"? Sure, it's not absolutely necessary, but again setting the bits is free since adjusting the GPA is hardcoded into mark_mmio_spte() and get_mmio_spte_gfn(). > > > + (shadow_first_rsvd_bit < > > > + 52 - shadow_nonpresent_or_rsvd_mask_len)) > > > + need_l1tf = true; > > > + else > > > + need_l1tf = false; > > > + > > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > > - if (boot_cpu_data.x86_phys_bits < > > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > > + shadow_nonpresent_or_rsvd_mask = 0; > > > + if (need_l1tf) { > > > shadow_nonpresent_or_rsvd_mask = > > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > > shadow_nonpresent_or_rsvd_mask_len, > > > > This is broken, the reserved bits mask is being calculated with the wrong > > number of physical bits. I think fixing this would eliminate the need for > > the high_gpa_offset shenanigans. > > You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think whether high_gpa_offset > can be avoided. > > > > > > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, > > > static void > > > __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > > > struct rsvd_bits_validate *rsvd_check, > > > - int maxphyaddr, int level, bool nx, bool gbpages, > > > + int first_rsvd_bit, int level, bool nx, bool gbpages, > > > bool pse, bool amd) > > > > Similar to the earlier comment regarding 'first', it's probably less > > confusing overall to just leave this as 'maxphyaddr'. > > Both work for me. But maybe 'non_rsvd_maxphyaddr' is better? Maybe? My personal preference would be to stay with maxphyaddr.
On Tue, 2019-04-23 at 08:08 -0700, Sean Christopherson wrote: > On Mon, Apr 22, 2019 at 06:57:01PM -0700, Huang, Kai wrote: > > On Mon, 2019-04-22 at 09:39 -0700, Sean Christopherson wrote: > > > On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote: > > > > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced > > > > due to several high bits of physical address are repurposed for memory > > > > encryption. To honor such behavior those repurposed bits are reduced from > > > > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus > > > > boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported > > > > by CPUID anymore. > > > > > > This neglects to mention the most relevant tidbit of information in terms > > > of justification for this patch: the number of bits stolen for MKTME is > > > programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of > > > kernel support. > > > > I can add BIOS part. But the key issue is kernel adjusts > > boot_cpu_data.x86_phys_bits, isn't it? > > > > If kernel doesn't adjust boot_cpu_data.x86_phys_bits then this patch > > theoretically is not needed? > > True, but the context matters, e.g. readers might wonder why this code > doesn't simply check a feature flag to see if MKTME is enabled. Knowing > that PA bits can be repurposed regardless of (full) kernel support is just > as important as knowing that the kernel adjusts boot_cpu_data.x86_phys_bits. Fine. [...] > > > > > > > + > > > > > > > > static void mmu_spte_set(u64 *sptep, u64 spte); > > > > static union kvm_mmu_page_role > > > > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value) > > > > } > > > > EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask); > > > > > > > > +void kvm_set_mmio_spte_mask(void) > > > > > > Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is > > > silly. The current call site is immediately after kvm_mmu_module_init(), > > > I don't see any reason not to just move the call there. > > > > I don't know whether calling kvm_set_mmio_spte_mask() from kvm_arch_init() is > > silly but maybe there's histroy -- KVM calls kvm_mmu_set_mask_ptes() from > > kvm_arch_init() too, which may also be silly according to your judgement. I > > have no problem calling kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), > > but IMHO this logic is irrelevant to this patch, and it's better to have a > > separate patch for this purpose if necessary? > > A separate patch would be fine, but I would do it as a prereq, i.e. move > the function first and modify it second. That would help review the > functional changes. I think I am still a little bit confused. Assuming you want to call kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), right after kvm_mmu_reset_all_pte_masks() is called, does this change do any real fix related to this issue? Or exactly what are you suggesting if I am misunderstanding? Btw I think calling kvm_set_mmio_spte_mask() from kvm_arch_init() has benefits, since we can change to call it after kvm_x86_ops has been setup, and then we can do some kvm_x86_ops->xxx() if needed inside. And we also have kvm_mmu_set_mask_ptes(), which seems to be similar thing as kvm_set_mmio_spte_mask(). I don't see why we should call kvm_set_mmio_spte_mask() in kvm_mmu_module_init() but leave kvm_mmu_set_mask_ptes() outside. [...] > > > > > > > > {MK}TME provides a CPUID feature bit and MSRs to query support, and this > > > code only runs at module init so the overhead of a RDMSR is negligible. > > > > Yes can check against CPUID feature bit but not sure why it is better than CPU vendor. > > It helps readers understand the code flow, i.e. it's a mental cue that the > bit stealing only applies to MKTME. Checking for "Intel" could easiliy be > misinterpreted as "x86_phys_bits isn't accurate for Intel CPUs." OK. Fine to me. > > > > > + shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits; > > > > + else > > > > + shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits(); > > > > > > If you rename the helper to be less specific and actually check for MKTME > > > support then the MKTME comment doesn't need to be as verbose, e.g.: > > > > > > static u8 kvm_get_shadow_phys_bits(void) > > > { > > > if (!<has MKTME> || > > > WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008)) > > > return boot_cpu_data.x86_phys_bits; > > > > Why do we need WARN_ON_ONCE here? > > Because KVM would essentially be consuming known bad data since we've > already established that 'x86_phys_bits' is wrong when MKTME is enabled. > I.e. we should never encounter a platform with MKTME but not CPUID leaf > 0x80000008. Can you make assumption that compiler won't do WARN_ON_ONCE part before checking !<has_mktme>? > > > I don't have problem using as you suggested, but I don't get why checking > > against CPU vendor is last resort? > > A few reasons of the top of my head: > > - CPU vendor is less precise, e.g. by checking for MKTME support KVM can > WARN if it's consuming known bad data. > > - Checking for features makes the code self-documenting to some extent. > > - Multiple vendors may support a feature, now or in the future. E.g. the > Hygon case is a great example. OK. > > > > > > > /* > > > * MKTME steals physical address bits for key IDs, but the key ID bits > > > * are not treated as reserved. x86_phys_bits is adjusted to account > > > * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports > > > * the number of software-available bits irrespective of MKTME. > > > */ > > > return cpuid_eax(0x80000008) & 0xff; > > > } > > > > > > > + > > > > + /* > > > > + * Only Intel is impacted by L1TF, therefore for AMD and other x86 > > > > + * vendors L1TF mitigation is not needed. > > > > + * > > > > + * For Intel CPU, if it has 46 or less physical address bits, then set > > > > + * an appropriate mask to guard against L1TF attacks. Otherwise, it is > > > > * assumed that the CPU is not vulnerable to L1TF. > > > > */ > > > > + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && > > > > > > Again, checking vendors is bad. Not employing a mitigation technique due > > > to an assumption about what processors are affected by a vulnerability is > > > even worse. > > > > Why, isn't it a fact that L1TF only impacts Intel CPUs? What's the benefit of > > employing a mitigation to CPUs that don't have L1TF issue? Such mitigation > > only makes performance worse, even not noticable. For example, why not > > employing such mitigation regardless to physical address bits at all? > > This particular mitigation has minimal effect on performance, e.g. adds > a few SHL/SHR/AND/OR operations, and that minimal overhead is incurred > regardless of whether or not reserved bits are set. I.e. KVM is paying > the penalty no matter what, so being extra paranoid is "free". > > > > The existing code makes the assumption regarding processors > > > with >46 bits of address space because a) no such processor existed before > > > the discovery of L1TF, and it's reasonable to assume hardware vendors > > > won't ship future processors with such an obvious vulnerability, and b) > > > hardcoding the number of reserved bits to set simplifies the code. > > > > Yes, but we cannot simply use 'shadow_phys_bits' to check against 46 anymore, > > right? > > > > For example, if AMD has 52 phys bits, but it reduces 5 or more bits, then > > current KVM code would employ l1tf mitigation, but actually it really > > shouldn't? > > What do you mean by "shouldn't"? Sure, it's not absolutely necessary, but > again setting the bits is free since adjusting the GPA is hardcoded into > mark_mmio_spte() and get_mmio_spte_gfn(). OK. I am not sure whether CPU has any difference when doing x << 5 and x << 0 but this is really trival so will do what you suggested. Or I found there's one CPU bug flag (software) X86_BUG_L1TF, do you think we should check whether boot_cpu_has(X86_BUG_L1TF), and don't setup shadow_nonpresent_or_rsvd_mask if it doesn't? > > > > > + (shadow_first_rsvd_bit < > > > > + 52 - shadow_nonpresent_or_rsvd_mask_len)) > > > > + need_l1tf = true; > > > > + else > > > > + need_l1tf = false; > > > > + > > > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > > > - if (boot_cpu_data.x86_phys_bits < > > > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > > > + shadow_nonpresent_or_rsvd_mask = 0; > > > > + if (need_l1tf) { > > > > shadow_nonpresent_or_rsvd_mask = > > > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > > > shadow_nonpresent_or_rsvd_mask_len, > > > > > > This is broken, the reserved bits mask is being calculated with the wrong > > > number of physical bits. I think fixing this would eliminate the need for > > > the high_gpa_offset shenanigans. > > > > You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think whether > > high_gpa_offset > > can be avoided. > > > > > > > > > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, > > > > static void > > > > __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > > > > struct rsvd_bits_validate *rsvd_check, > > > > - int maxphyaddr, int level, bool nx, bool gbpages, > > > > + int first_rsvd_bit, int level, bool nx, bool gbpages, > > > > bool pse, bool amd) > > > > > > Similar to the earlier comment regarding 'first', it's probably less > > > confusing overall to just leave this as 'maxphyaddr'. > > > > Both work for me. But maybe 'non_rsvd_maxphyaddr' is better? > > Maybe? My personal preference would be to stay with maxphyaddr. Fine to maxphyaddr. Btw, which one do you think is better, shadow_phys_bits, or more explicitly, shadow_non_rsvd_phys_bits? Thanks, -Kai
> > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > > - if (boot_cpu_data.x86_phys_bits < > > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > > + shadow_nonpresent_or_rsvd_mask = 0; > > > + if (need_l1tf) { > > > shadow_nonpresent_or_rsvd_mask = > > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > > shadow_nonpresent_or_rsvd_mask_len, > > > > This is broken, the reserved bits mask is being calculated with the wrong > > number of physical bits. I think fixing this would eliminate the need for > > the high_gpa_offset shenanigans. > > You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think whether high_gpa_offset > can be avoided. > Hi Sean, Paolo, and others, After re-thinking, I think we should even use boot_cpu_data.x86_cache_bits to calculate shadow_nonpresent_or_rsvd_mask, but not shadow_phys_bits, since for some particular Intel CPU, the internal cache bits are larger than physical address bits reported by CPUID. To make this KVM L1TF migitation work, we actually have to set the highest bit of cache bits, but not the physical address bits in SPTE (which means the original code also has a bug if I understand correctly). Comments? Thanks, -Kai
On Tue, Apr 23, 2019 at 04:38:17PM -0700, Huang, Kai wrote: > On Tue, 2019-04-23 at 08:08 -0700, Sean Christopherson wrote: > > On Mon, Apr 22, 2019 at 06:57:01PM -0700, Huang, Kai wrote: > > > I don't know whether calling kvm_set_mmio_spte_mask() from > > > kvm_arch_init() is silly but maybe there's histroy -- KVM calls > > > kvm_mmu_set_mask_ptes() from kvm_arch_init() too, which may also be silly > > > according to your judgement. I have no problem calling > > > kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), but IMHO this logic > > > is irrelevant to this patch, and it's better to have a separate patch for > > > this purpose if necessary? > > > > A separate patch would be fine, but I would do it as a prereq, i.e. move > > the function first and modify it second. That would help review the > > functional changes. > > I think I am still a little bit confused. Assuming you want to call > kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), right after > kvm_mmu_reset_all_pte_masks() is called, does this change do any real fix > related to this issue? Or exactly what are you suggesting if I am > misunderstanding? Moving kvm_set_mmio_spte_mask() to mmu.c allows it to be a static function local to mmu.c. AFAICT the original decision to define and call it in x86.c was completely arbitrary. Now that it depends on variables in mmu.c then it makes sense to move it. > Btw I think calling kvm_set_mmio_spte_mask() from kvm_arch_init() has > benefits, since we can change to call it after kvm_x86_ops has been setup, > and then we can do some kvm_x86_ops->xxx() if needed inside. Alternatively, @ops could be passed to kvm_mmu_module_init(). Regardless, using 'struct kvm_x86_ops' for unadjusting the phys bits is probably overkill, but it might make sense if we're doing RDMSR to process MKTME information. I guess I don't have a strong opinion :-) > And we also have kvm_mmu_set_mask_ptes(), which seems to be similar thing as > kvm_set_mmio_spte_mask(). I don't see why we should call > kvm_set_mmio_spte_mask() in kvm_mmu_module_init() but leave > kvm_mmu_set_mask_ptes() outside. True, although kvm_mmu_set_mask_ptes() is pretty clear cut since the call from VMX when EPT is enabled consumes information that is already available in VMX-specific code, e.g. A/D bit enabling, whereas the MKTME handling is new one-off code no matter where it lands. > > Because KVM would essentially be consuming known bad data since we've > > already established that 'x86_phys_bits' is wrong when MKTME is enabled. > > I.e. we should never encounter a platform with MKTME but not CPUID leaf > > 0x80000008. > > Can you make assumption that compiler won't do WARN_ON_ONCE part before > checking !<has_mktme>? Absolutely, it's part of the C99 spec[1] under '6.5.14 Logical OR operator'. Without strict left-to-right evaluation, things like 'if (ptr && ptr->deref)' and 'if (!ptr || ptr->invalid)' fall apart. [1] http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf > > What do you mean by "shouldn't"? Sure, it's not absolutely necessary, but > > again setting the bits is free since adjusting the GPA is hardcoded into > > mark_mmio_spte() and get_mmio_spte_gfn(). > > OK. I am not sure whether CPU has any difference when doing x << 5 and x << 0 > but this is really trival so will do what you suggested. 'x << 5' vs. 'x << 0' doesn't matter, but ensuring the shift amount is a compile-time constant absolutely does matter since it allows the compiler to emit a single 'SH{L,R} reg, imm' instead of having to load CL to do 'SH{L,R} reg, CL'. E.g. setting the reserved bits compiles to: constant: 0x0000000000025ce6 <+86>: and %rsi,%rcx 0x0000000000025cf0 <+96>: shl $0x5,%rcx 0x0000000000025d38 <+168>: or %rax,%rcx 0x0000000000025d43 <+179>: mov %rcx,%r15 variable: 0x0000000000025d74 <+148>: and %rcx,%rdi 0x0000000000025d77 <+151>: mov 0x0(%rip),%rcx 0x0000000000025d7e <+158>: mov %rdi,%rdx 0x0000000000025d84 <+164>: shl %cl,%rdx 0x0000000000025d87 <+167>: mov %rdx,%rcx 0x0000000000025d94 <+180>: or %rax,%rcx 0x0000000000025d9f <+191>: mov %rcx,%r15 > Or I found there's one CPU bug flag (software) X86_BUG_L1TF, do you think we > should check whether boot_cpu_has(X86_BUG_L1TF), and don't setup > shadow_nonpresent_or_rsvd_mask if it doesn't? No. Again, the cost is paid regardless, being paranoid is free. > > > > > + (shadow_first_rsvd_bit < > > > > > + 52 - shadow_nonpresent_or_rsvd_mask_len)) > > > > > + need_l1tf = true; > > > > > + else > > > > > + need_l1tf = false; > > > > > + > > > > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > > > > - if (boot_cpu_data.x86_phys_bits < > > > > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > > > > + shadow_nonpresent_or_rsvd_mask = 0; > > > > > + if (need_l1tf) { > > > > > shadow_nonpresent_or_rsvd_mask = > > > > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > > > > shadow_nonpresent_or_rsvd_mask_len, > > > > > > > > This is broken, the reserved bits mask is being calculated with the wrong > > > > number of physical bits. I think fixing this would eliminate the need for > > > > the high_gpa_offset shenanigans. > > > > > > You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think whether > > > high_gpa_offset > > > can be avoided. > > > > > > > > > > > > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, > > > > > static void > > > > > __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, > > > > > struct rsvd_bits_validate *rsvd_check, > > > > > - int maxphyaddr, int level, bool nx, bool gbpages, > > > > > + int first_rsvd_bit, int level, bool nx, bool gbpages, > > > > > bool pse, bool amd) > > > > > > > > Similar to the earlier comment regarding 'first', it's probably less > > > > confusing overall to just leave this as 'maxphyaddr'. > > > > > > Both work for me. But maybe 'non_rsvd_maxphyaddr' is better? > > > > Maybe? My personal preference would be to stay with maxphyaddr. > > Fine to maxphyaddr. Btw, which one do you think is better, shadow_phys_bits, or more explicitly, > shadow_non_rsvd_phys_bits? shadow_phys_bits. In both cases, I prefer avoiding 'non_rsvd' because it's redundant in the common (non-MKTME) case and seems more likely to confuse than clarify. Again, just personal preference...
On Wed, Apr 24, 2019 at 05:13:12AM -0700, Huang, Kai wrote: > > > > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > > > - if (boot_cpu_data.x86_phys_bits < > > > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > > > + shadow_nonpresent_or_rsvd_mask = 0; > > > > + if (need_l1tf) { > > > > shadow_nonpresent_or_rsvd_mask = > > > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > > > shadow_nonpresent_or_rsvd_mask_len, > > > > > > This is broken, the reserved bits mask is being calculated with the wrong > > > number of physical bits. I think fixing this would eliminate the need for > > > the high_gpa_offset shenanigans. > > > > You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think > > whether high_gpa_offset can be avoided. > > > > Hi Sean, Paolo, and others, > > After re-thinking, I think we should even use boot_cpu_data.x86_cache_bits to > calculate shadow_nonpresent_or_rsvd_mask, but not shadow_phys_bits, since for > some particular Intel CPU, the internal cache bits are larger than physical > address bits reported by CPUID. To make this KVM L1TF migitation work, we > actually have to set the highest bit of cache bits, but not the physical > address bits in SPTE (which means the original code also has a bug if I > understand correctly). What's the exact CPU behavior you're referencing? Unless it's doing some crazy PA aliasing it should be a non-issue.
On Wed, 2019-04-24 at 08:57 -0700, Sean Christopherson wrote: > On Wed, Apr 24, 2019 at 05:13:12AM -0700, Huang, Kai wrote: > > > > > > > low_phys_bits = boot_cpu_data.x86_phys_bits; > > > > > - if (boot_cpu_data.x86_phys_bits < > > > > > - 52 - shadow_nonpresent_or_rsvd_mask_len) { > > > > > + shadow_nonpresent_or_rsvd_mask = 0; > > > > > + if (need_l1tf) { > > > > > shadow_nonpresent_or_rsvd_mask = > > > > > rsvd_bits(boot_cpu_data.x86_phys_bits - > > > > > shadow_nonpresent_or_rsvd_mask_len, > > > > > > > > This is broken, the reserved bits mask is being calculated with the wrong > > > > number of physical bits. I think fixing this would eliminate the need for > > > > the high_gpa_offset shenanigans. > > > > > > You are right. should use 'shadow_phys_bits' instead. Thanks. Let me think > > > whether high_gpa_offset can be avoided. > > > > > > > Hi Sean, Paolo, and others, > > > > After re-thinking, I think we should even use boot_cpu_data.x86_cache_bits to > > calculate shadow_nonpresent_or_rsvd_mask, but not shadow_phys_bits, since for > > some particular Intel CPU, the internal cache bits are larger than physical > > address bits reported by CPUID. To make this KVM L1TF migitation work, we > > actually have to set the highest bit of cache bits, but not the physical > > address bits in SPTE (which means the original code also has a bug if I > > understand correctly). > > What's the exact CPU behavior you're referencing? Unless it's doing some > crazy PA aliasing it should be a non-issue. There's no spec saying the exact behavior, but I am looking at this: https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-l1-terminal- fault "Some processors may internally implement more address bits in the L1D cache than are reported in MAXPHYADDR. This is not reported by CPUID, so the following table can be used: Table 2: Processors Implementing More L1D Address Bits than Reported Processor code name Implemented L1D bits Nehalem, Westmere 44 Sandy Bridge and newer 46 On these systems the OS can set one or more bits above MAXPHYADDR but below the L1D limit to ensure that the PTE does not reference any physical memory address. This can often be used to avoid limiting the amount of usable physical memory." And kernel actually set limit of usable physical memory by checking boot_cpu_data.x86_cache_bits: static inline unsigned long long l1tf_pfn_limit(void) { return BIT_ULL(boot_cpu_data.x86_cache_bits - 1 - PAGE_SHIFT); } Thanks, -Kai
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index e10962dfc203..3add7283980e 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5; */ static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; +/* + * The position of first reserved bit. If it equals to 52, then CPU doesn't + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are + * reserved bits. + * + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce + * physical address bits, and the reduced bits are taken away from + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved + * bits, thus physical address bits reported by CPUID cannot be used to + * determine reserved bits position either. + */ +static u64 __read_mostly shadow_first_rsvd_bit; + static void mmu_spte_set(u64 *sptep, u64 spte); static union kvm_mmu_page_role @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value) } EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask); +void kvm_set_mmio_spte_mask(void) +{ + u64 mask; + + /* + * Set the reserved bits and the present bit of an paging-structure + * entry to generate page fault with PFER.RSV = 1. + */ + + /* + * Mask the uppermost physical address bit, which would be reserved as + * long as the supported physical address width is less than 52. + */ + mask = 1ull << 51; + + /* Set the present bit. */ + mask |= 1ull; + + /* + * If reserved bit is not supported, clear the present bit to disable + * mmio page fault. + */ + if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52) + mask &= ~1ull; + + kvm_mmu_set_mmio_spte_mask(mask, mask); +} + static inline bool sp_ad_disabled(struct kvm_mmu_page *sp) { return sp->role.ad_disabled; @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK; u64 mask = generation_mmio_spte_mask(gen); u64 gpa = gfn << PAGE_SHIFT; + u64 high_gpa_offset; access &= ACC_WRITE_MASK | ACC_USER_MASK; mask |= shadow_mmio_value | access; mask |= gpa | shadow_nonpresent_or_rsvd_mask; + /* + * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to + * shadow_first_rsvd_bit - 1 are actually keyID bits but not + * reserved bits. We need to put high GPA bits to actual reserved + * bits to mitigate L1TF attack. + */ + high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; mask |= (gpa & shadow_nonpresent_or_rsvd_mask) - << shadow_nonpresent_or_rsvd_mask_len; + << high_gpa_offset; page_header(__pa(sptep))->mmio_cached = true; @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte) static gfn_t get_mmio_spte_gfn(u64 spte) { u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask; + /* See comments in mark_mmio_spte */ + u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len + + shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits; - gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len) + gpa |= (spte >> high_gpa_offset) & shadow_nonpresent_or_rsvd_mask; return gpa >> PAGE_SHIFT; @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, } EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes); +static u8 kvm_get_cpuid_phys_bits(void) +{ + u32 eax, ebx, ecx, edx; + + if (boot_cpu_data.extended_cpuid_level < 0x80000008) + return boot_cpu_data.x86_phys_bits; + + cpuid(0x80000008, &eax, &ebx, &ecx, &edx); + + return eax & 0xff; +} + static void kvm_mmu_reset_all_pte_masks(void) { u8 low_phys_bits; + bool need_l1tf; shadow_user_mask = 0; shadow_accessed_mask = 0; @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void) shadow_acc_track_mask = 0; /* - * If the CPU has 46 or less physical address bits, then set an - * appropriate mask to guard against L1TF attacks. Otherwise, it is + * Calcualte the first reserved bit position. Although both Intel + * MKTME and AMD SME/SEV reduce physical address bits for memory + * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect + * such fact), they treat those reduced bits differently -- Intel + * MKTME treats those as 'keyID' thus not reserved bits, but AMD + * SME/SEV still treats those bits as reserved bits, so for AMD + * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for + * Intel (and other x86 vendors that don't support memory encryption + * at all), shadow_first_rsvd_bit is physical address bits reported + * by CPUID. + */ + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) + shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits; + else + shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits(); + + /* + * Only Intel is impacted by L1TF, therefore for AMD and other x86 + * vendors L1TF mitigation is not needed. + * + * For Intel CPU, if it has 46 or less physical address bits, then set + * an appropriate mask to guard against L1TF attacks. Otherwise, it is * assumed that the CPU is not vulnerable to L1TF. */ + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && + (shadow_first_rsvd_bit < + 52 - shadow_nonpresent_or_rsvd_mask_len)) + need_l1tf = true; + else + need_l1tf = false; + low_phys_bits = boot_cpu_data.x86_phys_bits; - if (boot_cpu_data.x86_phys_bits < - 52 - shadow_nonpresent_or_rsvd_mask_len) { + shadow_nonpresent_or_rsvd_mask = 0; + if (need_l1tf) { shadow_nonpresent_or_rsvd_mask = rsvd_bits(boot_cpu_data.x86_phys_bits - shadow_nonpresent_or_rsvd_mask_len, @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, static void __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, struct rsvd_bits_validate *rsvd_check, - int maxphyaddr, int level, bool nx, bool gbpages, + int first_rsvd_bit, int level, bool nx, bool gbpages, bool pse, bool amd) { u64 exb_bit_rsvd = 0; @@ -4369,14 +4465,14 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, break; case PT32E_ROOT_LEVEL: rsvd_check->rsvd_bits_mask[0][2] = - rsvd_bits(maxphyaddr, 63) | + rsvd_bits(first_rsvd_bit, 63) | rsvd_bits(5, 8) | rsvd_bits(1, 2); /* PDPTE */ rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 62); /* PDE */ + rsvd_bits(first_rsvd_bit, 62); /* PDE */ rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 62); /* PTE */ + rsvd_bits(first_rsvd_bit, 62); /* PTE */ rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 62) | + rsvd_bits(first_rsvd_bit, 62) | rsvd_bits(13, 20); /* large page */ rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0]; @@ -4384,28 +4480,28 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, case PT64_ROOT_5LEVEL: rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd | nonleaf_bit8_rsvd | rsvd_bits(7, 7) | - rsvd_bits(maxphyaddr, 51); + rsvd_bits(first_rsvd_bit, 51); rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4]; /* fall through */ case PT64_ROOT_4LEVEL: rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd | nonleaf_bit8_rsvd | rsvd_bits(7, 7) | - rsvd_bits(maxphyaddr, 51); + rsvd_bits(first_rsvd_bit, 51); rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd | nonleaf_bit8_rsvd | gbpages_bit_rsvd | - rsvd_bits(maxphyaddr, 51); + rsvd_bits(first_rsvd_bit, 51); rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51); + rsvd_bits(first_rsvd_bit, 51); rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51); + rsvd_bits(first_rsvd_bit, 51); rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3]; rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd | - gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) | + gbpages_bit_rsvd | rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(13, 29); rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51) | + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(13, 20); /* large page */ rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0]; @@ -4425,27 +4521,27 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check, - int maxphyaddr, bool execonly) + int first_rsvd_bit, bool execonly) { u64 bad_mt_xwr; rsvd_check->rsvd_bits_mask[0][4] = - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7); + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7); rsvd_check->rsvd_bits_mask[0][3] = - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7); + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7); rsvd_check->rsvd_bits_mask[0][2] = - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6); + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6); rsvd_check->rsvd_bits_mask[0][1] = - rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6); - rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51); + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6); + rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(first_rsvd_bit, 51); /* large page */ rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4]; rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3]; rsvd_check->rsvd_bits_mask[1][2] = - rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29); + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 29); rsvd_check->rsvd_bits_mask[1][1] = - rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20); + rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 20); rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0]; bad_mt_xwr = 0xFFull << (2 * 8); /* bits 3..5 must not be 2 */ @@ -4523,13 +4619,13 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, if (boot_cpu_is_amd()) __reset_rsvds_bits_mask(vcpu, shadow_zero_check, - boot_cpu_data.x86_phys_bits, + shadow_first_rsvd_bit, context->shadow_root_level, false, boot_cpu_has(X86_FEATURE_GBPAGES), true, true); else __reset_rsvds_bits_mask_ept(shadow_zero_check, - boot_cpu_data.x86_phys_bits, + shadow_first_rsvd_bit, false); if (!shadow_me_mask) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 54c2a377795b..45e579e112ea 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -51,6 +51,7 @@ static inline u64 rsvd_bits(int s, int e) return ((1ULL << (e - s + 1)) - 1) << s; } +void kvm_set_mmio_spte_mask(void); void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value); void diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 38440316a806..3a71aab495b8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6848,35 +6848,6 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = { .get_guest_ip = kvm_get_guest_ip, }; -static void kvm_set_mmio_spte_mask(void) -{ - u64 mask; - int maxphyaddr = boot_cpu_data.x86_phys_bits; - - /* - * Set the reserved bits and the present bit of an paging-structure - * entry to generate page fault with PFER.RSV = 1. - */ - - /* - * Mask the uppermost physical address bit, which would be reserved as - * long as the supported physical address width is less than 52. - */ - mask = 1ull << 51; - - /* Set the present bit. */ - mask |= 1ull; - - /* - * If reserved bit is not supported, clear the present bit to disable - * mmio page fault. - */ - if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52) - mask &= ~1ull; - - kvm_mmu_set_mmio_spte_mask(mask, mask); -} - #ifdef CONFIG_X86_64 static void pvclock_gtod_update_fn(struct work_struct *work) {
With both Intel MKTME and AMD SME/SEV, physical address bits are reduced due to several high bits of physical address are repurposed for memory encryption. To honor such behavior those repurposed bits are reduced from cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported by CPUID anymore. KVM uses boot_cpu_data.x86_phys_bits to calculate several SPTE masks based on assumption that: 1) boot_cpu_data.x86_phys_bits equals to physical address bits reported by CPUID -- this is used to check CPU has reserved bits when KVM calculates shadow_mmio_{value|mask}; and whether shadow_nonpresent_or_rsvd_mask should be setup (KVM assumes L1TF is not present if CPU has 52 physical address bits); 2) if it is smaller than 52, bits [x86_phys_bits, 51] are reserved bits. With Intel MKTME or AMD SME/SEV above assumption is not valid any more, especially when calculating reserved bits with Intel MKTME, since Intel MKTME treats the reduced bits as 'keyID', thus they are not reduced bits, therefore boot_cpu_data.x86_phys_bits cannot be used to calcualte reserved bits anymore, although we can still use it for AMD SME/SEV since SME/SEV treats the reduced bits differently -- they are treated as reserved bits, the same as other reserved bits in page table entity [1]. Fix by introducing a new 'shadow_first_rsvd_bit' variable in kvm x86 MMU code to store the actual position of reserved bits -- for Intel MKTME, it equals to physical address reported by CPUID, and for AMD SME/SEV, it is boot_cpu_data.x86_phys_bits. And in reserved bits related calculation it is used instead of boot_cpu_data.x86_phys_bits. Some other code changes too to make code more reasonable, ie, kvm_set_mmio_spte_mask is moved to x86/kvm/mmu.c from x86/kvm/x86.c to use shadow_first_rsvd_bit; shadow_nonpresent_or_rsvd_mask calculation logic is slightly changed, based on the fact that only Intel CPU is impacted by L1TF, so that KVM can use shadow_first_rsvd_bit to check whether KVM should set shadow_nonpresent_or_rsvd_mask or not. Note that for the physical address bits reported to guest should remain unchanged -- KVM should report physical address reported by CPUID to guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME, there's no harm if guest sets up 'keyID' bits in guest page table (since MKTME only works at physical address level), and KVM doesn't even expose MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore KVM should still reports physcial address reported by CPUID to guest. [1] Section 7.10.1 Determining Support for Secure Memory Encryption, AMD Architecture Programmer's Manual Volume 2: System Programming). Signed-off-by: Kai Huang <kai.huang@linux.intel.com> --- arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++---------- arch/x86/kvm/mmu.h | 1 + arch/x86/kvm/x86.c | 29 ---------- 3 files changed, 125 insertions(+), 57 deletions(-)