Message ID: 20221202061347.1070246-1-chao.p.peng@linux.intel.com
Series: KVM: mm: fd-based approach for supporting KVM
On Fri, Dec 02, 2022, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, a machine check can happen, which can further
> crash the running host system; this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM-userspace-induced crashes by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but can never access the guest memory content.
>
> The patch series touches both core mm and KVM code. I would appreciate
> Andrew/Hugh and Paolo/Sean reviewing and picking these patches. Any other
> reviews are always welcome.
> - 01: mm change, target for mm tree
> - 02-09: KVM change, target for KVM tree

A version with all of my feedback, plus reworked versions of Vishal's
selftest, is available here:

  git@github.com:sean-jc/linux.git x86/upm_base_support

It compiles and passes the selftest, but it's otherwise barely tested. There
are a few todos (2 I think?) and many of the commits need changelogs, i.e.
it's still a WIP.

As for next steps, can you (handwaving all of the TDX folks) take a look at
what I pushed and see if there's anything horrifically broken, and that it
still works for TDX?

Fuad (and pKVM folks), same ask for you with respect to pKVM. Absolutely no
rush (and I mean that).

On my side, the two things on my mind are (a) tests and (b) downstream
dependencies (SEV and TDX). For tests, I want to build a list of tests that
are required for merging so that the criteria for merging are clear, and so
that if the list is large (haven't thought much yet), the work of writing
and running tests can be distributed.

Regarding downstream dependencies, before this lands I want to pull in all
the TDX and SNP series and see how everything fits together. Specifically,
I want to make sure that we don't end up with a uAPI that necessitates ugly
code, and that we don't miss an opportunity to make things simpler. The
patches in the SNP series to add "legacy" SEV support for UPM in particular
made me slightly rethink some minor details. Nothing remotely major, but
something that needs attention since it'll be uAPI.

I'm off Monday, so it'll be at least Tuesday before I make any more
progress on my side.

Thanks!
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's
> selftest, is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> [...]
>
> As for next steps, can you (handwaving all of the TDX folks) take a look
> at what I pushed and see if there's anything horrifically broken, and that
> it still works for TDX?

Minor build fix:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
 	int level;
 	bool mixed;
 
-	lockdep_assert_held_write(kvm->mmu_lock);
-	lockdep_assert_held(kvm->slots_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	lockdep_assert_held(&kvm->slots_lock);
 
 	/*
 	 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-	lockdep_assert_held(kvm->mmu_lock);
+	lockdep_assert_held(&kvm->mmu_lock);
 
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's
> selftest, is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested.
> There are a few todos (2 I think?) and many of the commits need
> changelogs, i.e. it's still a WIP.

Thanks very much for doing this. Almost all of your comments are well
received, except for two cases that need more discussion, to which I have
replied individually.

> As for next steps, can you (handwaving all of the TDX folks) take a look
> at what I pushed and see if there's anything horrifically broken, and that
> it still works for TDX?

I have integrated this into my local TDX repo with some changes (as I
replied individually); the new code basically still works with TDX. I have
also asked other TDX folks to take a look.

> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no
> rush (and I mean that).
>
> [...]
>
> I'm off Monday, so it'll be at least Tuesday before I make any more
> progress on my side.

Appreciate your effort. As for the next steps, feel free to let me know if
you see something we can do in parallel.

Thanks,
Chao
Hi Sean,

On Sat, Jan 14, 2023 at 12:38 AM Sean Christopherson <seanjc@google.com> wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's
> selftest, is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> [...]
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no
> rush (and I mean that).

Thanks for sharing this. I've had a look at the patches and have ported
them to work with pKVM. At a high level, the new interface seems fine and
it works with the arm64/pKVM port. I have a couple of comments regarding
some of the details, but they can wait until v11 is posted.

Cheers,
/fuad
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson <seanjc@google.com> wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
>
> [...]
>
> Regarding downstream dependencies, before this lands, I want to pull in
> all the TDX and SNP series and see how everything fits together. [...]

Although I'm still debugging with TDX KVM, I needed the following.
kvm_faultin_pfn() is called without mmu_lock held; the race to change
private/shared is handled by mmu_seq. Maybe a dedicated function only for
kvm_faultin_pfn() is needed.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02be5e1cba1e..38699ca75ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2322,7 +2322,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
+	// lockdep_assert_held(&kvm->mmu_lock);
 
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > [...]
>
> Although I'm still debugging with TDX KVM, I needed the following.
> kvm_faultin_pfn() is called without mmu_lock held; the race to change
> private/shared is handled by mmu_seq. Maybe a dedicated function only for
> kvm_faultin_pfn().

Gah, you're not on the other thread where this was discussed[*]. Simply
deleting the lockdep assertion is safe; for guest types that rely on the
attributes to define shared vs. private, KVM rechecks the attributes under
the protection of mmu_seq.

I'll get a fixed version pushed out today.

[*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
On Thu, Jan 19, 2023 at 03:25:08PM +0000, Sean Christopherson <seanjc@google.com> wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > [...]
>
> Gah, you're not on the other thread where this was discussed[*]. Simply
> deleting the lockdep assertion is safe; for guest types that rely on the
> attributes to define shared vs. private, KVM rechecks the attributes under
> the protection of mmu_seq.
>
> I'll get a fixed version pushed out today.
>
> [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com

Now I have TDX KVM working. I've uploaded it at the following, rebased to
v6.2-rc3:

  git@github.com:yamahata/linux.git tdx/upm
  git@github.com:yamahata/qemu.git tdx/upm

kvm_mmu_do_page_fault() needs the following change. kvm_mem_is_private()
queries mem_attr_array, and kvm_faultin_pfn() also uses
kvm_mem_is_private(), so the shared-private check in kvm_faultin_pfn()
doesn't make sense. This change would belong to the TDX KVM patches,
though.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0da8e27e0..f45ac438bbf4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -430,7 +430,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.max_level = vcpu->kvm->arch.tdp_max_page_level,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
-		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+		.is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
 	};
 	int r;
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > [...]
>
> Now I have TDX KVM working. I've uploaded it at the following, rebased to
> v6.2-rc3:
>
>   git@github.com:yamahata/linux.git tdx/upm
>   git@github.com:yamahata/qemu.git tdx/upm

And I finally got a working, building version updated and pushed out (again
to):

  git@github.com:sean-jc/linux.git x86/upm_base_support

Took longer than expected to get the memslot restrictions sussed out. I'm
done working on the code for now; my plan is to come back to it+TDX+SNP in
2-3 weeks to resolve any remaining todos (that no one else tackles) and to
do the whole "merge the world" exercise.

> kvm_mmu_do_page_fault() needs the following change. kvm_mem_is_private()
> queries mem_attr_array, and kvm_faultin_pfn() also uses
> kvm_mem_is_private(), so the shared-private check in kvm_faultin_pfn()
> doesn't make sense. This change would belong to the TDX KVM patches,
> though.

Yeah, SNP needs similar treatment. Sorting that out is high up on the todo
list.
On 14/01/2023 00:37, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
>> [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's
> selftest, is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested.
> There are a few todos (2 I think?) and many of the commits need
> changelogs, i.e. it's still a WIP.

When running LTP (https://github.com/linux-test-project/ltp) on the v10
bits (and also with Sean's branch above), I encounter the following NULL
pointer dereference with testcases/kernel/syscalls/madvise/madvise01
(100% reproducible).

It appears that in restrictedmem_error_page(),
inode->i_mapping->private_data is NULL in the
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list), but I
don't know why.
[ 5365.177168] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 5365.178881] #PF: supervisor read access in kernel mode
[ 5365.180006] #PF: error_code(0x0000) - not-present page
[ 5365.181322] PGD 8000000109dad067 P4D 8000000109dad067 PUD 107707067 PMD 0
[ 5365.183474] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5365.184792] CPU: 0 PID: 22086 Comm: madvise01 Not tainted 6.1.0-1.el8.seanjcupm.x86_64 #1
[ 5365.186572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[ 5365.188816] RIP: 0010:restrictedmem_error_page+0xc7/0x1b0
[ 5365.190081] Code: 99 00 48 8b 55 00 48 8b 02 48 8d 8a e8 fe ff ff 48 2d 18 01 00 00 48 39 d5 0f 84 8a 00 00 00 48 8b 51 30 48 8b 92 b8 00 00 00 <48> 8b 4a 28 4c 39 b1 d8 00 00 00 74 22 48 8b 88 18 01 00 00 48 8d
[ 5365.193984] RSP: 0018:ffff9b7343c07d80 EFLAGS: 00010206
[ 5365.195142] RAX: ffff8e5b410cfc70 RBX: 0000000000000001 RCX: ffff8e5b4048e580
[ 5365.196888] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5365.198399] RBP: ffff8e5b410cfd88 R08: 0000000000000000 R09: 0000000000000000
[ 5365.200200] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5365.201843] R13: ffff8e5b410cfd80 R14: ffff8e5b47cc7618 R15: ffffd49d44c05080
[ 5365.203472] FS:  00007fc96de9b5c0(0000) GS:ffff8e5deda00000(0000) knlGS:0000000000000000
[ 5365.205485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5365.206791] CR2: 0000000000000028 CR3: 000000012047e002 CR4: 0000000000170ef0
[ 5365.208131] Call Trace:
[ 5365.208752]  <TASK>
[ 5365.209229]  me_pagecache_clean+0x58/0x100
[ 5365.210196]  identify_page_state+0x84/0xd0
[ 5365.211180]  memory_failure+0x231/0x8b0
[ 5365.212148]  madvise_inject_error.cold+0x8d/0xa4
[ 5365.213317]  do_madvise+0x363/0x3a0
[ 5365.214177]  __x64_sys_madvise+0x2c/0x40
[ 5365.215159]  do_syscall_64+0x3f/0xa0
[ 5365.216016]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5365.217130] RIP: 0033:0x7fc96d8399ab
[ 5365.217953] Code: 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 54 38 00 f7 d8 64 89 01 48
[ 5365.222323] RSP: 002b:00007fff62a99b18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 5365.224026] RAX: ffffffffffffffda RBX: 000000000041ce00 RCX: 00007fc96d8399ab
[ 5365.225375] RDX: 0000000000000064 RSI: 000000000000a000 RDI: 00007fc96de8e000
[ 5365.226999] RBP: 00007fc96de9b540 R08: 0000000000000001 R09: 0000000000415c80
[ 5365.228641] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000008
[ 5365.230074] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Regards,
Liam
On Tue, Jan 24, 2023, Liam Merwick wrote:
> On 14/01/2023 00:37, Sean Christopherson wrote:
> > [...]
>
> When running LTP (https://github.com/linux-test-project/ltp) on the v10
> bits (and also with Sean's branch above), I encounter the following NULL
> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> (100% reproducible).
>
> It appears that in restrictedmem_error_page(),
> inode->i_mapping->private_data is NULL in the
> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list), but I
> don't know why.

Kirill, can you take a look? Or pass the buck to someone who can? :-)
On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote: > On Tue, Jan 24, 2023, Liam Merwick wrote: > > On 14/01/2023 00:37, Sean Christopherson wrote: > > > On Fri, Dec 02, 2022, Chao Peng wrote: > > > > This patch series implements KVM guest private memory for confidential > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > > TDX-protected guest memory, machine check can happen which can further > > > > crash the running host system, this is terrible for multi-tenant > > > > configurations. The host accesses include those from KVM userspace like > > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > > via a fd-based approach, but it can never access the guest memory > > > > content. > > > > > > > > The patch series touches both core mm and KVM code. I appreciate > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other > > > > reviews are always welcome. > > > > - 01: mm change, target for mm tree > > > > - 02-09: KVM change, target for KVM tree > > > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest, > > > is available here: > > > > > > git@github.com:sean-jc/linux.git x86/upm_base_support > > > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still > > > a WIP. > > > > > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10 > > bits (and also with Sean's branch above) I encounter the following NULL > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01 > > (100% reproducible). > > > > It appears that in restrictedmem_error_page() inode->i_mapping->private_data > > is NULL > > in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) > > but I don't know why. > > Kirill, can you take a look? 
> Or pass the buck to someone who can? :-)

The patch below should help.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..39ada985c7c0 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		struct restrictedmem *rm = inode->i_mapping->private_data;
 		struct restrictedmem_notifier *notifier;
-		struct file *memfd = rm->memfd;
+		struct restrictedmem *rm;
 		unsigned long index;
+		struct file *memfd;
 
-		if (memfd->f_mapping != mapping)
+		if (atomic_read(&inode->i_count))
 			continue;
 
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		rm = inode->i_mapping->private_data;
+		memfd = rm->memfd;
+
+		if (memfd->f_mapping != mapping) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		spin_unlock(&inode->i_lock);
+
 		xa_for_each_range(&rm->bindings, index, notifier, start, end)
 			notifier->ops->error(notifier, start, end);
 		break;
On 25/01/2023 12:53, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
>> On Tue, Jan 24, 2023, Liam Merwick wrote:
>>> On 14/01/2023 00:37, Sean Christopherson wrote:
>>>> On Fri, Dec 02, 2022, Chao Peng wrote:
...
>>>
>>> When running LTP (https://github.com/linux-test-project/ltp) on the v10
>>> bits (and also with Sean's branch above) I encounter the following NULL
>>> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
>>> (100% reproducible).
>>>
>>> It appears that in restrictedmem_error_page()
>>> inode->i_mapping->private_data is NULL
>>> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
>>> but I don't know why.
>>
>> Kirill, can you take a look? Or pass the buck to someone who can? :-)
>
> The patch below should help.

Thanks, this works for me.

Regards,
Liam

>
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>  
>  	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
>  		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> +		struct restrictedmem *rm;
>  		unsigned long index;
> +		struct file *memfd;
>  
> -		if (memfd->f_mapping != mapping)
> +		if (atomic_read(&inode->i_count))
>  			continue;
>  
> +		spin_lock(&inode->i_lock);
> +		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +
> +		rm = inode->i_mapping->private_data;
> +		memfd = rm->memfd;
> +
> +		if (memfd->f_mapping != mapping) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		spin_unlock(&inode->i_lock);
> +
>  		xa_for_each_range(&rm->bindings, index, notifier, start, end)
>  			notifier->ops->error(notifier, start, end);
>  		break;
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson <seanjc@google.com> wrote: > On Thu, Jan 19, 2023, Isaku Yamahata wrote: > > On Thu, Jan 19, 2023 at 03:25:08PM +0000, > > Sean Christopherson <seanjc@google.com> wrote: > > > > > On Thu, Jan 19, 2023, Isaku Yamahata wrote: > > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000, > > > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote: > > > > > > This patch series implements KVM guest private memory for confidential > > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > > > > TDX-protected guest memory, machine check can happen which can further > > > > > > crash the running host system, this is terrible for multi-tenant > > > > > > configurations. The host accesses include those from KVM userspace like > > > > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > > > > via a fd-based approach, but it can never access the guest memory > > > > > > content. > > > > > > > > > > > > The patch series touches both core mm and KVM code. I appreciate > > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other > > > > > > reviews are always welcome. > > > > > > - 01: mm change, target for mm tree > > > > > > - 02-09: KVM change, target for KVM tree > > > > > > > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest, > > > > > is available here: > > > > > > > > > > git@github.com:sean-jc/linux.git x86/upm_base_support > > > > > > > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are > > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still > > > > > a WIP. 
> > > > > > > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what > > > > > I pushed and see if there's anything horrifically broken, and that it still works > > > > > for TDX? > > > > > > > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush > > > > > (and I mean that). > > > > > > > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies > > > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for > > > > > merging so that the criteria for merging are clear, and so that if the list is large > > > > > (haven't thought much yet), the work of writing and running tests can be distributed. > > > > > > > > > > Regarding downstream dependencies, before this lands, I want to pull in all the > > > > > TDX and SNP series and see how everything fits together. Specifically, I want to > > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we > > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to > > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor > > > > > details. Nothing remotely major, but something that needs attention since it'll > > > > > be uAPI. > > > > > > > > Although I'm still debuging with TDX KVM, I needed the following. > > > > kvm_faultin_pfn() is called without mmu_lock held. the race to change > > > > private/shared is handled by mmu_seq. Maybe dedicated function only for > > > > kvm_faultin_pfn(). > > > > > > Gah, you're not on the other thread where this was discussed[*]. Simply deleting > > > the lockdep assertion is safe, for guest types that rely on the attributes to > > > define shared vs. private, KVM rechecks the attributes under the protection of > > > mmu_seq. > > > > > > I'll get a fixed version pushed out today. 
> > > > > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com > > > > Now I have tdx kvm working. I've uploaded at the followings. > > It's rebased to v6.2-rc3. > > git@github.com:yamahata/linux.git tdx/upm > > git@github.com:yamahata/qemu.git tdx/upm > > And I finally got a working, building version updated and pushed out (again to): > > git@github.com:sean-jc/linux.git x86/upm_base_support > Ok, I rebased TDX part to the updated branch. git@github.com:yamahata/linux.git tdx/upm git@github.com:yamahata/qemu.git tdx/upm Now it's v6.2-rc7 based. qemu needs more patches to avoid registering memory slot for SMM.
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote: > On Thu, Jan 19, 2023, Isaku Yamahata wrote: > > On Thu, Jan 19, 2023 at 03:25:08PM +0000, > > Sean Christopherson <seanjc@google.com> wrote: > > > > > On Thu, Jan 19, 2023, Isaku Yamahata wrote: > > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000, > > > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote: > > > > > > This patch series implements KVM guest private memory for confidential > > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > > > > TDX-protected guest memory, machine check can happen which can further > > > > > > crash the running host system, this is terrible for multi-tenant > > > > > > configurations. The host accesses include those from KVM userspace like > > > > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > > > > via a fd-based approach, but it can never access the guest memory > > > > > > content. > > > > > > > > > > > > The patch series touches both core mm and KVM code. I appreciate > > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other > > > > > > reviews are always welcome. > > > > > > - 01: mm change, target for mm tree > > > > > > - 02-09: KVM change, target for KVM tree > > > > > > > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest, > > > > > is available here: > > > > > > > > > > git@github.com:sean-jc/linux.git x86/upm_base_support > > > > > > > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are > > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still > > > > > a WIP. 
> > > > > > > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what > > > > > I pushed and see if there's anything horrifically broken, and that it still works > > > > > for TDX? > > > > > > > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush > > > > > (and I mean that). > > > > > > > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies > > > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for > > > > > merging so that the criteria for merging are clear, and so that if the list is large > > > > > (haven't thought much yet), the work of writing and running tests can be distributed. > > > > > > > > > > Regarding downstream dependencies, before this lands, I want to pull in all the > > > > > TDX and SNP series and see how everything fits together. Specifically, I want to > > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we > > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to > > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor > > > > > details. Nothing remotely major, but something that needs attention since it'll > > > > > be uAPI. > > > > > > > > Although I'm still debuging with TDX KVM, I needed the following. > > > > kvm_faultin_pfn() is called without mmu_lock held. the race to change > > > > private/shared is handled by mmu_seq. Maybe dedicated function only for > > > > kvm_faultin_pfn(). > > > > > > Gah, you're not on the other thread where this was discussed[*]. Simply deleting > > > the lockdep assertion is safe, for guest types that rely on the attributes to > > > define shared vs. private, KVM rechecks the attributes under the protection of > > > mmu_seq. > > > > > > I'll get a fixed version pushed out today. 
> > > > > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com > > > > Now I have tdx kvm working. I've uploaded at the followings. > > It's rebased to v6.2-rc3. > > git@github.com:yamahata/linux.git tdx/upm > > git@github.com:yamahata/qemu.git tdx/upm > > And I finally got a working, building version updated and pushed out (again to): > > git@github.com:sean-jc/linux.git x86/upm_base_support > > Took longer than expected to get the memslot restrictions sussed out. I'm done > working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks > to resolves any remaining todos (that no one else tackles) and to do the whole > "merge the world" excersise. > > > kvm_mmu_do_page_fault() needs the following change. > > kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses > > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't > > make sense. This change would belong to TDX KVM patches, though. > > Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list. 
Hi Sean,

We've rebased the SEV+SNP support onto your updated UPM base support
tree and things seem to be working okay, but we needed some fixups on
top of the base support to get things working, along with 1 workaround
for an issue that hasn't been root-caused yet:

  https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip

  *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
  *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
  *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
  *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
  *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
  *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes

We plan to post an updated RFC for v8 soon, but also wanted to share
the staging tree in case you end up looking at the UPM integration aspects
before then.

-Mike
Hi,

On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.

Sorry for jumping in late.

Unless I'm missing something, hibernation will also cause a machine check
when there is TDX-protected memory in the system. When hibernation creates
a memory snapshot it essentially walks all physical pages and saves their
contents, so for TDX memory this will trigger a machine check, right?

>  Documentation/virt/kvm/api.rst         | 125 ++++++-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/asm/kvm_host.h        |   9 +
>  arch/x86/kvm/Kconfig                   |   3 +
>  arch/x86/kvm/mmu/mmu.c                 | 205 ++++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
>  arch/x86/kvm/mmu/mmutrace.h            |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c             |   2 +-
>  arch/x86/kvm/x86.c                     |  17 +-
>  include/linux/kvm_host.h               | 103 +++++-
>  include/linux/restrictedmem.h          |  71 ++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/kvm.h               |  53 +++
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/memory-failure.c                    |   3 +
>  mm/restrictedmem.c                     | 318 +++++++++++++++++
>  virt/kvm/Kconfig                       |   6 +
>  virt/kvm/kvm_main.c                    | 469 +++++++++++++++++++++----
>  23 files changed, 1323 insertions(+), 93 deletions(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
On 16.02.23 06:13, Mike Rapoport wrote:
> Hi,
>
> On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1]. If a TDX host accesses
>> TDX-protected guest memory, machine check can happen which can further
>> crash the running host system, this is terrible for multi-tenant
>> configurations. The host accesses include those from KVM userspace like
>> QEMU. This series addresses KVM userspace induced crash by introducing
>> new mm and KVM interfaces so KVM userspace can still manage guest memory
>> via a fd-based approach, but it can never access the guest memory
>> content.
>
> Sorry for jumping in late.
>
> Unless I'm missing something, hibernation will also cause a machine check
> when there is TDX-protected memory in the system. When hibernation creates
> a memory snapshot it essentially walks all physical pages and saves their
> contents, so for TDX memory this will trigger a machine check, right?

I recall bringing that up in the past (also memory accesses due to kdump and
/proc/kcore) and was told that the main focus for now is preventing
unprivileged users from crashing the system, that is, not mapping such
memory into user space (e.g., QEMU). In the long run, we'll want to handle
such pages properly in the other events where the kernel might access
them.
> Hi Sean,
>
> We've rebased the SEV+SNP support onto your updated UPM base support
> tree and things seem to be working okay, but we needed some fixups on
> top of the base support to get things working, along with 1 workaround
> for an issue that hasn't been root-caused yet:
>
>   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
>
>   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
>   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check

What I'm seeing is that Slot#3 gets added first and then deleted. When it
gets added, Slot#0 already has the same range bound to restrictedmem, which
triggers the exclusivity check. This check is exactly what the current code
is there for.

>   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
>   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations

As many kernel APIs treat 'end' as exclusive, I would rather keep using an
exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
and the notifier callbacks) but fix it up internally in restrictedmem, e.g.
in all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
See below for the change.

>   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations

Subtracting slot->restrictedmem.index for start/end in
restrictedmem_get_gfn_range() is the correct fix.

>   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
>
> We plan to post an updated RFC for v8 soon, but also wanted to share
> the staging tree in case you end up looking at the UPM integration aspects
> before then.
>
> -Mike

This is the restrictedmem fix to solve 'end' being stored and checked in
the xarray:

--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
 	 */
 	down_read(&rm->lock);
 
-	xa_for_each_range(&rm->bindings, index, notifier, start, end)
+	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
 		notifier->ops->invalidate_start(notifier, start, end);
 
 	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
 
-	xa_for_each_range(&rm->bindings, index, notifier, start, end)
+	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
 		notifier->ops->invalidate_end(notifier, start, end);
 
 	up_read(&rm->lock);
@@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
 		}
 		spin_unlock(&inode->i_lock);
 
-		xa_for_each_range(&rm->bindings, index, notifier, start, end)
+		xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
 			notifier->ops->error(notifier, start, end);
 		break;
 	}
@@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
 		if (exclusive != rm->exclusive)
 			goto out_unlock;
 
-		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
+		if (exclusive &&
+		    xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
 			goto out_unlock;
 	}
 
-	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
+	xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
 	rm->exclusive = exclusive;
 	ret = 0;
 out_unlock:
@@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
 	struct restrictedmem *rm = file->f_mapping->private_data;
 
 	down_write(&rm->lock);
-	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
+	xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
 	synchronize_rcu();
 	up_write(&rm->lock);
 }
On Thu, Feb 16, 2023, David Hildenbrand wrote: > On 16.02.23 06:13, Mike Rapoport wrote: > > Hi, > > > > On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote: > > > This patch series implements KVM guest private memory for confidential > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > TDX-protected guest memory, machine check can happen which can further > > > crash the running host system, this is terrible for multi-tenant > > > configurations. The host accesses include those from KVM userspace like > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > via a fd-based approach, but it can never access the guest memory > > > content. > > > > Sorry for jumping late. > > > > Unless I'm missing something, hibernation will also cause an machine check > > when there is TDX-protected memory in the system. When the hibernation > > creates memory snapshot it essentially walks all physical pages and saves > > their contents, so for TDX memory this will trigger machine check, right? For hibernation specifically, I think that should be handled elsewhere as hibernation is simply incompatible with TDX, SNP, pKVM, etc. without paravirtualizing the guest, as none of those technologies support auto-export a la s390. I suspect the right approach is to disallow hibernation if KVM is running any protected guests. > I recall bringing that up in the past (also memory access due to kdump, > /prov/kcore) and was told that the main focus for now is preventing > unprivileged users from crashing the system, that is, not mapping such > memory into user space (e.g., QEMU). In the long run, we'll want to handle > such pages also properly in the other events where the kernel might access > them. 
Ya, unless someone strongly objects, the plan is to essentially treat "attacks"
from privileged users as out of scope for initial support, and then iterate as
needed to fix/enable more features.

FWIW, read accesses, e.g. kdump, should be ok for TDX and SNP as they both play
nice with "bad" reads. pKVM is a different beast though, as I believe any access
to guest private memory will fault. But my understanding is that this series
would be a big step forward for pKVM, which currently doesn't have any
safeguards.
On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > Hi Sean,
> >
> > We've rebased the SEV+SNP support onto your updated UPM base support
> > tree and things seem to be working okay, but we needed some fixups on
> > top of the base support to get things working, along with 1 workaround
> > for an issue that hasn't been root-caused yet:
> >
> >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> >
> >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
>
> What I'm seeing is that Slot#3 gets added first and then deleted. When it
> gets added, Slot#0 already has the same range bound to restrictedmem, which
> triggers the exclusivity check. This check is exactly what the current code
> is there for.

With the following change in QEMU, we no longer trigger this check:

diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 20da121374..849b5de469 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
     memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
                              mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
                              MCH_HOST_BRIDGE_SMRAM_C_SIZE);
+    memory_region_set_enabled(&mch->open_high_smram, false);
     memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
                                         &mch->open_high_smram, 1);
-    memory_region_set_enabled(&mch->open_high_smram, false);

I'm not sure if QEMU is actually doing something wrong here though, or if
this check is putting tighter restrictions on userspace than what was
expected before. Will look into it more.
> > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations > > As many kernel APIs treat 'end' as exclusive, I would rather keep using > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind > and notifier callbacks) but fix it internally in the restrictedmem. E.g. > all the places where xarray API needs a 'last'/'max' we use 'end - 1'. > See below for the change. Yes I did feel like I was fighting the kernel a bit on that; your suggestion seems like it would be a better fit. > > > *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations > > Subtracting slot->restrictedmem.index for start/end in > restrictedmem_get_gfn_range() is the correct fix. > > > *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes > > > > We plan to post an updated RFC for v8 soon, but also wanted to share > > the staging tree in case you end up looking at the UPM integration aspects > > before then. > > > > -Mike > > This is the restrictedmem fix to solve 'end' being stored and checked in xarray: Looks good. Thanks! 
-Mike

>
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
>  	 */
>  	down_read(&rm->lock);
>  
> -	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>  		notifier->ops->invalidate_start(notifier, start, end);
>  
>  	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>  
> -	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>  		notifier->ops->invalidate_end(notifier, start, end);
>  
>  	up_read(&rm->lock);
> @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
>  		}
>  		spin_unlock(&inode->i_lock);
>  
> -		xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +		xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>  			notifier->ops->error(notifier, start, end);
>  		break;
>  	}
> @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>  		if (exclusive != rm->exclusive)
>  			goto out_unlock;
>  
> -		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> +		if (exclusive &&
> +		    xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
>  			goto out_unlock;
>  	}
>  
> -	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> +	xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
>  	rm->exclusive = exclusive;
>  	ret = 0;
>  out_unlock:
> @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
>  	struct restrictedmem *rm = file->f_mapping->private_data;
>  
>  	down_write(&rm->lock);
> -	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> +	xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
>  	synchronize_rcu();
>  	up_write(&rm->lock);
> }
On Wed, Mar 22, 2023 at 08:27:37PM -0500, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > Hi Sean,
> > >
> > > We've rebased the SEV+SNP support onto your updated UPM base support
> > > tree and things seem to be working okay, but we needed some fixups on
> > > top of the base support to get things working, along with 1 workaround
> > > for an issue that hasn't been root-caused yet:
> > >
> > >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > >
> > >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> > >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> >
> > What I'm seeing is that Slot#3 gets added first and then deleted. When it
> > gets added, Slot#0 already has the same range bound to restrictedmem, which
> > triggers the exclusivity check. This check is exactly what the current code
> > is there for.
>
> With the following change in QEMU, we no longer trigger this check:
>
> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> index 20da121374..849b5de469 100644
> --- a/hw/pci-host/q35.c
> +++ b/hw/pci-host/q35.c
> @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
>      memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
>                               mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
>                               MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> +    memory_region_set_enabled(&mch->open_high_smram, false);
>      memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
>                                          &mch->open_high_smram, 1);
> -    memory_region_set_enabled(&mch->open_high_smram, false);
>
> I'm not sure if QEMU is actually doing something wrong here though, or if
> this check is putting tighter restrictions on userspace than what was
> expected before. Will look into it more.

I don't think the above QEMU change is acceptable upstream; it may break
functionality for 'normal' VMs.
The UPM check does put a tighter restriction: you can't bind the same fd
range to more than one memslot. SMRAM in QEMU, however, violates this
restriction. The right 'fix' is to disable SMM in QEMU for UPM usages
rather than trying to work around it. There is more discussion in the
link below:

  https://lore.kernel.org/all/Y8bOB7VuVIsxoMcn@google.com/

Chao
>
> >
> > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> >
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using an
> > exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
> > and the notifier callbacks) but fix it up internally in restrictedmem, e.g.
> > in all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
>
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.
>
> >
> > > *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> >
> > Subtracting slot->restrictedmem.index for start/end in
> > restrictedmem_get_gfn_range() is the correct fix.
> >
> > > *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > >
> > > We plan to post an updated RFC for v8 soon, but also wanted to share
> > > the staging tree in case you end up looking at the UPM integration aspects
> > > before then.
> > >
> > > -Mike
> >
> > This is the restrictedmem fix to solve 'end' being stored and checked in xarray:
>
> Looks good.
>
> Thanks!
> > -Mike
> >
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> >  	 */
> >  	down_read(&rm->lock);
> >
> > -	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >  		notifier->ops->invalidate_start(notifier, start, end);
> >
> >  	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >
> > -	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >  		notifier->ops->invalidate_end(notifier, start, end);
> >
> >  	up_read(&rm->lock);
> > @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
> >  		}
> >  		spin_unlock(&inode->i_lock);
> >
> > -		xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +		xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >  			notifier->ops->error(notifier, start, end);
> >  		break;
> >  	}
> > @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> >  		if (exclusive != rm->exclusive)
> >  			goto out_unlock;
> >
> > -		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> > +		if (exclusive &&
> > +		    xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
> >  			goto out_unlock;
> >  	}
> >
> > -	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> > +	xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
> >  	rm->exclusive = exclusive;
> >  	ret = 0;
> >  out_unlock:
> > @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> >  	struct restrictedmem *rm = file->f_mapping->private_data;
> >
> >  	down_write(&rm->lock);
> > -	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> > +	xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
> >  	synchronize_rcu();
> >  	up_write(&rm->lock);
> >  }
On Wed, Mar 22, 2023, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> >
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in restrictedmem. E.g. in
> > all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
>
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.

Comically belated +1, XArray is the odd one here.
On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > >
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > reviews are always welcome.
> > > > > - 01: mm change, target for mm tree
> > > > > - 02-09: KVM change, target for KVM tree
> > > >
> > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > is available here:
> > > >
> > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > >
> > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > a WIP.
> > >
> > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > bits (and also with Sean's branch above) I encounter the following NULL
> > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > (100% reproducible).
> > >
> > > It appears that in restrictedmem_error_page()
> > > inode->i_mapping->private_data is NULL in the
> > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > don't know why.
> >
> > Kirill, can you take a look? Or pass the buck to someone who can? :-)
>
> The patch below should help.
>
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>
>  	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
>  		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> +		struct restrictedmem *rm;
>  		unsigned long index;
> +		struct file *memfd;
>
> -		if (memfd->f_mapping != mapping)
> +		if (atomic_read(&inode->i_count))

Kirill, should this be

	if (!atomic_read(&inode->i_count))
		continue;

i.e. skip unreferenced inodes, not skip referenced inodes?
On Wed, Apr 12, 2023 at 06:07:28PM -0700, Sean Christopherson wrote:
> On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> > On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > >
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > > - 01: mm change, target for mm tree
> > > > > > - 02-09: KVM change, target for KVM tree
> > > > >
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > >
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > >
> > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > >
> > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > > bits (and also with Sean's branch above) I encounter the following NULL
> > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > > (100% reproducible).
> > > >
> > > > It appears that in restrictedmem_error_page()
> > > > inode->i_mapping->private_data is NULL in the
> > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > > don't know why.
> > >
> > > Kirill, can you take a look? Or pass the buck to someone who can? :-)
> >
> > The patch below should help.
> >
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > index 15c52301eeb9..39ada985c7c0 100644
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> >
> >  	spin_lock(&sb->s_inode_list_lock);
> >  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > -		struct restrictedmem *rm = inode->i_mapping->private_data;
> >  		struct restrictedmem_notifier *notifier;
> > -		struct file *memfd = rm->memfd;
> > +		struct restrictedmem *rm;
> >  		unsigned long index;
> > +		struct file *memfd;
> >
> > -		if (memfd->f_mapping != mapping)
> > +		if (atomic_read(&inode->i_count))
>
> Kirill, should this be
>
> 	if (!atomic_read(&inode->i_count))
> 		continue;
>
> i.e. skip unreferenced inodes, not skip referenced inodes?

Ouch. Yes. But looking at other instances of s_inodes usage, I think we
can drop the check altogether: an inode cannot be completely freed until
it is removed from the s_inodes list.

While there, replace list_for_each_entry_safe() with list_for_each_entry()
as we don't remove anything from the list.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 55e99e6c09a1..8e8a4420d3d1 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -194,22 +194,19 @@ static int restricted_error_remove_page(struct address_space *mapping,
 					struct page *page)
 {
 	struct super_block *sb = restrictedmem_mnt->mnt_sb;
-	struct inode *inode, *next;
+	struct inode *inode;
 	pgoff_t start, end;

 	start = page->index;
 	end = start + thp_nr_pages(page);

 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct restrictedmem_notifier *notifier;
 		struct restrictedmem *rm;
 		unsigned long index;
 		struct file *memfd;

-		if (atomic_read(&inode->i_count))
-			continue;
-
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> >
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > >
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > > - 01: mm change, target for mm tree
> > > > > > - 02-09: KVM change, target for KVM tree
> > > > >
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > >
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > >
> > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > >
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > >
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > > > > (and I mean that).
> > > > >
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX). For tests, I want to build a list of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > >
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together. Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details. Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > >
> > > > Although I'm still debugging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held. The race to change
> > > > private/shared is handled by mmu_seq. Maybe a dedicated function only for
> > > > kvm_faultin_pfn().
> > >
> > > Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> > > the lockdep assertion is safe; for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > >
> > > I'll get a fixed version pushed out today.
> > >
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> >
> > Now I have tdx kvm working. I've uploaded it at the following.
> > It's rebased to v6.2-rc3.
> >
> >   git@github.com:yamahata/linux.git tdx/upm
> >   git@github.com:yamahata/qemu.git tdx/upm
>
> And I finally got a working, building version updated and pushed out (again to):
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> Took longer than expected to get the memslot restrictions sussed out. I'm done
> working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.

Hi Sean,

In case you started working on the code again, I have a branch [1]
originally planned as a v11 candidate in which I believe I addressed all
the discussions we had for v10 except the very latest one [2], and
integrated all the newly added selftests from Ackerley and myself. The
branch was based on your original upm_base_support and then rebased to
your kvm-x86/mmu head. Feel free to take anything you think useful (most
of them are trivial things but also some fixes for bugs).

[1] https://github.com/chao-p/linux/commits/privmem-v11.6
[2] https://lore.kernel.org/all/20230413160405.h6ov2yl6l3i7mvsj@box.shutemov.name/

Chao

> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
>
> Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.
On Mon, Apr 17, 2023, Chao Peng wrote:
> In case you started working on the code again, I have a branch [1]
> originally planned as a v11 candidate in which I believe I addressed all
> the discussions we had for v10 except the very latest one [2], and
> integrated all the newly added selftests from Ackerley and myself. The
> branch was based on your original upm_base_support and then rebased to
> your kvm-x86/mmu head. Feel free to take anything you think useful (most
> of them are trivial things but also some fixes for bugs).

Nice! I am going to work on splicing together the various series this week,
I'll make sure to grab your work.

Thanks much!