Message ID | 20191008041334.3235-1-sean.j.christopherson@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/sgx: WARN once if EREMOVE fails when killing an enclave |
On Mon, Oct 07, 2019 at 09:13:34PM -0700, Sean Christopherson wrote:
> WARN if EREMOVE fails when destroying an enclave.  sgx_encl_release()
> uses the non-WARN __sgx_free_page() when freeing pages as some pages may
> be in the process of being reclaimed, i.e. are owned by the reclaimer.
> But EREMOVE should never fail as sgx_encl_destroy() is only called when
> the enclave cannot have active threads, e.g. prior to EINIT and when the
> enclave is being released.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 54ca827e68a9..a6786e7ae40e 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -463,16 +463,23 @@ void sgx_encl_destroy(struct sgx_encl *encl)
>  	struct sgx_encl_page *entry;
>  	struct radix_tree_iter iter;
>  	void **slot;
> +	int r;
>
>  	atomic_or(SGX_ENCL_DEAD, &encl->flags);
>
>  	radix_tree_for_each_slot(slot, &encl->page_tree, &iter, 0) {
>  		entry = *slot;
>  		if (entry->epc_page) {
> -			if (!__sgx_free_page(entry->epc_page)) {
> +			/*
> +			 * Freeing the page can fail if it's in the process of
> +			 * being reclaimed (-EBUSY), but EREMOVE itself should
> +			 * not fail at this point.
> +			 */
> +			r = __sgx_free_page(entry->epc_page);
> +			WARN_ONCE(r > 0, "sgx: EREMOVE returned %d (0x%x)", r, r);
> +			if (!r) {
>  				encl->secs_child_cnt--;
>  				entry->epc_page = NULL;
> -
>  			}
>
>  		radix_tree_delete(&entry->encl->page_tree,
> --
> 2.22.0

Intended for v23, forgot to tag the subject...

On Mon, Oct 07, 2019 at 09:13:34PM -0700, Sean Christopherson wrote:
> WARN if EREMOVE fails when destroying an enclave.  sgx_encl_release()
> uses the non-WARN __sgx_free_page() when freeing pages as some pages may
> be in the process of being reclaimed, i.e. are owned by the reclaimer.
> But EREMOVE should never fail as sgx_encl_destroy() is only called when
> the enclave cannot have active threads, e.g. prior to EINIT and when the
> enclave is being released.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

For me this concludes that I will manually convert all the call sites
to use __sgx_free_page() and add appropriate warnings. I agree with
Borislav's conclusions here.

/Jarkko

On Wed, Oct 09, 2019 at 03:04:50AM +0300, Jarkko Sakkinen wrote:
> On Mon, Oct 07, 2019 at 09:13:34PM -0700, Sean Christopherson wrote:
> > WARN if EREMOVE fails when destroying an enclave.  sgx_encl_release()
> > uses the non-WARN __sgx_free_page() when freeing pages as some pages may
> > be in the process of being reclaimed, i.e. are owned by the reclaimer.
> > But EREMOVE should never fail as sgx_encl_destroy() is only called when
> > the enclave cannot have active threads, e.g. prior to EINIT and when the
> > enclave is being released.
> >
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>
> For me this concludes that I will manually convert all the call sites
> to use __sgx_free_page() and add appropriate warnings. I agree with
> Borislav's conclusions here.

Argh, now we have a bunch of call sites that can silently leak EPC pages,
and I'm seeing timeouts during testing that strongly suggest pages are
being leaked...

On Thu, Oct 10, 2019 at 11:35:48AM -0700, Sean Christopherson wrote:
> On Wed, Oct 09, 2019 at 03:04:50AM +0300, Jarkko Sakkinen wrote:
> > On Mon, Oct 07, 2019 at 09:13:34PM -0700, Sean Christopherson wrote:
> > > WARN if EREMOVE fails when destroying an enclave.  sgx_encl_release()
> > > uses the non-WARN __sgx_free_page() when freeing pages as some pages may
> > > be in the process of being reclaimed, i.e. are owned by the reclaimer.
> > > But EREMOVE should never fail as sgx_encl_destroy() is only called when
> > > the enclave cannot have active threads, e.g. prior to EINIT and when the
> > > enclave is being released.
> > >
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> > For me this concludes that I will manually convert all the call sites
> > to use __sgx_free_page() and add appropriate warnings. I agree with
> > Borislav's conclusions here.
>
> Argh, now we have a bunch of call sites that can silently leak EPC pages,
> and I'm seeing timeouts during testing that strongly suggest pages are
> being leaked...

Confirmed that we're leaking pages, but it's not related to the -EBUSY
case in sgx_free_page().  Debug in progress...

As to the sgx_free_page() thing, I think we can invert the old WARN logic
and make everyone happy.  I'll send a patch.

On Thu, Oct 10, 2019 at 11:56:07AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2019 at 11:35:48AM -0700, Sean Christopherson wrote:
> > On Wed, Oct 09, 2019 at 03:04:50AM +0300, Jarkko Sakkinen wrote:
> > > On Mon, Oct 07, 2019 at 09:13:34PM -0700, Sean Christopherson wrote:
> > > > WARN if EREMOVE fails when destroying an enclave.  sgx_encl_release()
> > > > uses the non-WARN __sgx_free_page() when freeing pages as some pages may
> > > > be in the process of being reclaimed, i.e. are owned by the reclaimer.
> > > > But EREMOVE should never fail as sgx_encl_destroy() is only called when
> > > > the enclave cannot have active threads, e.g. prior to EINIT and when the
> > > > enclave is being released.
> > > >
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > >
> > > For me this concludes that I will manually convert all the call sites
> > > to use __sgx_free_page() and add appropriate warnings. I agree with
> > > Borislav's conclusions here.
> >
> > Argh, now we have a bunch of call sites that can silently leak EPC pages,
> > and I'm seeing timeouts during testing that strongly suggest pages are
> > being leaked...
>
> Confirmed that we're leaking pages, but it's not related to the -EBUSY
> case in sgx_free_page().  Debug in progress...
>
> As to the sgx_free_page() thing, I think we can invert the old WARN logic
> and make everyone happy.  I'll send a patch.

Figured out what's up.  I'm testing in a VM with multiple EPC sections.
Because of a change in v23[*], sgx_nr_free_pages is getting corrupted due
to non-atomic concurrent writes.  When it drops below 0 and wraps to a
high value the swap thread stops reclaiming and things grind to a halt.

[*] https://patchwork.kernel.org/patch/11146733/#22887361

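The failure mode described in the last reply can be replayed by hand in userspace. The sketch below is only an illustration under stated assumptions: the counter is assumed to be an unsigned long (which would match "drops below 0 and wraps to a high value"), and LOW_WATERMARK, should_reclaim() and replay_lost_update() are hypothetical names, not the driver's code. It shows one lost update from a non-atomic read-modify-write, followed by two correct decrements that push the counter past zero.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical low watermark; the real threshold is not given in the thread. */
#define LOW_WATERMARK 32UL

/* Stand-in for the reclaimer's check: reclaim only when free pages run low. */
static bool should_reclaim(unsigned long nr_free)
{
	return nr_free < LOW_WATERMARK;
}

/*
 * Replay one bad interleaving by hand.  Two CPUs each free a page, but both
 * load the counter before either stores, so one increment is lost.
 */
static unsigned long replay_lost_update(void)
{
	unsigned long nr_free = 0;
	unsigned long cpu_a, cpu_b;

	cpu_a = nr_free;	/* CPU A loads 0 */
	cpu_b = nr_free;	/* CPU B loads 0 */
	nr_free = cpu_a + 1;	/* CPU A stores 1 */
	nr_free = cpu_b + 1;	/* CPU B stores 1, A's increment is lost */

	/* Both pages are later allocated, each with a correct decrement. */
	nr_free--;		/* 1 -> 0 */
	nr_free--;		/* 0 -> ULONG_MAX (unsigned wraparound) */

	return nr_free;
}
```

With the wrapped value, should_reclaim() returns false even though no pages are actually free, which is consistent with the swap thread going idle. Serializing the updates, for example with an atomic type or a lock, would avoid the lost update; the thread does not say which fix was chosen.
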
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 54ca827e68a9..a6786e7ae40e 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -463,16 +463,23 @@ void sgx_encl_destroy(struct sgx_encl *encl)
 	struct sgx_encl_page *entry;
 	struct radix_tree_iter iter;
 	void **slot;
+	int r;

 	atomic_or(SGX_ENCL_DEAD, &encl->flags);

 	radix_tree_for_each_slot(slot, &encl->page_tree, &iter, 0) {
 		entry = *slot;
 		if (entry->epc_page) {
-			if (!__sgx_free_page(entry->epc_page)) {
+			/*
+			 * Freeing the page can fail if it's in the process of
+			 * being reclaimed (-EBUSY), but EREMOVE itself should
+			 * not fail at this point.
+			 */
+			r = __sgx_free_page(entry->epc_page);
+			WARN_ONCE(r > 0, "sgx: EREMOVE returned %d (0x%x)", r, r);
+			if (!r) {
 				encl->secs_child_cnt--;
 				entry->epc_page = NULL;
-
 			}

 		radix_tree_delete(&entry->encl->page_tree,

WARN if EREMOVE fails when destroying an enclave.  sgx_encl_release()
uses the non-WARN __sgx_free_page() when freeing pages as some pages may
be in the process of being reclaimed, i.e. are owned by the reclaimer.
But EREMOVE should never fail as sgx_encl_destroy() is only called when
the enclave cannot have active threads, e.g. prior to EINIT and when the
enclave is being released.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
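
The return-value handling the patch introduces can be sketched outside the kernel as follows. This is a userspace illustration, not the driver's code: warn_once_eremove(), handle_free_result() and the warned/warn_count variables are hypothetical stand-ins, and the only convention taken from the patch is that 0 means the page was freed, -EBUSY means the reclaimer owns it, and a positive value is an EREMOVE error code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the one-shot state behind WARN_ONCE(). */
static bool warned;
static int warn_count;

/* Warn only for positive (EREMOVE) error codes, and only the first time. */
static void warn_once_eremove(int r)
{
	if (r > 0 && !warned) {
		warned = true;
		warn_count++;
		fprintf(stderr, "sgx: EREMOVE returned %d (0x%x)\n",
			r, (unsigned int)r);
	}
}

/*
 * Classify the result of freeing a page.  Returns true only when the page
 * was actually freed, i.e. when the caller should update its bookkeeping
 * (in the patch: decrement secs_child_cnt and clear epc_page).
 */
static bool handle_free_result(int r)
{
	warn_once_eremove(r);
	return r == 0;
}
```

Passing 0 updates the bookkeeping silently, -EBUSY is skipped without a warning because the reclaimer is expected to finish the job, and any positive code warns once and is skipped.
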