Message ID | 20220825080802.259528-1-jarkko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] x86/sgx: Do not consider unsanitized pages an error | expand |
On 8/25/22 01:08, Jarkko Sakkinen wrote: > However, if the SGX subsystem initialization is retracted, the sanitization > process could end up in the middle, and sgx_dirty_page_list be left > non-empty for legit reasons. What does "retraction" mean in this context?
Hi Jarkko, On Thu, 25 Aug 2022 03:08:02 -0500, Jarkko Sakkinen <jarkko@kernel.org> wrote: > If sgx_dirty_page_list ends up being non-empty, currently this triggers > WARN_ON(), which produces a lot of noise, and can potentially crash the > kernel, depending on the kernel command line. > > However, if the SGX subsystem initialization is retracted, the > sanitization > process could end up in the middle, and sgx_dirty_page_list be left > non-empty for legit reasons. > > Replace this faulty behavior with more verbose version > __sgx_sanitize_pages(), which can optionally print EREMOVE error code and > the number of unsanitized pages. > > Link: > https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u > Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> > Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with > sgx_dirty_page_list") > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> > > Cc: Haitao Huang <haitao.huang@linux.intel.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Reinette Chatre <reinette.chatre@intel.com> > --- > v3: > - Remove WARN_ON(). > - Tuned comments and the commit message a bit. > > v2: > - Replaced WARN_ON() with optional pr_info() inside > __sgx_sanitize_pages(). > - Rewrote the commit message. > - Added the fixes tag. > --- > arch/x86/kernel/cpu/sgx/main.c | 20 +++++++++++++------- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/main.c > b/arch/x86/kernel/cpu/sgx/main.c > index 515e2a5f25bb..d204520a5e26 100644 > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -50,16 +50,17 @@ static LIST_HEAD(sgx_dirty_page_list); > * from the input list, and made available for the page allocator. SECS > pages > * prepending their children in the input list are left intact. > */ > -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > +static void __sgx_sanitize_pages(struct list_head *dirty_page_list, > bool verbose) > { > struct sgx_epc_page *page; > + int dirty_count = 0; > LIST_HEAD(dirty); > int ret; > /* dirty_page_list is thread-local, no need for a lock: */ Just a nitpick, Although it is not added in this patch, the above comment is not accurate. The list is accessed one thread only: filled first in main thread, then only ever accessed here. IIUC, could you remove or update that comment? Other than that, FWIW: Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com> Thanks Haitao
On Thu, Aug 25, 2022 at 07:07:44AM -0700, Dave Hansen wrote: > On 8/25/22 01:08, Jarkko Sakkinen wrote: > > However, if the SGX subsystem initialization is retracted, the sanitization > > process could end up in the middle, and sgx_dirty_page_list be left > > non-empty for legit reasons. > > What does "retraction" mean in this context? Rest of the initialization failing or features not detected (-ENODEV). BR, Jarkko
On 8/25/22 11:27, Jarkko Sakkinen wrote: > On Thu, Aug 25, 2022 at 07:07:44AM -0700, Dave Hansen wrote: >> On 8/25/22 01:08, Jarkko Sakkinen wrote: >>> However, if the SGX subsystem initialization is retracted, the sanitization >>> process could end up in the middle, and sgx_dirty_page_list be left >>> non-empty for legit reasons. >> What does "retraction" mean in this context? > Rest of the initialization failing or features not detected (-ENODEV). Can you please work on communicating better descriptions of the problems? This really isn't good enough. I think you're talking about sgx_init(). It launches ksgxd from sgx_page_reclaimer_init() which sets about sanitizing the 'dirty_page_list'. After launching ksgxd, if later actions in sgx_init() (misc_register(), sgx_drv_init(), sgx_vepc_init()) fail, ksgxd will be stopped prematurely. This will leave pages in 'sgx_dirty_page_list' after __sgx_sanitize_pages() has completed, which results in a WARN_ON(). The WARN_ON() is really only valid when __sgx_sanitize_pages() runs to completion *and* fails to empty 'sgx_dirty_page_list'. Is that it? If so, could you please give the changelog another go?
On Thu, Aug 25, 2022 at 09:57:11AM -0500, Haitao Huang wrote: > Hi Jarkko, > > On Thu, 25 Aug 2022 03:08:02 -0500, Jarkko Sakkinen <jarkko@kernel.org> > wrote: > > > If sgx_dirty_page_list ends up being non-empty, currently this triggers > > WARN_ON(), which produces a lot of noise, and can potentially crash the > > kernel, depending on the kernel command line. > > > > However, if the SGX subsystem initialization is retracted, the > > sanitization > > process could end up in the middle, and sgx_dirty_page_list be left > > non-empty for legit reasons. > > > > Replace this faulty behavior with more verbose version > > __sgx_sanitize_pages(), which can optionally print EREMOVE error code and > > the number of unsanitized pages. > > > > Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u > > Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> > > Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with > > sgx_dirty_page_list") > > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> > > > > Cc: Haitao Huang <haitao.huang@linux.intel.com> > > Cc: Dave Hansen <dave.hansen@linux.intel.com> > > Cc: Reinette Chatre <reinette.chatre@intel.com> > > --- > > v3: > > - Remove WARN_ON(). > > - Tuned comments and the commit message a bit. > > > > v2: > > - Replaced WARN_ON() with optional pr_info() inside > > __sgx_sanitize_pages(). > > - Rewrote the commit message. > > - Added the fixes tag. > > --- > > arch/x86/kernel/cpu/sgx/main.c | 20 +++++++++++++------- > > 1 file changed, 13 insertions(+), 7 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/sgx/main.c > > b/arch/x86/kernel/cpu/sgx/main.c > > index 515e2a5f25bb..d204520a5e26 100644 > > --- a/arch/x86/kernel/cpu/sgx/main.c > > +++ b/arch/x86/kernel/cpu/sgx/main.c > > @@ -50,16 +50,17 @@ static LIST_HEAD(sgx_dirty_page_list); > > * from the input list, and made available for the page allocator. SECS > > pages > > * prepending their children in the input list are left intact. > > */ > > -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > +static void __sgx_sanitize_pages(struct list_head *dirty_page_list, > > bool verbose) > > { > > struct sgx_epc_page *page; > > + int dirty_count = 0; > > LIST_HEAD(dirty); > > int ret; > > /* dirty_page_list is thread-local, no need for a lock: */ > > Just a nitpick, > Although it is not added in this patch, the above comment is not accurate. > The list is accessed one thread only: filled first in main thread, then > only ever accessed here. > > IIUC, could you remove or update that comment? Well, if we cut hairs here, it's actually expectation for the caller, right? :-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index d204520a5e26..c6f416307812 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -49,6 +49,8 @@ static LIST_HEAD(sgx_dirty_page_list); * Reset post-kexec EPC pages to the uninitialized state. The pages are removed * from the input list, and made available for the page allocator. SECS pages * prepending their children in the input list are left intact. + * + * @dirty_page_list must be thread-local, i.e. no need for a lock. */ static void __sgx_sanitize_pages(struct list_head *dirty_page_list, bool verbose) { @@ -57,7 +59,6 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list, bool verbose LIST_HEAD(dirty); int ret; - /* dirty_page_list is thread-local, no need for a lock: */ while (!list_empty(dirty_page_list)) { if (kthread_should_stop()) break; > > Other than that, FWIW: > Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com> > Thanks > Haitao Thank you. BR, Jarkko
On 8/25/22 01:08, Jarkko Sakkinen wrote: > + /* Can happen, when the initialization is retracted: */ > + if (verbose && dirty_count > 0) > + pr_info("%d unsanitized pages\n", dirty_count); > } > > static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) > @@ -394,11 +403,8 @@ static int ksgxd(void *p) > * Sanitize pages in order to recover from kexec(). The 2nd pass is > * required for SECS pages, whose child pages blocked EREMOVE. > */ > - __sgx_sanitize_pages(&sgx_dirty_page_list); > - __sgx_sanitize_pages(&sgx_dirty_page_list); > - > - /* sanity check: */ > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > + __sgx_sanitize_pages(&sgx_dirty_page_list, false); > + __sgx_sanitize_pages(&sgx_dirty_page_list, true); This is backwards, IMNHO. Make __sgx_sanitize_pages() return the number of pages that it leaves dirty. __sgx_sanitize_pages(&sgx_dirty_page_list) left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); if (left_dirty) pr_warn(...); That rids us of the mystery true/false and puts the pr_warn() in a place that makes logical sense. Then, let's either *not* do the pr_err_ratelimited(EREMOVE_ERROR_MESSAGE, ret, ret); at all, or make it an unconditional pr_warn_ratelimited(). They're not going to be common and multiple messages are virtually worthless anyway. I actually think a common tracepoint, or out-of-line ENCLS/ENCLU functions that can be easily ftraced are a much better idea than a one-off pr_whatever().
On Thu, Aug 25, 2022 at 11:38:00AM -0700, Dave Hansen wrote: > On 8/25/22 11:27, Jarkko Sakkinen wrote: > > On Thu, Aug 25, 2022 at 07:07:44AM -0700, Dave Hansen wrote: > >> On 8/25/22 01:08, Jarkko Sakkinen wrote: > >>> However, if the SGX subsystem initialization is retracted, the sanitization > >>> process could end up in the middle, and sgx_dirty_page_list be left > >>> non-empty for legit reasons. > >> What does "retraction" mean in this context? > > Rest of the initialization failing or features not detected (-ENODEV). > > Can you please work on communicating better descriptions of the > problems? This really isn't good enough. Sure, I can put more detail into this patch. If you speak in general about commit messages, picking the correct granularity is somewhat easy to fail because different people have different expectations on that. If denoted, I'm happy to write more detailed description, if the original is not granular enough. > I think you're talking about sgx_init(). It launches ksgxd from > sgx_page_reclaimer_init() which sets about sanitizing the > 'dirty_page_list'. After launching ksgxd, if later actions in > sgx_init() (misc_register(), sgx_drv_init(), sgx_vepc_init()) fail, > ksgxd will be stopped prematurely. It's a bit more complicated, as either sgx_drv_init() or sgx_vepc_init() can fail without premature end for ksgxd. So the exact conditions for premature stop are: "In sgx_init(), if misc_register() for the provision device fails, and neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped." > This will leave pages in 'sgx_dirty_page_list' after > __sgx_sanitize_pages() has completed, which results in a WARN_ON(). > > The WARN_ON() is really only valid when __sgx_sanitize_pages() runs to > completion *and* fails to empty 'sgx_dirty_page_list'. This is correct. > Is that it? Just thinking if pr_warn() should be used if running to the completion and failing to empty the list. A bit more information to the klog on conditions, and not much extra complexity. What do you think? > If so, could you please give the changelog another go? BR, Jarkko
On Thu, Aug 25, 2022 at 11:51:18AM -0700, Dave Hansen wrote: > On 8/25/22 01:08, Jarkko Sakkinen wrote: > > + /* Can happen, when the initialization is retracted: */ > > + if (verbose && dirty_count > 0) > > + pr_info("%d unsanitized pages\n", dirty_count); > > } > > > > static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) > > @@ -394,11 +403,8 @@ static int ksgxd(void *p) > > * Sanitize pages in order to recover from kexec(). The 2nd pass is > > * required for SECS pages, whose child pages blocked EREMOVE. > > */ > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > - > > - /* sanity check: */ > > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > > + __sgx_sanitize_pages(&sgx_dirty_page_list, false); > > + __sgx_sanitize_pages(&sgx_dirty_page_list, true); > > This is backwards, IMNHO. > > Make __sgx_sanitize_pages() return the number of pages that it leaves > dirty. > > __sgx_sanitize_pages(&sgx_dirty_page_list) > left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); > if (left_dirty) > pr_warn(...); I like this and my patch has already the counter in place so why not. > That rids us of the mystery true/false and puts the pr_warn() in a place > that makes logical sense. Then, let's either *not* do the > > pr_err_ratelimited(EREMOVE_ERROR_MESSAGE, ret, ret); > > at all, or make it an unconditional pr_warn_ratelimited(). They're not > going to be common and multiple messages are virtually worthless anyway. > > I actually think a common tracepoint, or out-of-line ENCLS/ENCLU > functions that can be easily ftraced are a much better idea than a > one-off pr_whatever(). I like the tracepoint idea more than out-of-line ENCLS/ENCLU because out-of-line is more "intrusive" change to the code semantics than a tracepoint. BR, Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 515e2a5f25bb..d204520a5e26 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -50,16 +50,17 @@ static LIST_HEAD(sgx_dirty_page_list); * from the input list, and made available for the page allocator. SECS pages * prepending their children in the input list are left intact. */ -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) +static void __sgx_sanitize_pages(struct list_head *dirty_page_list, bool verbose) { struct sgx_epc_page *page; + int dirty_count = 0; LIST_HEAD(dirty); int ret; /* dirty_page_list is thread-local, no need for a lock: */ while (!list_empty(dirty_page_list)) { if (kthread_should_stop()) - return; + break; page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); @@ -90,14 +91,22 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) list_del(&page->list); sgx_free_epc_page(page); } else { + if (verbose) + pr_err_ratelimited(EREMOVE_ERROR_MESSAGE, ret, ret); + /* The page is not yet clean - move to the dirty list. */ list_move_tail(&page->list, &dirty); + dirty_count++; } cond_resched(); } list_splice(&dirty, dirty_page_list); + + /* Can happen, when the initialization is retracted: */ + if (verbose && dirty_count > 0) + pr_info("%d unsanitized pages\n", dirty_count); } static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) @@ -394,11 +403,8 @@ static int ksgxd(void *p) * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */ - __sgx_sanitize_pages(&sgx_dirty_page_list); - __sgx_sanitize_pages(&sgx_dirty_page_list); - - /* sanity check: */ - WARN_ON(!list_empty(&sgx_dirty_page_list)); + __sgx_sanitize_pages(&sgx_dirty_page_list, false); + __sgx_sanitize_pages(&sgx_dirty_page_list, true); while (!kthread_should_stop()) { if (try_to_freeze())
If sgx_dirty_page_list ends up being non-empty, currently this triggers WARN_ON(), which produces a lot of noise, and can potentially crash the kernel, depending on the kernel command line. However, if the SGX subsystem initialization is retracted, the sanitization process could end up in the middle, and sgx_dirty_page_list be left non-empty for legit reasons. Replace this faulty behavior with more verbose version __sgx_sanitize_pages(), which can optionally print EREMOVE error code and the number of unsanitized pages. Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Cc: Haitao Huang <haitao.huang@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Reinette Chatre <reinette.chatre@intel.com> --- v3: - Remove WARN_ON(). - Tuned comments and the commit message a bit. v2: - Replaced WARN_ON() with optional pr_info() inside __sgx_sanitize_pages(). - Rewrote the commit message. - Added the fixes tag. --- arch/x86/kernel/cpu/sgx/main.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-)