Message ID | 20220830031206.13449-2-jarkko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86/sgx: Test and fixes | expand |
Hi Jarkko, On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > In sgx_init(), if misc_register() for the provision device fails, and > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > prematurely stopped. I do not think misc_register() is required to fail for the scenario to be triggered (rather use "or" than "and"?). Perhaps just "In sgx_init(), if a failure is encountered after ksgxd is started (via sgx_page_reclaimer_init()) ...". To help the reader understand the subject of this patch it may help to explain that prematurely stopping ksgxd may leave some unsanitized pages, but that is not a problem since SGX cannot be used on the platform anyway. > This triggers WARN_ON() because sgx_dirty_page_list ends up being > non-empty, and dumps the call stack: > Traces like below can be frowned upon. I recommend that you follow the guidance in "Backtraces in commit mesages"(sic) in Documentation/process/submitting-patches.rst. > [ 0.000000] Linux version 6.0.0-rc2 (root@4beb429beb4a) (gcc (Debian > 11.3.0-3) 11.3.0, GNU ld (GNU Binutils for Debian) 2.38) #382 SMP > PREEMPT_DYNAMIC Fri Aug 26 12:52:15 UTC 2022 > [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.0.0-rc2 > root=UUID=56f398e0-1e25-4fda-aa9f-611dece4b333 ro quiet > module_blacklist=psmouse initcall_debug log_buf_len=4M cryptomgr.notests > […] > [ 0.268089] calling sgx_init+0x0/0x409 @ 1 > [ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff > [ 0.268591] ------------[ cut here ]------------ > [ 0.268592] WARNING: CPU: 6 PID: 83 at > arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 > [ 0.268598] Modules linked in: > [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 > [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 > 07/06/2022 > [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 > [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f > 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff > ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 > [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 > [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: > 0000000000000000 > [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: > 00000000ffffffff > [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: > ffff8dcd820a4180 > [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: > ffffb6c74006bce0 > [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: > 0000000000000000 > [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) > knlGS:0000000000000000 > [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: > 00000000003706e0 > [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [ 0.268622] Call Trace: > [ 0.268624] <TASK> > [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 > [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 > [ 0.268634] ? __kthread_parkme+0x36/0x90 > [ 0.268637] kthread+0xe5/0x110 > [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 > [ 0.268642] ret_from_fork+0x1f/0x30 > [ 0.268647] </TASK> > [ 0.268648] ---[ end trace 0000000000000000 ]--- > [ 0.268694] initcall sgx_init+0x0/0x409 returned -19 after 603 usecs > > Ultimately this can crash the kernel, if the following is set: > > /proc/sys/kernel/panic_on_warn > > Print a simple warning instead, and improve the output by printing the > number of unsanitized pages, in order to provide debug informnation for > future needs. informnation -> information ... > Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u > Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> > Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> > Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Should this go to stable? > > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > index 515e2a5f25bb..903100fcfce3 100644 > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); > * Reset post-kexec EPC pages to the uninitialized state. The pages are removed > * from the input list, and made available for the page allocator. SECS pages > * prepending their children in the input list are left intact. > + * > + * Contents of the @dirty_page_list must be thread-local, i.e. > + * not shared by multiple threads. Did you intend to mention something about the needed locking here? It looks like some information is lost during the move to the function description. > */ > -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > +static int __sgx_sanitize_pages(struct list_head *dirty_page_list) > { > struct sgx_epc_page *page; > + int left_dirty = 0; I do not know how many pages this code should be ready for but at least this could handle more by being an unsigned int considering that it is always positive ... maybe even unsigned long? > LIST_HEAD(dirty); > int ret; > > - /* dirty_page_list is thread-local, no need for a lock: */ > while (!list_empty(dirty_page_list)) { > if (kthread_should_stop()) > - return; > + break; > > page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); > > @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > } else { > /* The page is not yet clean - move to the dirty list. */ > list_move_tail(&page->list, &dirty); > + left_dirty++; > } > > cond_resched(); > } > > list_splice(&dirty, dirty_page_list); > + return left_dirty; > } > > static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) > @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) > > static int ksgxd(void *p) > { > + int left_dirty; > + > set_freezable(); > > /* > @@ -395,10 +402,10 @@ static int ksgxd(void *p) > * required for SECS pages, whose child pages blocked EREMOVE. > */ > __sgx_sanitize_pages(&sgx_dirty_page_list); > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > - /* sanity check: */ > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); > + if (left_dirty) > + pr_warn("%d unsanitized pages\n", left_dirty); > > while (!kthread_should_stop()) { > if (try_to_freeze()) Reinette
On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > Hi Jarkko, > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > In sgx_init(), if misc_register() for the provision device fails, and > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > prematurely stopped. > > I do not think misc_register() is required to fail for the scenario to > be triggered (rather use "or" than "and"?). Perhaps just > "In sgx_init(), if a failure is encountered after ksgxd is started > (via sgx_page_reclaimer_init()) ...". IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can be initialized successfully, sgx_init() still returns 0. Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end of sgx_init(), after we make sure at least one of the driver and the KVM SGX is initialized successfully. Then the code change in this patch won't be necessary if I understand correctly. AFAICT there's no good reason to start the ksgxd at early stage before we are sure either the driver or KVM SGX will work. Btw currently EPC pages assigned to KVM guest cannot be reclaimed, so theoretically ksgxd can be moved to sgx_drv_init(), but who knows someday we will decide to make KVM guest EPC pages to be able to be reclaimed. :)
On Tue, Aug 30, 2022 at 03:54:27PM -0700, Reinette Chatre wrote: > Hi Jarkko, > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > In sgx_init(), if misc_register() for the provision device fails, and > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > prematurely stopped. > > I do not think misc_register() is required to fail for the scenario to > be triggered (rather use "or" than "and"?). Perhaps just > "In sgx_init(), if a failure is encountered after ksgxd is started > (via sgx_page_reclaimer_init()) ...". This would be the fixed version of the sentence: " In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle. " I want to keep the end states listed and not make it more abstract. The second sentence addresses the remark below. > To help the reader understand the subject of this patch it may help > to explain that prematurely stopping ksgxd may leave some > unsanitized pages, but that is not a problem since SGX cannot > be used on the platform anyway. > > > This triggers WARN_ON() because sgx_dirty_page_list ends up being > > non-empty, and dumps the call stack: > > > > Traces like below can be frowned upon. I recommend that you follow the > guidance in "Backtraces in commit mesages"(sic) in > Documentation/process/submitting-patches.rst. > > > [ 0.268592] WARNING: CPU: 6 PID: 83 at > > arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 Is this good enough? I had not actually spotted this section before but nice that it exists. Apparently has been added in 5.12. >> > > > Ultimately this can crash the kernel, if the following is set: > > > > /proc/sys/kernel/panic_on_warn > > > > Print a simple warning instead, and improve the output by printing the > > number of unsanitized pages, in order to provide debug informnation for > > future needs. > > informnation -> information +1 > > > ... > > > Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u > > Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> > > Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> > > Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") > > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> > > Should this go to stable? I guess it should. The hard reason for this that it can panic the kernel. > > > > > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > > index 515e2a5f25bb..903100fcfce3 100644 > > --- a/arch/x86/kernel/cpu/sgx/main.c > > +++ b/arch/x86/kernel/cpu/sgx/main.c > > @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); > > * Reset post-kexec EPC pages to the uninitialized state. The pages are removed > > * from the input list, and made available for the page allocator. SECS pages > > * prepending their children in the input list are left intact. > > + * > > + * Contents of the @dirty_page_list must be thread-local, i.e. > > + * not shared by multiple threads. > > Did you intend to mention something about the needed locking here? It looks > like some information is lost during the move to the function description. Nothing about the locking that concerns the parameter, as the sentence defines clear constraints for the caller. > > > */ > > -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > +static int __sgx_sanitize_pages(struct list_head *dirty_page_list) > > { > > struct sgx_epc_page *page; > > + int left_dirty = 0; > > I do not know how many pages this code should be ready for but at least > this could handle more by being an unsigned int considering that it is > always positive ... maybe even unsigned long? I would go for 'long'. More information below. > > > LIST_HEAD(dirty); > > int ret; > > > > - /* dirty_page_list is thread-local, no need for a lock: */ > > while (!list_empty(dirty_page_list)) { > > if (kthread_should_stop()) > > - return; > > + break; > > > > page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); > > > > @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > } else { > > /* The page is not yet clean - move to the dirty list. */ > > list_move_tail(&page->list, &dirty); > > + left_dirty++; > > } > > > > cond_resched(); > > } > > > > list_splice(&dirty, dirty_page_list); > > + return left_dirty; > > } > > > > static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) > > @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) > > > > static int ksgxd(void *p) > > { > > + int left_dirty; > > + > > set_freezable(); > > > > /* > > @@ -395,10 +402,10 @@ static int ksgxd(void *p) > > * required for SECS pages, whose child pages blocked EREMOVE. > > */ > > __sgx_sanitize_pages(&sgx_dirty_page_list); > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > > > - /* sanity check: */ > > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > > + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); > > + if (left_dirty) > > + pr_warn("%d unsanitized pages\n", left_dirty); > > > > while (!kthread_should_stop()) { > > if (try_to_freeze()) > > > Reinette We need to return -ECANCELED on premature stop, and number of pages otherwise. In premature stop, nothing should be printed, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore a non-zero number should be printed pr_err(), if that happens after the second call. Thanks for feedback. BR, Jarkko
On Wed, Aug 31, 2022 at 04:55:24AM +0300, Jarkko Sakkinen wrote: > On Tue, Aug 30, 2022 at 03:54:27PM -0700, Reinette Chatre wrote: > > Hi Jarkko, > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > In sgx_init(), if misc_register() for the provision device fails, and > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > prematurely stopped. > > > > I do not think misc_register() is required to fail for the scenario to > > be triggered (rather use "or" than "and"?). Perhaps just > > "In sgx_init(), if a failure is encountered after ksgxd is started > > (via sgx_page_reclaimer_init()) ...". > > This would be the fixed version of the sentence: > > " > In sgx_init(), if misc_register() fails or misc_register() succeeds but > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > prematurely stopped. This may leave some unsanitized pages, which does > not matter, because SGX will be disabled for the whole power cycle. > " > > I want to keep the end states listed and not make it more abstract. > > The second sentence addresses the remark below. > > > To help the reader understand the subject of this patch it may help > > to explain that prematurely stopping ksgxd may leave some > > unsanitized pages, but that is not a problem since SGX cannot > > be used on the platform anyway. > > > > > This triggers WARN_ON() because sgx_dirty_page_list ends up being > > > non-empty, and dumps the call stack: > > > > > > > Traces like below can be frowned upon. I recommend that you follow the > > guidance in "Backtraces in commit mesages"(sic) in > > Documentation/process/submitting-patches.rst. > > > > > [ 0.268592] WARNING: CPU: 6 PID: 83 at > > > arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 > > Is this good enough? I had not actually spotted this section before but > nice that it exists. Apparently has been added in 5.12. > > >> > > > > Ultimately this can crash the kernel, if the following is set: > > > > > > /proc/sys/kernel/panic_on_warn > > > > > > Print a simple warning instead, and improve the output by printing the > > > number of unsanitized pages, in order to provide debug informnation for > > > future needs. > > > > informnation -> information > > +1 > > > > > > > ... > > > > > Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u > > > Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> > > > Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> > > > Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") > > > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> > > > > Should this go to stable? > > I guess it should. The hard reason for this that it can panic > the kernel. > > > > > > > > > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > > > index 515e2a5f25bb..903100fcfce3 100644 > > > --- a/arch/x86/kernel/cpu/sgx/main.c > > > +++ b/arch/x86/kernel/cpu/sgx/main.c > > > @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); > > > * Reset post-kexec EPC pages to the uninitialized state. The pages are removed > > > * from the input list, and made available for the page allocator. SECS pages > > > * prepending their children in the input list are left intact. > > > + * > > > + * Contents of the @dirty_page_list must be thread-local, i.e. > > > + * not shared by multiple threads. > > > > Did you intend to mention something about the needed locking here? It looks > > like some information is lost during the move to the function description. > > Nothing about the locking that concerns the parameter, as the > sentence defines clear constraints for the caller. > > > > > > */ > > > -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > > +static int __sgx_sanitize_pages(struct list_head *dirty_page_list) > > > { > > > struct sgx_epc_page *page; > > > + int left_dirty = 0; > > > > I do not know how many pages this code should be ready for but at least > > this could handle more by being an unsigned int considering that it is > > always positive ... maybe even unsigned long? > > I would go for 'long'. More information below. > > > > > > LIST_HEAD(dirty); > > > int ret; > > > > > > - /* dirty_page_list is thread-local, no need for a lock: */ > > > while (!list_empty(dirty_page_list)) { > > > if (kthread_should_stop()) > > > - return; > > > + break; > > > > > > page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); > > > > > > @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > > } else { > > > /* The page is not yet clean - move to the dirty list. */ > > > list_move_tail(&page->list, &dirty); > > > + left_dirty++; > > > } > > > > > > cond_resched(); > > > } > > > > > > list_splice(&dirty, dirty_page_list); > > > + return left_dirty; > > > } > > > > > > static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) > > > @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) > > > > > > static int ksgxd(void *p) > > > { > > > + int left_dirty; > > > + > > > set_freezable(); > > > > > > /* > > > @@ -395,10 +402,10 @@ static int ksgxd(void *p) > > > * required for SECS pages, whose child pages blocked EREMOVE. > > > */ > > > __sgx_sanitize_pages(&sgx_dirty_page_list); > > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > > > > > - /* sanity check: */ > > > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > > > + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); > > > + if (left_dirty) > > > + pr_warn("%d unsanitized pages\n", left_dirty); > > > > > > while (!kthread_should_stop()) { > > > if (try_to_freeze()) > > > > > > Reinette > > We need to return -ECANCELED on premature stop, and number of > pages otherwise. > > In premature stop, nothing should be printed, as the number > is by practical means a random number. Otherwise, it is an > indicator of a bug in the driver, and therefore a non-zero > number should be printed pr_err(), if that happens after the > second call. I.e. even though we print less we get more *information* what is going inside the kernel. Warning is not correct for either path IMHO. BR, Jarkko
On Wed, Aug 31, 2022 at 04:58:58AM +0300, Jarkko Sakkinen wrote: > On Wed, Aug 31, 2022 at 04:55:24AM +0300, Jarkko Sakkinen wrote: > > On Tue, Aug 30, 2022 at 03:54:27PM -0700, Reinette Chatre wrote: > > > Hi Jarkko, > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > prematurely stopped. > > > > > > I do not think misc_register() is required to fail for the scenario to > > > be triggered (rather use "or" than "and"?). Perhaps just > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > (via sgx_page_reclaimer_init()) ...". > > > > This would be the fixed version of the sentence: > > > > " > > In sgx_init(), if misc_register() fails or misc_register() succeeds but > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > prematurely stopped. This may leave some unsanitized pages, which does > > not matter, because SGX will be disabled for the whole power cycle. > > " > > > > I want to keep the end states listed and not make it more abstract. > > > > The second sentence addresses the remark below. > > > > > To help the reader understand the subject of this patch it may help > > > to explain that prematurely stopping ksgxd may leave some > > > unsanitized pages, but that is not a problem since SGX cannot > > > be used on the platform anyway. > > > > > > > This triggers WARN_ON() because sgx_dirty_page_list ends up being > > > > non-empty, and dumps the call stack: > > > > > > > > > > Traces like below can be frowned upon. I recommend that you follow the > > > guidance in "Backtraces in commit mesages"(sic) in > > > Documentation/process/submitting-patches.rst. > > > > > > > [ 0.268592] WARNING: CPU: 6 PID: 83 at > > > > arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 > > > > Is this good enough? I had not actually spotted this section before but > > nice that it exists. Apparently has been added in 5.12. > > > > >> > > > > > Ultimately this can crash the kernel, if the following is set: > > > > > > > > /proc/sys/kernel/panic_on_warn > > > > > > > > Print a simple warning instead, and improve the output by printing the > > > > number of unsanitized pages, in order to provide debug informnation for > > > > future needs. > > > > > > informnation -> information > > > > +1 > > > > > > > > > > > ... > > > > > > > Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u > > > > Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> > > > > Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> > > > > Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") > > > > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> > > > > > > Should this go to stable? > > > > I guess it should. The hard reason for this that it can panic > > the kernel. > > > > > > > > > > > > > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > > > > index 515e2a5f25bb..903100fcfce3 100644 > > > > --- a/arch/x86/kernel/cpu/sgx/main.c > > > > +++ b/arch/x86/kernel/cpu/sgx/main.c > > > > @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); > > > > * Reset post-kexec EPC pages to the uninitialized state. The pages are removed > > > > * from the input list, and made available for the page allocator. SECS pages > > > > * prepending their children in the input list are left intact. > > > > + * > > > > + * Contents of the @dirty_page_list must be thread-local, i.e. > > > > + * not shared by multiple threads. > > > > > > Did you intend to mention something about the needed locking here? It looks > > > like some information is lost during the move to the function description. > > > > Nothing about the locking that concerns the parameter, as the > > sentence defines clear constraints for the caller. > > > > > > > > > */ > > > > -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > > > +static int __sgx_sanitize_pages(struct list_head *dirty_page_list) > > > > { > > > > struct sgx_epc_page *page; > > > > + int left_dirty = 0; > > > > > > I do not know how many pages this code should be ready for but at least > > > this could handle more by being an unsigned int considering that it is > > > always positive ... maybe even unsigned long? > > > > I would go for 'long'. More information below. > > > > > > > > > LIST_HEAD(dirty); > > > > int ret; > > > > > > > > - /* dirty_page_list is thread-local, no need for a lock: */ > > > > while (!list_empty(dirty_page_list)) { > > > > if (kthread_should_stop()) > > > > - return; > > > > + break; > > > > > > > > page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); > > > > > > > > @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) > > > > } else { > > > > /* The page is not yet clean - move to the dirty list. */ > > > > list_move_tail(&page->list, &dirty); > > > > + left_dirty++; > > > > } > > > > > > > > cond_resched(); > > > > } > > > > > > > > list_splice(&dirty, dirty_page_list); > > > > + return left_dirty; > > > > } > > > > > > > > static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) > > > > @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) > > > > > > > > static int ksgxd(void *p) > > > > { > > > > + int left_dirty; > > > > + > > > > set_freezable(); > > > > > > > > /* > > > > @@ -395,10 +402,10 @@ static int ksgxd(void *p) > > > > * required for SECS pages, whose child pages blocked EREMOVE. > > > > */ > > > > __sgx_sanitize_pages(&sgx_dirty_page_list); > > > > - __sgx_sanitize_pages(&sgx_dirty_page_list); > > > > > > > > - /* sanity check: */ > > > > - WARN_ON(!list_empty(&sgx_dirty_page_list)); > > > > + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); > > > > + if (left_dirty) > > > > + pr_warn("%d unsanitized pages\n", left_dirty); > > > > > > > > while (!kthread_should_stop()) { > > > > if (try_to_freeze()) > > > > > > > > > Reinette > > > > We need to return -ECANCELED on premature stop, and number of > > pages otherwise. > > > > In premature stop, nothing should be printed, as the number > > is by practical means a random number. Otherwise, it is an > > indicator of a bug in the driver, and therefore a non-zero > > number should be printed pr_err(), if that happens after the > > second call. > > I.e. even though we print less we get more *information* what > is going inside the kernel. Warning is not correct for either > path IMHO. Oh, sorry, I forgot one thing. The devices should be actually deinitialized in the error case. We do not want to leave a broken driver running. BR, Jarkko
On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > Hi Jarkko, > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > In sgx_init(), if misc_register() for the provision device fails, and > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > prematurely stopped. > > > > I do not think misc_register() is required to fail for the scenario to > > be triggered (rather use "or" than "and"?). Perhaps just > > "In sgx_init(), if a failure is encountered after ksgxd is started > > (via sgx_page_reclaimer_init()) ...". > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > be initialized successfully, sgx_init() still returns 0. > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > initialized successfully. Then the code change in this patch won't be necessary > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > early stage before we are sure either the driver or KVM SGX will work. I would focus fixing the existing flow rather than reinventing the flow. It can be made to work, and therefore it is IMHO correct action to take. > Btw currently EPC pages assigned to KVM guest cannot be reclaimed, so > theoretically ksgxd can be moved to sgx_drv_init(), but who knows someday we > will decide to make KVM guest EPC pages to be able to be reclaimed. :) I'm open to changes but it is in my opinion out of context for this. > > > > -- > Thanks, > -Kai > > BR, Jarkko
On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > Hi Jarkko, > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > prematurely stopped. > > > > > > I do not think misc_register() is required to fail for the scenario to > > > be triggered (rather use "or" than "and"?). Perhaps just > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > (via sgx_page_reclaimer_init()) ...". > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > be initialized successfully, sgx_init() still returns 0. > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > initialized successfully. Then the code change in this patch won't be necessary > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > early stage before we are sure either the driver or KVM SGX will work. > > I would focus fixing the existing flow rather than reinventing the flow. > > It can be made to work, and therefore it is IMHO correct action to take. From another perspective, the *existing flow* is the reason which causes this bug. A real fix is to fix the flow itself. > > > Btw currently EPC pages assigned to KVM guest cannot be reclaimed, so > > theoretically ksgxd can be moved to sgx_drv_init(), but who knows someday we > > will decide to make KVM guest EPC pages to be able to be reclaimed. :) > > I'm open to changes but it is in my opinion out of context for this. > > Yeah. I was expressing the reason I suggested to move sgx_page_reclaimer_init() to the end of sgx_init(), but not to sgx_drv_init(). But moving to sgx_drv_init() also makes sense to me given KVM guest EPC pages are not reclaimable now. For now there's no reason to run ksgxd() if only virtual EPC driver is enabled. We can move sgx_page_reclaimer_init() out of sgx_drv_init() when we add KVM guest EPC page reclaiming support (if it happens in the future).
On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > Hi Jarkko, > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > prematurely stopped. > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > be initialized successfully, sgx_init() still returns 0. > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > initialized successfully. Then the code change in this patch won't be necessary > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > early stage before we are sure either the driver or KVM SGX will work. > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > It can be made to work, and therefore it is IMHO correct action to take. > > From another perspective, the *existing flow* is the reason which causes this > bug. A real fix is to fix the flow itself. Any existing flow in part of the kernel can have a bug. That does not mean that switching flow would be proper way to fix a bug. BR, Jarkko
On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > Hi Jarkko, > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > > prematurely stopped. > > > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > > initialized successfully. Then the code change in this patch won't be necessary > > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > > early stage before we are sure either the driver or KVM SGX will work. > > > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > > > It can be made to work, and therefore it is IMHO correct action to take. > > > > From another perspective, the *existing flow* is the reason which causes this > > bug. A real fix is to fix the flow itself. > > Any existing flow in part of the kernel can have a bug. That > does not mean that switching flow would be proper way to fix > a bug. > > BR, Jarkko Yes but I think this is only true when the flow is reasonable. If the flow itself isn't reasonable, we should fix the flow (given it's easy to fix AFAICT). Anyway, let us also hear from others.
On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > > Hi Jarkko, > > > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > > > prematurely stopped. > > > > > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > > > initialized successfully. Then the code change in this patch won't be necessary > > > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > > > early stage before we are sure either the driver or KVM SGX will work. > > > > > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > > > > > It can be made to work, and therefore it is IMHO correct action to take. > > > > > > From another perspective, the *existing flow* is the reason which causes this > > > bug. A real fix is to fix the flow itself. > > > > Any existing flow in part of the kernel can have a bug. That > > does not mean that switching flow would be proper way to fix > > a bug. > > > > BR, Jarkko > > Yes but I think this is only true when the flow is reasonable. If the flow > itself isn't reasonable, we should fix the flow (given it's easy to fix AFAICT). > > Anyway, let us also hear from others. The flow can be made to work without issues, which in the context of a bug fix is exactly what a bug fix should do. Not more or less. You don't gain any measurable value for the user with this switch idea. BR, Jarkko
On Wed, Aug 31, 2022 at 05:57:22AM +0300, jarkko@kernel.org wrote: > On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: > > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > > > Hi Jarkko, > > > > > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > > > > prematurely stopped. > > > > > > > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > > > > initialized successfully. Then the code change in this patch won't be necessary > > > > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > > > > early stage before we are sure either the driver or KVM SGX will work. > > > > > > > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > > > > > > > It can be made to work, and therefore it is IMHO correct action to take. > > > > > > > > From another perspective, the *existing flow* is the reason which causes this > > > > bug. A real fix is to fix the flow itself. > > > > > > Any existing flow in part of the kernel can have a bug. That > > > does not mean that switching flow would be proper way to fix > > > a bug. > > > > > > BR, Jarkko > > > > Yes but I think this is only true when the flow is reasonable. If the flow > > itself isn't reasonable, we should fix the flow (given it's easy to fix AFAICT). > > > > Anyway, let us also hear from others. > > The flow can be made to work without issues, which in the > context of a bug fix is exactly what a bug fix should do. > Not more or less. > > You don't gain any measurable value for the user with this > switch idea. And besides this not proper way to review patch anyway because you did not review the code. I'll focus on fix what is broken e.g. so that it is easy to backport to stable and distro kernels, and call it a day. It certainly does not have to make code "perfect", as long as known bugs are sorted out. You are welcome to review the next version of the patch, once I've resolved the issues that were pointed out by Reinette, if you still see some issue but this type of speculative discussion is frankly just wasting everyones time. (need to check my mutt config, do not know why it is not always putting real name to from field) BR, Jarkko
On Wed, 2022-08-31 at 05:57 +0300, jarkko@kernel.org wrote: > On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: > > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > > > Hi Jarkko, > > > > > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > > > > prematurely stopped. > > > > > > > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > > > > initialized successfully. Then the code change in this patch won't be necessary > > > > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > > > > early stage before we are sure either the driver or KVM SGX will work. > > > > > > > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > > > > > > > It can be made to work, and therefore it is IMHO correct action to take. > > > > > > > > From another perspective, the *existing flow* is the reason which causes this > > > > bug. A real fix is to fix the flow itself. > > > > > > Any existing flow in part of the kernel can have a bug. That > > > does not mean that switching flow would be proper way to fix > > > a bug. > > > > > > BR, Jarkko > > > > Yes but I think this is only true when the flow is reasonable. If the flow > > itself isn't reasonable, we should fix the flow (given it's easy to fix AFAICT). > > > > Anyway, let us also hear from others. > > The flow can be made to work without issues, which in the > context of a bug fix is exactly what a bug fix should do. > Not more or less. No. To me the flow itself is buggy. There's no reason to start ksgxd() before at least SGX driver is initialized to work. Patching the buggy flow is more like a workaround, but isn't a real fix. > > You don't gain any measurable value for the user with this > switch idea. There is actual gain by moving sgx_page_reclaimer_init() to sgx_drv_init(), or only calling sgx_page_reclaimer_init() when sgx_drv_init() returns success: If somehow sgx_drv_init() fails to initialize, ksgxd() won't run. Currently, if SGX driver fails to initialize but virtual EPC initializes successfully, ksgxd() still runs. However it achieves nothing but only wastes CPU cycles.
On Wed, 2022-08-31 at 06:10 +0300, Jarkko Sakkinen wrote: > On Wed, Aug 31, 2022 at 05:57:22AM +0300, jarkko@kernel.org wrote: > > On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: > > > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > > > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > > > > Hi Jarkko, > > > > > > > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > > > > > prematurely stopped. > > > > > > > > > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > > > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > > > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > > > > > initialized successfully. Then the code change in this patch won't be necessary > > > > > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > > > > > early stage before we are sure either the driver or KVM SGX will work. > > > > > > > > > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > > > > > > > > > It can be made to work, and therefore it is IMHO correct action to take. > > > > > > > > > > From another perspective, the *existing flow* is the reason which causes this > > > > > bug. A real fix is to fix the flow itself. > > > > > > > > Any existing flow in part of the kernel can have a bug. That > > > > does not mean that switching flow would be proper way to fix > > > > a bug. > > > > > > > > BR, Jarkko > > > > > > Yes but I think this is only true when the flow is reasonable. If the flow > > > itself isn't reasonable, we should fix the flow (given it's easy to fix AFAICT). > > > > > > Anyway, let us also hear from others. > > > > The flow can be made to work without issues, which in the > > context of a bug fix is exactly what a bug fix should do. > > Not more or less. > > > > You don't gain any measurable value for the user with this > > switch idea. > > And besides this not proper way to review patch anyway because you did > not review the code. > I did review the code, but I couldn't agree on the fix. That's why I expressed my view here. > I'll focus on fix what is broken e.g. so that it > is easy to backport to stable and distro kernels, and call it a day. > It certainly does not have to make code "perfect", as long as known > bugs are sorted out. Why cannot the fix which fixes the flow go to stable? > > You are welcome to review the next version of the patch, once I've > resolved the issues that were pointed out by Reinette, if you still > see some issue but this type of speculative discussion is frankly just > wasting everyones time. Hmm.. Why pointing out a better fix (my perspective of course) is wasting everyone's time?
On Wed, Aug 31, 2022 at 03:28:20AM +0000, Huang, Kai wrote: > On Wed, 2022-08-31 at 06:10 +0300, Jarkko Sakkinen wrote: > > On Wed, Aug 31, 2022 at 05:57:22AM +0300, jarkko@kernel.org wrote: > > > On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: > > > > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > > > > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > > > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > > > > > Hi Jarkko, > > > > > > > > > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > > > > > In sgx_init(), if misc_register() for the provision device fails, and > > > > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > > > > > > > > > > prematurely stopped. > > > > > > > > > > > > > > > > > > I do not think misc_register() is required to fail for the scenario to > > > > > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > > > > > "In sgx_init(), if a failure is encountered after ksgxd is started > > > > > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > > > > > > > > > IMHO "a failure" might be too vague. For instance, failure to sgx_drv_init() > > > > > > > > won't immediately result in ksgxd to stop prematurally. As long as KVM SGX can > > > > > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > > > > > > > > > Btw I was thinking whether we should move sgx_page_reclaimer_init() to the end > > > > > > > > of sgx_init(), after we make sure at least one of the driver and the KVM SGX is > > > > > > > > initialized successfully. Then the code change in this patch won't be necessary > > > > > > > > if I understand correctly. AFAICT there's no good reason to start the ksgxd at > > > > > > > > early stage before we are sure either the driver or KVM SGX will work. > > > > > > > > > > > > > > I would focus fixing the existing flow rather than reinventing the flow. > > > > > > > > > > > > > > It can be made to work, and therefore it is IMHO correct action to take. > > > > > > > > > > > > From another perspective, the *existing flow* is the reason which causes this > > > > > > bug. A real fix is to fix the flow itself. > > > > > > > > > > Any existing flow in part of the kernel can have a bug. That > > > > > does not mean that switching flow would be proper way to fix > > > > > a bug. > > > > > > > > > > BR, Jarkko > > > > > > > > Yes but I think this is only true when the flow is reasonable. If the flow > > > > itself isn't reasonable, we should fix the flow (given it's easy to fix AFAICT). > > > > > > > > Anyway, let us also hear from others. > > > > > > The flow can be made to work without issues, which in the > > > context of a bug fix is exactly what a bug fix should do. > > > Not more or less. > > > > > > You don't gain any measurable value for the user with this > > > switch idea. > > > > And besides this not proper way to review patch anyway because you did > > not review the code. > > > > I did review the code, but I couldn't agree on the fix. That's why I expressed > my view here. > > > > I'll focus on fix what is broken e.g. so that it > > is easy to backport to stable and distro kernels, and call it a day. > > It certainly does not have to make code "perfect", as long as known > > bugs are sorted out. > > Why cannot the fix which fixes the flow go to stable? > > > > > You are welcome to review the next version of the patch, once I've > > resolved the issues that were pointed out by Reinette, if you still > > see some issue but this type of speculative discussion is frankly just > > wasting everyones time. > > Hmm.. Why pointing out a better fix (my perspective of course) is wasting > everyone's time? There was not a single inline comment. BR, Jarkko
Hi Kai On Tue, 30 Aug 2022 22:17:08 -0500, Huang, Kai <kai.huang@intel.com> wrote: > On Wed, 2022-08-31 at 05:57 +0300, jarkko@kernel.org wrote: >> On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: >> > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: >> > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: >> > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: >> > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: >> > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: >> > > > > > > Hi Jarkko, >> > > > > > > >> > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: >> > > > > > > > In sgx_init(), if misc_register() for the provision >> device fails, and >> > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then >> ksgxd will be >> > > > > > > > prematurely stopped. >> > > > > > > >> > > > > > > I do not think misc_register() is required to fail for the >> scenario to >> > > > > > > be triggered (rather use "or" than "and"?). Perhaps just >> > > > > > > "In sgx_init(), if a failure is encountered after ksgxd is >> started >> > > > > > > (via sgx_page_reclaimer_init()) ...". >> > > > > > >> > > > > > IMHO "a failure" might be too vague. For instance, failure >> to sgx_drv_init() >> > > > > > won't immediately result in ksgxd to stop prematurally. As >> long as KVM SGX can >> > > > > > be initialized successfully, sgx_init() still returns 0. >> > > > > > >> > > > > > Btw I was thinking whether we should move >> sgx_page_reclaimer_init() to the end >> > > > > > of sgx_init(), after we make sure at least one of the driver >> and the KVM SGX is >> > > > > > initialized successfully. Then the code change in this patch >> won't be necessary >> > > > > > if I understand correctly. AFAICT there's no good reason to >> start the ksgxd at >> > > > > > early stage before we are sure either the driver or KVM SGX >> will work. >> > > > > >> > > > > I would focus fixing the existing flow rather than reinventing >> the flow. >> > > > > >> > > > > It can be made to work, and therefore it is IMHO correct action >> to take. >> > > > >> > > > From another perspective, the *existing flow* is the reason which >> causes this >> > > > bug. A real fix is to fix the flow itself. >> > > >> > > Any existing flow in part of the kernel can have a bug. That >> > > does not mean that switching flow would be proper way to fix >> > > a bug. >> > > >> > > BR, Jarkko >> > >> > Yes but I think this is only true when the flow is reasonable. If >> the flow >> > itself isn't reasonable, we should fix the flow (given it's easy to >> fix AFAICT). >> > >> > Anyway, let us also hear from others. >> >> The flow can be made to work without issues, which in the >> context of a bug fix is exactly what a bug fix should do. >> Not more or less. > > No. To me the flow itself is buggy. There's no reason to start ksgxd() > before > at least SGX driver is initialized to work. > Will it cause racing if we expose dev nodes to user space before ksgxd is started and sensitization done? > Patching the buggy flow is more like a workaround, but isn't a real fix. > >> >> You don't gain any measurable value for the user with this >> switch idea. > > There is actual gain by moving sgx_page_reclaimer_init() to > sgx_drv_init(), or > only calling sgx_page_reclaimer_init() when sgx_drv_init() returns > success: > > If somehow sgx_drv_init() fails to initialize, ksgxd() won't run. > > Currently, if SGX driver fails to initialize but virtual EPC initializes > successfully, ksgxd() still runs. However it achieves nothing but only > wastes > CPU cycles. > > You still need ksgxd for sanitizing (at least) and swapping (potentially) even if only virtual EPC initializes. Thanks Haitao
Hi Jarkko, On 8/30/2022 6:55 PM, Jarkko Sakkinen wrote: > On Tue, Aug 30, 2022 at 03:54:27PM -0700, Reinette Chatre wrote: >> Hi Jarkko, >> >> On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: >>> In sgx_init(), if misc_register() for the provision device fails, and >>> neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be >>> prematurely stopped. >> >> I do not think misc_register() is required to fail for the scenario to >> be triggered (rather use "or" than "and"?). Perhaps just >> "In sgx_init(), if a failure is encountered after ksgxd is started >> (via sgx_page_reclaimer_init()) ...". > > This would be the fixed version of the sentence: > > " > In sgx_init(), if misc_register() fails or misc_register() succeeds but > neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be > prematurely stopped. This may leave some unsanitized pages, which does > not matter, because SGX will be disabled for the whole power cycle. > " > > I want to keep the end states listed and not make it more abstract. > > The second sentence addresses the remark below. Thank you for capturing this. It sounds good. > >> To help the reader understand the subject of this patch it may help >> to explain that prematurely stopping ksgxd may leave some >> unsanitized pages, but that is not a problem since SGX cannot >> be used on the platform anyway. >> >>> This triggers WARN_ON() because sgx_dirty_page_list ends up being >>> non-empty, and dumps the call stack: >>> >> >> Traces like below can be frowned upon. I recommend that you follow the >> guidance in "Backtraces in commit mesages"(sic) in >> Documentation/process/submitting-patches.rst. >> >>> [ 0.268592] WARNING: CPU: 6 PID: 83 at >>> arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 > > Is this good enough? I had not actually spotted this section before but > nice that it exists. Apparently has been added in 5.12. The timestamp should be removed, it is among the things listed as "distracting information". In this case the backtrace and registers within the trace are not adding value but I think it is important to mention the kernel version somewhere for folks to be able to interpret the line number provided. Yet I see you kept the whole trace in V2 ? >>> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c >>> index 515e2a5f25bb..903100fcfce3 100644 >>> --- a/arch/x86/kernel/cpu/sgx/main.c >>> +++ b/arch/x86/kernel/cpu/sgx/main.c >>> @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); >>> * Reset post-kexec EPC pages to the uninitialized state. The pages are removed >>> * from the input list, and made available for the page allocator. SECS pages >>> * prepending their children in the input list are left intact. >>> + * >>> + * Contents of the @dirty_page_list must be thread-local, i.e. >>> + * not shared by multiple threads. >> >> Did you intend to mention something about the needed locking here? It looks >> like some information is lost during the move to the function description. > > Nothing about the locking that concerns the parameter, as the > sentence defines clear constraints for the caller. ok > >> >>> */ >>> -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) >>> +static int __sgx_sanitize_pages(struct list_head *dirty_page_list) >>> { >>> struct sgx_epc_page *page; >>> + int left_dirty = 0; >> >> I do not know how many pages this code should be ready for but at least >> this could handle more by being an unsigned int considering that it is >> always positive ... maybe even unsigned long? > > I would go for 'long'. More information below. > >> >>> LIST_HEAD(dirty); >>> int ret; >>> >>> - /* dirty_page_list is thread-local, no need for a lock: */ >>> while (!list_empty(dirty_page_list)) { >>> if (kthread_should_stop()) >>> - return; >>> + break; >>> >>> page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); >>> >>> @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) >>> } else { >>> /* The page is not yet clean - move to the dirty list. */ >>> list_move_tail(&page->list, &dirty); >>> + left_dirty++; >>> } >>> >>> cond_resched(); >>> } >>> >>> list_splice(&dirty, dirty_page_list); >>> + return left_dirty; >>> } >>> >>> static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) >>> @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) >>> >>> static int ksgxd(void *p) >>> { >>> + int left_dirty; >>> + >>> set_freezable(); >>> >>> /* >>> @@ -395,10 +402,10 @@ static int ksgxd(void *p) >>> * required for SECS pages, whose child pages blocked EREMOVE. >>> */ >>> __sgx_sanitize_pages(&sgx_dirty_page_list); >>> - __sgx_sanitize_pages(&sgx_dirty_page_list); >>> >>> - /* sanity check: */ >>> - WARN_ON(!list_empty(&sgx_dirty_page_list)); >>> + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); >>> + if (left_dirty) >>> + pr_warn("%d unsanitized pages\n", left_dirty); >>> >>> while (!kthread_should_stop()) { >>> if (try_to_freeze()) >> >> >> Reinette > > We need to return -ECANCELED on premature stop, and number of > pages otherwise. > > In premature stop, nothing should be printed, as the number > is by practical means a random number. Otherwise, it is an > indicator of a bug in the driver, and therefore a non-zero > number should be printed pr_err(), if that happens after the > second call. > Good point. > Oh, sorry, I forgot one thing. The devices should be actually > deinitialized in the error case. We do not want to leave a > broken driver running. Is this not already done on the error path of sgx_init()? Reinette
On Wed, Aug 31, 2022 at 10:18:00AM -0500, Haitao Huang wrote: > Hi Kai > On Tue, 30 Aug 2022 22:17:08 -0500, Huang, Kai <kai.huang@intel.com> wrote: > > > On Wed, 2022-08-31 at 05:57 +0300, jarkko@kernel.org wrote: > > > On Wed, Aug 31, 2022 at 02:55:52AM +0000, Huang, Kai wrote: > > > > On Wed, 2022-08-31 at 05:44 +0300, jarkko@kernel.org wrote: > > > > > On Wed, Aug 31, 2022 at 02:35:53AM +0000, Huang, Kai wrote: > > > > > > On Wed, 2022-08-31 at 05:15 +0300, jarkko@kernel.org wrote: > > > > > > > On Wed, Aug 31, 2022 at 01:27:58AM +0000, Huang, Kai wrote: > > > > > > > > On Tue, 2022-08-30 at 15:54 -0700, Reinette Chatre wrote: > > > > > > > > > Hi Jarkko, > > > > > > > > > > > > > > > > > > On 8/29/2022 8:12 PM, Jarkko Sakkinen wrote: > > > > > > > > > > In sgx_init(), if misc_register() for the provision > > > device fails, and > > > > > > > > > > neither sgx_drv_init() nor sgx_vepc_init() succeeds, > > > then ksgxd will be > > > > > > > > > > prematurely stopped. > > > > > > > > > > > > > > > > > > I do not think misc_register() is required to fail for > > > the scenario to > > > > > > > > > be triggered (rather use "or" than "and"?). Perhaps just > > > > > > > > > "In sgx_init(), if a failure is encountered after ksgxd > > > is started > > > > > > > > > (via sgx_page_reclaimer_init()) ...". > > > > > > > > > > > > > > > > IMHO "a failure" might be too vague. For instance, > > > failure to sgx_drv_init() > > > > > > > > won't immediately result in ksgxd to stop prematurally. > > > As long as KVM SGX can > > > > > > > > be initialized successfully, sgx_init() still returns 0. > > > > > > > > > > > > > > > > Btw I was thinking whether we should move > > > sgx_page_reclaimer_init() to the end > > > > > > > > of sgx_init(), after we make sure at least one of the > > > driver and the KVM SGX is > > > > > > > > initialized successfully. Then the code change in this > > > patch won't be necessary > > > > > > > > if I understand correctly. AFAICT there's no good reason > > > to start the ksgxd at > > > > > > > > early stage before we are sure either the driver or KVM > > > SGX will work. > > > > > > > > > > > > > > I would focus fixing the existing flow rather than > > > reinventing the flow. > > > > > > > > > > > > > > It can be made to work, and therefore it is IMHO correct > > > action to take. > > > > > > > > > > > > From another perspective, the *existing flow* is the reason > > > which causes this > > > > > > bug. A real fix is to fix the flow itself. > > > > > > > > > > Any existing flow in part of the kernel can have a bug. That > > > > > does not mean that switching flow would be proper way to fix > > > > > a bug. > > > > > > > > > > BR, Jarkko > > > > > > > > Yes but I think this is only true when the flow is reasonable. If > > > the flow > > > > itself isn't reasonable, we should fix the flow (given it's easy > > > to fix AFAICT). > > > > > > > > Anyway, let us also hear from others. > > > > > > The flow can be made to work without issues, which in the > > > context of a bug fix is exactly what a bug fix should do. > > > Not more or less. > > > > No. To me the flow itself is buggy. There's no reason to start ksgxd() > > before > > at least SGX driver is initialized to work. > > > > Will it cause racing if we expose dev nodes to user space before > ksgxd is started and sensitization done? I'll to explain this. So the point is to fix the issue at hand, and fix it locally. Changing initialization order is simply out of context. It's not really an argument for or against changing it We are fixing sanitization here, and only that with zero side-effects to any other semantics. It's dictated by the development process [*] but more importantly it's also just plain common sense. [*] https://www.kernel.org/doc/html/v5.19/process/submitting-patches.html#separate-your-changes BR, Jarkko
Jarkko, Kai and Haitao, Can you three please start trimming your replies? You don't need to and should not quote the entirety of your messages every time you reply. On 8/31/22 11:28, jarkko@kernel.org wrote: >> Will it cause racing if we expose dev nodes to user space before >> ksgxd is started and sensitization done? > I'll to explain this. > > So the point is to fix the issue at hand, and fix it locally. > > Changing initialization order is simply out of context. It's > not really an argument for or against changing it > > We are fixing sanitization here, and only that with zero > side-effects to any other semantics. > > It's dictated by the development process [*] but more > importantly it's also just plain common sense. Kai, I think your suggestion is reasonable. You make a good point about not needing ksgxd for vepc. *But*, I think it's a bit too much for a bugfix that's headed to -stable. I'm concerned that it will have unintended side effects, *especially* when there's a working, tested alternative.
On Wed, Aug 31, 2022 at 11:35:10AM -0700, Dave Hansen wrote: > Jarkko, Kai and Haitao, > > Can you three please start trimming your replies? You don't need to and > should not quote the entirety of your messages every time you reply. > > On 8/31/22 11:28, jarkko@kernel.org wrote: > >> Will it cause racing if we expose dev nodes to user space before > >> ksgxd is started and sensitization done? > > I'll to explain this. > > > > So the point is to fix the issue at hand, and fix it locally. > > > > Changing initialization order is simply out of context. It's > > not really an argument for or against changing it > > > > We are fixing sanitization here, and only that with zero > > side-effects to any other semantics. > > > > It's dictated by the development process [*] but more > > importantly it's also just plain common sense. > > Kai, I think your suggestion is reasonable. You make a good point about > not needing ksgxd for vepc. > > *But*, I think it's a bit too much for a bugfix that's headed to > -stable. I'm concerned that it will have unintended side effects, > *especially* when there's a working, tested alternative. Yeah, I also actually *do* agree that the suggestions could be reasonable. BR, Jarkko
On Wed, Aug 31, 2022 at 11:35:10AM -0700, Dave Hansen wrote: > Jarkko, Kai and Haitao, > > Can you three please start trimming your replies? You don't need to and > should not quote the entirety of your messages every time you reply. Sure, sorry about that. BR, Jarkko
On Wed, 2022-08-31 at 11:35 -0700, Dave Hansen wrote: > Jarkko, Kai and Haitao, > > Can you three please start trimming your replies? You don't need to and > should not quote the entirety of your messages every time you reply. > > On 8/31/22 11:28, jarkko@kernel.org wrote: > > > Will it cause racing if we expose dev nodes to user space before > > > ksgxd is started and sensitization done? > > I'll to explain this. > > > > So the point is to fix the issue at hand, and fix it locally. > > > > Changing initialization order is simply out of context. It's > > not really an argument for or against changing it > > > > We are fixing sanitization here, and only that with zero > > side-effects to any other semantics. > > > > It's dictated by the development process [*] but more > > importantly it's also just plain common sense. > > Kai, I think your suggestion is reasonable. You make a good point about > not needing ksgxd for vepc. > > *But*, I think it's a bit too much for a bugfix that's headed to > -stable. I'm concerned that it will have unintended side effects, > *especially* when there's a working, tested alternative. Agreed. Thanks Dave/Jarkko.
On Wed, Aug 31, 2022 at 08:42:59PM +0000, Huang, Kai wrote: > On Wed, 2022-08-31 at 11:35 -0700, Dave Hansen wrote: > > Jarkko, Kai and Haitao, > > > > Can you three please start trimming your replies? You don't need to and > > should not quote the entirety of your messages every time you reply. > > > > On 8/31/22 11:28, jarkko@kernel.org wrote: > > > > Will it cause racing if we expose dev nodes to user space before > > > > ksgxd is started and sensitization done? > > > I'll to explain this. > > > > > > So the point is to fix the issue at hand, and fix it locally. > > > > > > Changing initialization order is simply out of context. It's > > > not really an argument for or against changing it > > > > > > We are fixing sanitization here, and only that with zero > > > side-effects to any other semantics. > > > > > > It's dictated by the development process [*] but more > > > importantly it's also just plain common sense. > > > > Kai, I think your suggestion is reasonable. You make a good point about > > not needing ksgxd for vepc. > > > > *But*, I think it's a bit too much for a bugfix that's headed to > > -stable. I'm concerned that it will have unintended side effects, > > *especially* when there's a working, tested alternative. > > Agreed. Thanks Dave/Jarkko. Please do a patch. It's a very reasonable suggestion when considered out of context of this bug. If you go really rigid with this, the compilation process should not compile in sanitization process in the case when only vepc is enabled. It's useless functionality in that case. BR, Jarkko
On Fri, 2022-09-02 at 01:27 +0300, jarkko@kernel.org wrote: > On Wed, Aug 31, 2022 at 08:42:59PM +0000, Huang, Kai wrote: > > On Wed, 2022-08-31 at 11:35 -0700, Dave Hansen wrote: > > > Jarkko, Kai and Haitao, > > > > > > Can you three please start trimming your replies? You don't need to and > > > should not quote the entirety of your messages every time you reply. > > > > > > On 8/31/22 11:28, jarkko@kernel.org wrote: > > > > > Will it cause racing if we expose dev nodes to user space before > > > > > ksgxd is started and sensitization done? > > > > I'll to explain this. > > > > > > > > So the point is to fix the issue at hand, and fix it locally. > > > > > > > > Changing initialization order is simply out of context. It's > > > > not really an argument for or against changing it > > > > > > > > We are fixing sanitization here, and only that with zero > > > > side-effects to any other semantics. > > > > > > > > It's dictated by the development process [*] but more > > > > importantly it's also just plain common sense. > > > > > > Kai, I think your suggestion is reasonable. You make a good point about > > > not needing ksgxd for vepc. > > > > > > *But*, I think it's a bit too much for a bugfix that's headed to > > > -stable. I'm concerned that it will have unintended side effects, > > > *especially* when there's a working, tested alternative. > > > > Agreed. Thanks Dave/Jarkko. > > Please do a patch. It's a very reasonable suggestion when > considered out of context of this bug. > > If you go really rigid with this, the compilation process > should not compile in sanitization process in the case when > only vepc is enabled. It's useless functionality in that > case. > > BR, Jarkko Yeah I am planning to work out one to see how it goes.
On Thu, Sep 01, 2022 at 10:41:54PM +0000, Huang, Kai wrote: > On Fri, 2022-09-02 at 01:27 +0300, jarkko@kernel.org wrote: > > On Wed, Aug 31, 2022 at 08:42:59PM +0000, Huang, Kai wrote: > > > On Wed, 2022-08-31 at 11:35 -0700, Dave Hansen wrote: > > > > Jarkko, Kai and Haitao, > > > > > > > > Can you three please start trimming your replies? You don't need to and > > > > should not quote the entirety of your messages every time you reply. > > > > > > > > On 8/31/22 11:28, jarkko@kernel.org wrote: > > > > > > Will it cause racing if we expose dev nodes to user space before > > > > > > ksgxd is started and sensitization done? > > > > > I'll to explain this. > > > > > > > > > > So the point is to fix the issue at hand, and fix it locally. > > > > > > > > > > Changing initialization order is simply out of context. It's > > > > > not really an argument for or against changing it > > > > > > > > > > We are fixing sanitization here, and only that with zero > > > > > side-effects to any other semantics. > > > > > > > > > > It's dictated by the development process [*] but more > > > > > importantly it's also just plain common sense. > > > > > > > > Kai, I think your suggestion is reasonable. You make a good point about > > > > not needing ksgxd for vepc. > > > > > > > > *But*, I think it's a bit too much for a bugfix that's headed to > > > > -stable. I'm concerned that it will have unintended side effects, > > > > *especially* when there's a working, tested alternative. > > > > > > Agreed. Thanks Dave/Jarkko. > > > > Please do a patch. It's a very reasonable suggestion when > > considered out of context of this bug. > > > > If you go really rigid with this, the compilation process > > should not compile in sanitization process in the case when > > only vepc is enabled. It's useless functionality in that > > case. > > > > BR, Jarkko > > Yeah I am planning to work out one to see how it goes. Looking forward to try it out :-) BR, Jarkko
> > > > Yeah I am planning to work out one to see how it goes. > > Looking forward to try it out :-) Just let you know I'll work on it perhaps late next week or the week after, since I have other higher priority tasks to do. :) Thanks, -Kai
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 515e2a5f25bb..903100fcfce3 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); * Reset post-kexec EPC pages to the uninitialized state. The pages are removed * from the input list, and made available for the page allocator. SECS pages * prepending their children in the input list are left intact. + * + * Contents of the @dirty_page_list must be thread-local, i.e. + * not shared by multiple threads. */ -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) +static int __sgx_sanitize_pages(struct list_head *dirty_page_list) { struct sgx_epc_page *page; + int left_dirty = 0; LIST_HEAD(dirty); int ret; - /* dirty_page_list is thread-local, no need for a lock: */ while (!list_empty(dirty_page_list)) { if (kthread_should_stop()) - return; + break; page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) } else { /* The page is not yet clean - move to the dirty list. */ list_move_tail(&page->list, &dirty); + left_dirty++; } cond_resched(); } list_splice(&dirty, dirty_page_list); + return left_dirty; } static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) @@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) { + int left_dirty; + set_freezable(); /* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list); - __sgx_sanitize_pages(&sgx_dirty_page_list); - /* sanity check: */ - WARN_ON(!list_empty(&sgx_dirty_page_list)); + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); + if (left_dirty) + pr_warn("%d unsanitized pages\n", left_dirty); while (!kthread_should_stop()) { if (try_to_freeze())