Message ID | 20201211113230.28909-1-jarkko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | x86/sgx: Synchronize encl->srcu in sgx_encl_release(). |
On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> Each sgx_mmun_notifier_release() starts a grace period, which means that

Should be sgx_mmu_notifier_release(), here and in the comment.

> one extra synchronize_rcu() in sgx_encl_release(). Add it there.
>
> sgx_release() has the loop that drains the list but with bad luck the
> entry is already gone from the list before that loop processes it.

Why not include the actual analysis that "proves" the bug? The splat that
Haitao reported would also be useful info.

> Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reported-by: Sean Christopherson <seanjc@google.com>

Haitao reported the bug, and for all intents and purposes provided the fix. I
just did the analysis to verify that there was a legitimate bug and that the
synchronization in sgx_encl_release() was indeed necessary.

> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index ee50a5010277..48539a6ee315 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -438,6 +438,13 @@ void sgx_encl_release(struct kref *ref)
>  	if (encl->backing)
>  		fput(encl->backing);
>
> +	/*
> +	 * Each sgx_mmun_notifier_release() starts a grace period. Thus one
> +	 * "extra" synchronize_rcu() is required here. This can go undetected by
> +	 * sgx_release() when it drains the mm list.
> +	 */
> +	synchronize_srcu(&encl->srcu);
> +
>  	cleanup_srcu_struct(&encl->srcu);
>
>  	WARN_ON_ONCE(!list_empty(&encl->mm_list));
> --
> 2.27.0
>
On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> > Each sgx_mmun_notifier_release() starts a grace period, which means that
>
> Should be sgx_mmu_notifier_release(), here and in the comment.

Thanks.

> > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> >
> > sgx_release() has the loop that drains the list but with bad luck the
> > entry is already gone from the list before that loop processes it.
>
> Why not include the actual analysis that "proves" the bug? The splat that
> Haitao reported would also be useful info.

True. I can include a snippet of dmesg to the commit message.

> > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Reported-by: Sean Christopherson <seanjc@google.com>
>
> Haitao reported the bug, and for all intents and purposes provided the fix. I
> just did the analysis to verify that there was a legitimate bug and that the
> synchronization in sgx_encl_release() was indeed necessary.

Good and valid point. The way I see it, the tags should be:

Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>

Haitao pointed out the bug but from your analysis I could resolve that
this is the fix to implement, and was able to write the long
description for the commit.

Does this make sense to you?

/Jarkko
On Tue, Dec 15, 2020 at 07:56:01AM +0200, Jarkko Sakkinen wrote:
> On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> > > Each sgx_mmun_notifier_release() starts a grace period, which means that
> >
> > Should be sgx_mmu_notifier_release(), here and in the comment.
>
> Thanks.
>
> > > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> > >
> > > sgx_release() has the loop that drains the list but with bad luck the
> > > entry is already gone from the list before that loop processes it.
> >
> > Why not include the actual analysis that "proves" the bug? The splat that
> > Haitao reported would also be useful info.
>
> True. I can include a snippet of dmesg to the commit message.
>
> > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > > Cc: Borislav Petkov <bp@alien8.de>
> > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > > Reported-by: Sean Christopherson <seanjc@google.com>
> >
> > Haitao reported the bug, and for all intents and purposes provided the fix. I
> > just did the analysis to verify that there was a legitimate bug and that the
> > synchronization in sgx_encl_release() was indeed necessary.
>
> Good and valid point. The way I see it, the tags should be:
>
> Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
>
> Haitao pointed out the bug but from your analysis I could resolve that
> this is the fix to implement, and was able to write the long
> description for the commit.
>
> Does this make sense to you?

I'm sending v2 next week (this week on vacation).

/Jarkko
On Mon, 14 Dec 2020 23:59:55 -0600, Jarkko Sakkinen <jarkko@kernel.org> wrote:

> On Tue, Dec 15, 2020 at 07:56:01AM +0200, Jarkko Sakkinen wrote:
>> On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
>> > On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
>> > > Each sgx_mmun_notifier_release() starts a grace period, which means that
>> >
>> > Should be sgx_mmu_notifier_release(), here and in the comment.
>>
>> Thanks.
>>
>> > > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
>> > >
>> > > sgx_release() has the loop that drains the list but with bad luck the
>> > > entry is already gone from the list before that loop processes it.
>> >
>> > Why not include the actual analysis that "proves" the bug? The splat that
>> > Haitao reported would also be useful info.
>>
>> True. I can include a snippet of dmesg to the commit message.
>>
>> > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
>> > > Cc: Borislav Petkov <bp@alien8.de>
>> > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> > > Reported-by: Sean Christopherson <seanjc@google.com>
>> >
>> > Haitao reported the bug, and for all intents and purposes provided the fix. I
>> > just did the analysis to verify that there was a legitimate bug and that the
>> > synchronization in sgx_encl_release() was indeed necessary.
>>
>> Good and valid point. The way I see it, the tags should be:
>>
>> Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>>
>> Haitao pointed out the bug but from your analysis I could resolve that
>> this is the fix to implement, and was able to write the long
>> description for the commit.
>>
>> Does this make sense to you?
>
> I'm sending v2 next week (this week on vacation).
>
> /Jarkko

I don't mind either how tags are assigned. But our testing reveals
significant latency introduced in scenarios of heavy loading/unloading
enclaves. synchronize_srcu_expedited fixed the issue. Please analyze and
confirm if that's more appropriate than synchronize_srcu here.
On Tue, Dec 15, 2020 at 11:34:37AM -0600, Haitao Huang wrote:
> On Mon, 14 Dec 2020 23:59:55 -0600, Jarkko Sakkinen <jarkko@kernel.org>
> wrote:
>
> > On Tue, Dec 15, 2020 at 07:56:01AM +0200, Jarkko Sakkinen wrote:
> > > On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > > > On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> > > > > Each sgx_mmun_notifier_release() starts a grace period, which means that
> > > >
> > > > Should be sgx_mmu_notifier_release(), here and in the comment.
> > >
> > > Thanks.
> > >
> > > > > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> > > > >
> > > > > sgx_release() has the loop that drains the list but with bad luck the
> > > > > entry is already gone from the list before that loop processes it.
> > > >
> > > > Why not include the actual analysis that "proves" the bug? The splat that
> > > > Haitao reported would also be useful info.
> > >
> > > True. I can include a snippet of dmesg to the commit message.
> > >
> > > > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > > > > Cc: Borislav Petkov <bp@alien8.de>
> > > > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > Reported-by: Sean Christopherson <seanjc@google.com>
> > > >
> > > > Haitao reported the bug, and for all intents and purposes provided the fix. I
> > > > just did the analysis to verify that there was a legitimate bug and that the
> > > > synchronization in sgx_encl_release() was indeed necessary.
> > >
> > > Good and valid point. The way I see it, the tags should be:
> > >
> > > Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > >
> > > Haitao pointed out the bug but from your analysis I could resolve that
> > > this is the fix to implement, and was able to write the long
> > > description for the commit.
> > >
> > > Does this make sense to you?
> >
> > I'm sending v2 next week (this week on vacation).
> >
> > /Jarkko
>
> I don't mind either how tags are assigned. But our testing reveals
> significant latency introduced in scenarios of heavy loading/unloading
> enclaves. synchronize_srcu_expedited fixed the issue. Please analyze and
> confirm if that's more appropriate than synchronize_srcu here.

I don't see any obvious reason why *_expedited could not be used here,
as most of the time sync's are taken care of sgx_release() loop, and
the final sync is with sgx_mmu_notifier_release(). More aggressive
spinning should not do any harm here.

About the tags. I just try to get them right, and it is sometimes not
straight-forward. So I guess, with all things considered, I'll put
suggested-by from you. Once I get a refined patch out, try it out with
your workloads and provide me tested-by, if it is working for you.

/Jarkko
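For readers following the synchronize_srcu() versus synchronize_srcu_expedited() question above, the ordering under discussion can be sketched in userspace. This is only an illustration under stated assumptions: the two srcu functions here are hypothetical stubs that record call order, not the kernel implementations, and `struct sgx_encl_model` and `sgx_encl_release_model()` are made-up names standing in for the real enclave structure and release path.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct srcu_struct. */
struct srcu_struct {
	int cleaned_up;
};

/* Records the order in which the stubs below were called. */
static const char *call_log[4];
static int call_count;

/* Stub only: the real function waits for an SRCU grace period. */
static void synchronize_srcu_expedited(struct srcu_struct *ssp)
{
	/* Waiting on an already cleaned-up srcu_struct would be a bug. */
	assert(!ssp->cleaned_up);
	call_log[call_count++] = "synchronize_srcu_expedited";
}

/* Stub only: the real function tears down the srcu_struct. */
static void cleanup_srcu_struct(struct srcu_struct *ssp)
{
	ssp->cleaned_up = 1;
	call_log[call_count++] = "cleanup_srcu_struct";
}

/* Hypothetical model of the enclave; only the srcu member matters here. */
struct sgx_encl_model {
	struct srcu_struct srcu;
};

/* Models the proposed ordering: wait for grace periods, then clean up. */
static void sgx_encl_release_model(struct sgx_encl_model *encl)
{
	synchronize_srcu_expedited(&encl->srcu);
	cleanup_srcu_struct(&encl->srcu);
}
```

The sketch says nothing about latency. It only shows that swapping in the expedited variant leaves the ordering relative to cleanup_srcu_struct() unchanged, which matches the observation above that there is no obvious reason it could not be used.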
On Tue, Dec 15, 2020, Jarkko Sakkinen wrote:
> On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > Haitao reported the bug, and for all intents and purposes provided the fix. I
> > just did the analysis to verify that there was a legitimate bug and that the
> > synchronization in sgx_encl_release() was indeed necessary.
>
> Good and valid point. The way I see it, the tags should be:
>
> Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
>
> Haitao pointed out the bug but from your analysis I could resolve that
> this is the fix to implement, and was able to write the long
> description for the commit.
>
> Does this make sense to you?

Yep, works for me.
On Tue, Dec 15, 2020 at 02:04:10PM -0800, Sean Christopherson wrote:
> On Tue, Dec 15, 2020, Jarkko Sakkinen wrote:
> > On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > > Haitao reported the bug, and for all intents and purposes provided the fix. I
> > > just did the analysis to verify that there was a legitimate bug and that the
> > > synchronization in sgx_encl_release() was indeed necessary.
> >
> > Good and valid point. The way I see it, the tags should be:
> >
> > Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> >
> > Haitao pointed out the bug but from your analysis I could resolve that
> > this is the fix to implement, and was able to write the long
> > description for the commit.
> >
> > Does this make sense to you?
>
> Yep, works for me.

I'll just add two suggested-by's. Process guide does not forbid that and
it best describes matters.

/Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index ee50a5010277..48539a6ee315 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -438,6 +438,13 @@ void sgx_encl_release(struct kref *ref)
 	if (encl->backing)
 		fput(encl->backing);
 
+	/*
+	 * Each sgx_mmun_notifier_release() starts a grace period. Thus one
+	 * "extra" synchronize_rcu() is required here. This can go undetected by
+	 * sgx_release() when it drains the mm list.
+	 */
+	synchronize_srcu(&encl->srcu);
+
 	cleanup_srcu_struct(&encl->srcu);
 
 	WARN_ON_ONCE(!list_empty(&encl->mm_list));
Each sgx_mmun_notifier_release() starts a grace period, which means that
one extra synchronize_rcu() in sgx_encl_release(). Add it there.

sgx_release() has the loop that drains the list but with bad luck the
entry is already gone from the list before that loop processes it.

Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +++++++
 1 file changed, 7 insertions(+)
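
The commit message describes the race only briefly, so here is one possible reading of it as a toy, single-threaded model. Everything in it is an assumption made for illustration: the `toy_*` names do not exist in the kernel, grace periods are reduced to a counter, and the interleaving is fixed by call order rather than by real concurrency. It is not the analysis the review asks for, just a way to see why a drain loop that finds the list already empty leaves a grace period outstanding at cleanup time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; none of these names exist in the kernel. */
struct toy_encl {
	int mm_entries;             /* entries still on the mm list */
	int pending_gp;             /* grace periods started, not yet finished */
	bool cleaned_while_pending; /* set if cleanup raced a grace period */
};

/* Models the notifier: unlink the entry, then start a grace period. */
static void toy_notifier_release(struct toy_encl *e)
{
	e->mm_entries--;
	e->pending_gp++;
}

/* Models the drain loop: it only synchronizes for entries it still sees. */
static void toy_drain_list(struct toy_encl *e)
{
	while (e->mm_entries > 0) {
		e->mm_entries--;
		e->pending_gp = 0; /* the per-entry sync waits out grace periods */
	}
}

/* Models waiting for every outstanding grace period. */
static void toy_synchronize(struct toy_encl *e)
{
	e->pending_gp = 0;
}

/* Models the release path, with or without the extra synchronization. */
static void toy_encl_release(struct toy_encl *e, bool extra_sync)
{
	if (extra_sync)
		toy_synchronize(e);
	/* Cleanup happens here; record whether a grace period was pending. */
	e->cleaned_while_pending = (e->pending_gp > 0);
}
```

In this model, calling toy_notifier_release() before toy_drain_list() empties the list, so the drain loop never synchronizes. Cleanup then runs with a grace period pending unless toy_encl_release() is called with extra_sync set, which is the role the patch gives to the added synchronize_srcu() call.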