Message ID: 1590097438-28829-1-git-send-email-igor.druzhinin@citrix.com
State: Superseded
Series: x86/svm: retry after unhandled NPT fault if gfn was marked for recalculation
On 21/05/2020 22:43, Igor Druzhinin wrote:
> If a recalculation NPT fault hasn't been handled explicitly in
> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> US bit has been re-instated in PTE and any real fault would be correctly
> re-raised next time.
>
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access. This type of fault isn't described
> explicitly in hvm_hap_nested_page_fault (this isn't called on
> EPT misconfig exit on Intel) which results in domain crash.
>
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---

Alternatively, I can re-raise the fault immediately after the recalculation
is done, which is less efficient (it will take one more VMEXIT) but safer
IMO - hvm_hap_nested_page_fault() might potentially leave the VM in an
inconsistent state in case of a real failure, and the second page fault
would conceal it.

Another alternative is to pass a fall_through bool into
hvm_hap_nested_page_fault() to give it an idea of the expected behavior in
that case and avoid guessing in the SVM code. I think that's an improvement
over the suggestion in v1 and a candidate for v2.

Igor
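A rough illustration of that second alternative (purely a hypothetical
sketch: the fall_through parameter, the simplified prototype and the
return-value convention below are assumptions for illustration, not the
actual Xen interface):

/*
 * Hypothetical sketch of the "fall_through" idea described above: the SVM
 * caller, which knows a recalculation fixup has just been performed, tells
 * the generic fault handler that an otherwise-unrecognised fault may simply
 * be retried instead of being treated as fatal.  Names, prototype and
 * return values are illustrative assumptions only.
 */
int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
                              struct npfec npfec, bool fall_through)
{
    /* ... existing handling of MMIO, log-dirty RAM, paged-out memory ... */

    if ( fall_through )
        return 1;  /* Nothing matched, but the caller says a retry is safe. */

    return 0;      /* Genuinely unhandled: the caller crashes the domain. */
}

/* SVM side (illustrative): pass "a recalc fixup just happened" as the flag. */
/* ret = hvm_hap_nested_page_fault(gpa, ~0ul, npfec, recalc_done); */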
On 21/05/2020 22:43, Igor Druzhinin wrote:
> If a recalculation NPT fault hasn't been handled explicitly in
> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> US bit has been re-instated in PTE and any real fault would be correctly
> re-raised next time.
>
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access. This type of fault isn't described
> explicitly in hvm_hap_nested_page_fault (this isn't called on
> EPT misconfig exit on Intel) which results in domain crash.
>
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 46a1aac..f0d0bd3 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>          /* inject #VMEXIT(NPF) into guest. */
>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>          return;
> +    case 0:
> +        /* If a recalculation page fault hasn't been handled - just retry. */
> +        if ( pfec & PFEC_user_mode )
> +            return;

This smells like it is a recipe for livelocks.

Everything should have been handled properly by the call to
p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().

It is legitimate for the MMIO mapping to end up being transiently
recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
fix it up suggests that the bug is there.

Do you have the complete NPT walk to the bad mapping? Do we have
_PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?

~Andrew
On 22/05/2020 10:45, Andrew Cooper wrote:
> On 21/05/2020 22:43, Igor Druzhinin wrote:
>> If a recalculation NPT fault hasn't been handled explicitly in
>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>> [...]
>
> This smells like it is a recipe for livelocks.
>
> Everything should have been handled properly by the call to
> p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().
>
> It is legitimate for the MMIO mapping to end up being transiently
> recalculated, but the fact that p2m_pt_handle_deferred_changes() doesn't
> fix it up suggests that the bug is there.
>
> Do you have the complete NPT walk to the bad mapping? Do we have
> _PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?

It does fix it up. The problem is that currently in SVM we enter
svm_do_nested_pgfault() immediately after p2m_pt_handle_deferred_changes()
is finished.

Yes, we don't have _PAGE_USER initially and, yes, it's fixed up correctly
in p2m_pt_handle_deferred_changes(), but svm_do_nested_pgfault() doesn't
know about it. Please read my second email about alternatives that would
resolve the issue you're worrying about.

Igor
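For reference, the flow being discussed - p2m_pt_handle_deferred_changes()
running first, with svm_do_nested_pgfault() entered regardless of whether a
fixup was made - looks roughly like the fragment below. This is a paraphrase
of the VMEXIT_NPF handling rather than a verbatim copy of svm.c, so treat
the exact condition and surrounding details as approximations.

    case VMEXIT_NPF:
        /*
         * Handle any deferred recalculation on the faulting GFN first.
         * A positive return value means entries were actually changed.
         */
        rc = vmcb->exitinfo1 & PFEC_page_present
             ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;

        if ( rc >= 0 )
            /*
             * Entered even when rc > 0, i.e. even when the fixup above has
             * already removed the reason for the fault - which is why the
             * error code may no longer describe a "real" remaining fault.
             */
            svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
        else
            /* Recalculation itself failed: nothing sane left to do. */
            domain_crash(v->domain);
        break;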
On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
> If a recalculation NPT fault hasn't been handled explicitly in
> hvm_hap_nested_page_fault() then it's potentially safe to retry -
> US bit has been re-instated in PTE and any real fault would be correctly
> re-raised next time.
>
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access. This type of fault isn't described
> explicitly in hvm_hap_nested_page_fault (this isn't called on
> EPT misconfig exit on Intel) which results in domain crash.

Couldn't direct MMIO regions be handled like other types of memory for
the purposes of logdirty mode?

I assume there's already a path here used for other memory types when
logdirty is turned on, and hence it would seem better to just make direct
MMIO regions also use that path?

> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
>  xen/arch/x86/hvm/svm/svm.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 46a1aac..f0d0bd3 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
>          /* inject #VMEXIT(NPF) into guest. */
>          nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
>          return;
> +    case 0:
> +        /* If a recalculation page fault hasn't been handled - just retry. */
> +        if ( pfec & PFEC_user_mode )
> +            return;

I'm slightly worried that this diverges from the EPT implementation
now, in the sense that returning 0 from hvm_hap_nested_page_fault will
no longer trigger a guest crash.

Thanks, Roger.
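The path already used for ordinary RAM once global log-dirty is enabled is,
roughly, the p2m_ram_logdirty handling in hvm_hap_nested_page_fault(): mark
the page dirty, flip its type back to p2m_ram_rw, and let the guest retry
the access. The sketch below is from memory and only approximate - helper
names and the exact conditions are not guaranteed to match current Xen.

    /*
     * Approximate sketch of the existing log-dirty handling for RAM in
     * hvm_hap_nested_page_fault(); names and details are approximations.
     */
    if ( npfec.write_access && p2mt == p2m_ram_logdirty )
    {
        paging_mark_pfn_dirty(currd, _pfn(gfn));

        /* Make the page writable again so the retried access succeeds. */
        p2m_change_type_one(currd, gfn, p2m_ram_logdirty, p2m_ram_rw);

        rc = 1;           /* Handled: return and let the guest retry. */
        goto out_put_gfn;
    }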
On 22/05/2020 11:08, Roger Pau Monné wrote:
> On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
>> If a recalculation NPT fault hasn't been handled explicitly in
>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>> [...]
>
> Couldn't direct MMIO regions be handled like other types of memory for
> the purposes of logdirty mode?
>
> I assume there's already a path here used for other memory types when
> logdirty is turned on, and hence it would seem better to just make direct
> MMIO regions also use that path?

The problem with handling only the MMIO case is that the issue still stays.
It will be hit with some other memory type since it's not MMIO specific.
The issue is that if a global recalculation is called, the next hit to
this type will cause a transient fault which will not be handled correctly
by either of our handlers even after the due fixup.

>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> [...]
>
> I'm slightly worried that this diverges from the EPT implementation
> now, in the sense that returning 0 from hvm_hap_nested_page_fault will
> no longer trigger a guest crash.

My second alternative from my follow-up email addresses this. I also
didn't like this aspect.

Igor
On 22/05/2020 11:05, Igor Druzhinin wrote:
> On 22/05/2020 10:45, Andrew Cooper wrote:
>> On 21/05/2020 22:43, Igor Druzhinin wrote:
>>> If a recalculation NPT fault hasn't been handled explicitly in
>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>>> [...]
>>
>> This smells like it is a recipe for livelocks.
>>
>> Everything should have been handled properly by the call to
>> p2m_pt_handle_deferred_changes() which precedes svm_do_nested_pgfault().
>> [...]
>>
>> Do you have the complete NPT walk to the bad mapping? Do we have
>> _PAGE_USER in the leaf mapping, or is this perhaps a spurious fault?
>
> It does fix it up. The problem is that currently in SVM we enter
> svm_do_nested_pgfault() immediately after p2m_pt_handle_deferred_changes()
> is finished.

Oh - so we do. I'd read the entry condition for svm_do_nested_pgfault()
incorrectly.

Jan - why did you choose to do it this way? If
p2m_pt_handle_deferred_changes() has made a modification, there is
surely nothing relevant to do in svm_do_nested_pgfault().

~Andrew
On Fri, May 22, 2020 at 11:14:24AM +0100, Igor Druzhinin wrote:
> On 22/05/2020 11:08, Roger Pau Monné wrote:
>> On Thu, May 21, 2020 at 10:43:58PM +0100, Igor Druzhinin wrote:
>>> If a recalculation NPT fault hasn't been handled explicitly in
>>> hvm_hap_nested_page_fault() then it's potentially safe to retry -
>>> [...]
>>
>> Couldn't direct MMIO regions be handled like other types of memory for
>> the purposes of logdirty mode?
>>
>> I assume there's already a path here used for other memory types when
>> logdirty is turned on, and hence it would seem better to just make direct
>> MMIO regions also use that path?
>
> The problem with handling only the MMIO case is that the issue still stays.
> It will be hit with some other memory type since it's not MMIO specific.
> The issue is that if a global recalculation is called, the next hit to
> this type will cause a transient fault which will not be handled correctly
> by either of our handlers even after the due fixup.

I admit I should go look at the code, but for example RAM p2m types
don't require this fix, so I assume there's some different path taken
in that case that avoids all this?

Ie: when global logdirty is enabled you will start to get nested page
faults for every access, yet only direct MMIO types require this fix?

Thanks, Roger.
On 22/05/2020 11:19, Andrew Cooper wrote:
> On 22/05/2020 11:05, Igor Druzhinin wrote:
>> It does fix it up. The problem is that currently in SVM we enter
>> svm_do_nested_pgfault() immediately after p2m_pt_handle_deferred_changes()
>> is finished.
>
> Oh - so we do. I'd read the entry condition for svm_do_nested_pgfault()
> incorrectly.
>
> Jan - why did you choose to do it this way? If
> p2m_pt_handle_deferred_changes() has made a modification, there is
> surely nothing relevant to do in svm_do_nested_pgfault().

In Jan's defense, that saves one additional VMEXIT in rare cases where the
fault had other implications (e.g. a write to an RO page in log-dirty mode)
in addition to recalculation.

Igor
On 22/05/2020 11:23, Roger Pau Monné wrote:
> On Fri, May 22, 2020 at 11:14:24AM +0100, Igor Druzhinin wrote:
>> On 22/05/2020 11:08, Roger Pau Monné wrote:
>>> Couldn't direct MMIO regions be handled like other types of memory for
>>> the purposes of logdirty mode?
>>> [...]
>>
>> The problem with handling only the MMIO case is that the issue still stays.
>> It will be hit with some other memory type since it's not MMIO specific.
>> The issue is that if a global recalculation is called, the next hit to
>> this type will cause a transient fault which will not be handled correctly
>> by either of our handlers even after the due fixup.
>
> I admit I should go look at the code, but for example RAM p2m types
> don't require this fix, so I assume there's some different path taken
> in that case that avoids all this?
>
> Ie: when global logdirty is enabled you will start to get nested page
> faults for every access, yet only direct MMIO types require this fix?

It's not "only MMIO" - it's just that an MMIO area is hit in my particular
case. I'd prefer this fix to address the general issue, otherwise for SVM
we would have to write handlers in hvm_hap_nested_page_fault() for every
case as soon as we hit it.

Igor
On Fri, May 22, 2020 at 11:27:38AM +0100, Igor Druzhinin wrote:
> On 22/05/2020 11:23, Roger Pau Monné wrote:
>> Ie: when global logdirty is enabled you will start to get nested page
>> faults for every access, yet only direct MMIO types require this fix?
>
> It's not "only MMIO" - it's just that an MMIO area is hit in my particular
> case. I'd prefer this fix to address the general issue, otherwise for SVM
> we would have to write handlers in hvm_hap_nested_page_fault() for every
> case as soon as we hit it.

Hm, I'm not sure I agree. p2m memory types are limited, and IMO we want
to have strict control over how they are handled. hvm_hap_nested_page_fault
is already full of special casing for each memory type for that reason.

That being said, I also don't like the fact that logdirty is handled
differently between EPT and NPT, as on EPT it's handled as a misconfig
while on NPT it's handled as a violation.

Thanks, Roger.
On 22.05.2020 13:11, Roger Pau Monné wrote:
> That being said, I also don't like the fact that logdirty is handled
> differently between EPT and NPT, as on EPT it's handled as a
> misconfig while on NPT it's handled as a violation.

Because, well, there is no concept of misconfig in NPT.

Jan
On 22/05/2020 14:04, Jan Beulich wrote:
> On 22.05.2020 13:11, Roger Pau Monné wrote:
>> That being said, I also don't like the fact that logdirty is handled
>> differently between EPT and NPT, as on EPT it's handled as a
>> misconfig while on NPT it's handled as a violation.
> Because, well, there is no concept of misconfig in NPT.

Indeed. Intel chose to split EPT errors into two - MISCONFIG for
structural errors (not present, or reserved bits set) and VIOLATION for
permission errors.

AMD reused the same silicon pagewalker design, so has a single NPT_FAULT
vmexit which behaves much more like a regular pagefault, encoding
structural vs permission errors in the error code.

~Andrew
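To make the last point concrete: the NPF error code (EXITINFO1) is laid out
like an ordinary #PF error code, so structural and permission problems are
distinguished by individual bits rather than by separate exit reasons as on
EPT. A small self-contained decoder using the standard #PF bit positions
(the NPT-specific remarks in the comments are an approximation based on
this thread):

#include <stdint.h>
#include <stdio.h>

/* Standard x86 page-fault error-code bits, reused by the NPF EXITINFO1. */
#define PFEC_page_present  (1u << 0)  /* clear => structural: not present   */
#define PFEC_write_access  (1u << 1)  /* permission: write to r/o entry     */
#define PFEC_user_mode     (1u << 2)  /* "user" access; per this thread the */
                                      /* bit tripped by recalc-marked PTEs  */
#define PFEC_reserved_bit  (1u << 3)  /* structural: reserved bit set       */
#define PFEC_insn_fetch    (1u << 4)  /* permission: instruction fetch      */

static void decode_npf(uint64_t ec)
{
    printf("NPF ec=%#llx: %s %s %s%s%s\n", (unsigned long long)ec,
           (ec & PFEC_page_present) ? "present" : "not-present",
           (ec & PFEC_write_access) ? "write" : "read",
           (ec & PFEC_user_mode)    ? "user" : "supervisor",
           (ec & PFEC_reserved_bit) ? " rsvd" : "",
           (ec & PFEC_insn_fetch)   ? " ifetch" : "");
}

int main(void)
{
    decode_npf(0x5);  /* present + user read, e.g. a recalc-style fault */
    decode_npf(0x2);  /* write to a not-present entry: structural       */
    return 0;
}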
On Fri, May 22, 2020 at 02:11:15PM +0100, Andrew Cooper wrote:
> On 22/05/2020 14:04, Jan Beulich wrote:
>> On 22.05.2020 13:11, Roger Pau Monné wrote:
>>> That being said, I also don't like the fact that logdirty is handled
>>> differently between EPT and NPT, as on EPT it's handled as a
>>> misconfig while on NPT it's handled as a violation.
>> Because, well, there is no concept of misconfig in NPT.
>
> Indeed. Intel chose to split EPT errors into two - MISCONFIG for
> structural errors (not present, or reserved bits set) and VIOLATION for
> permission errors.
>
> AMD reused the same silicon pagewalker design, so has a single NPT_FAULT
> vmexit which behaves much more like a regular pagefault, encoding
> structural vs permission errors in the error code.

Maybe I should clarify: I understand that NPT doesn't have such a
differentiation in nested page table faults compared to EPT, but I feel
like it would be clearer if part of the code could be shared, i.e. unify
EPT resolve_misconfig and NPT do_recalc into a single function, for
example, that uses the necessary p2m-> helpers for the differing
implementations. I think we should be able to tell apart when an NPT page
fault is a recalc one by looking at the bits in the EXITINFO1 error field?

Anyway, this was just a rant, and it's tangential to the issue at hand -
sorry for the distraction.

Roger.
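The kind of sharing being suggested could, very roughly, look like a single
generic entry point dispatching to per-implementation hooks. This is purely
a hypothetical sketch - the hook name and prototypes are invented here and
are not an existing Xen interface; today EPT's resolve_misconfig() and
NPT's do_recalc() remain separate.

/*
 * Hypothetical sketch only: one generic "resolve deferred recalculation"
 * entry point, with the EPT/NPT specifics behind a per-p2m hook.  All
 * names and prototypes below are invented for illustration.
 */
struct p2m_domain;

struct p2m_recalc_ops {
    /* Return >0 if entries were changed, 0 if nothing to do, <0 on error. */
    int (*resolve_deferred)(struct p2m_domain *p2m, unsigned long gfn);
};

static int p2m_resolve_deferred_changes(struct p2m_domain *p2m,
                                        const struct p2m_recalc_ops *ops,
                                        unsigned long gfn)
{
    /* Common bookkeeping (locking, accounting) would live here. */
    return ops->resolve_deferred(p2m, gfn);
}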
On 22.05.2020 12:19, Andrew Cooper wrote:
> On 22/05/2020 11:05, Igor Druzhinin wrote:
>> It does fix it up. The problem is that currently in SVM we enter
>> svm_do_nested_pgfault() immediately after p2m_pt_handle_deferred_changes()
>> is finished.
>
> Oh - so we do. I'd read the entry condition for svm_do_nested_pgfault()
> incorrectly.
>
> Jan - why did you choose to do it this way? If
> p2m_pt_handle_deferred_changes() has made a modification, there is
> surely nothing relevant to do in svm_do_nested_pgfault().

Why? There can very well be multiple reasons for a single exit to occur,
and hence it did seem to make sense to allow processing of such possible
further reasons right away, instead of re-entering the guest just to deal
with another VM exit.

What I can accept is that the error code may not correctly represent the
"remaining" fault reason anymore at this point.

Jan
On 22/05/2020 14:32, Roger Pau Monné wrote:
> On Fri, May 22, 2020 at 02:11:15PM +0100, Andrew Cooper wrote:
>> [...]
>>
>> AMD reused the same silicon pagewalker design, so has a single NPT_FAULT
>> vmexit which behaves much more like a regular pagefault, encoding
>> structural vs permission errors in the error code.
>
> Maybe I should clarify: I understand that NPT doesn't have such a
> differentiation in nested page table faults compared to EPT, but I feel
> like it would be clearer if part of the code could be shared, i.e. unify
> EPT resolve_misconfig and NPT do_recalc into a single function, for
> example, that uses the necessary p2m-> helpers for the differing
> implementations. I think we should be able to tell apart when an NPT page
> fault is a recalc one by looking at the bits in the EXITINFO1 error field?

But they use fundamentally different mechanisms. EPT uses an invalid
caching type, while NPT sets the User bit (and even this is going to have
to change when we want to support GMET for Windows VBS in the long term).

You could abstract a few things out into common logic, but none of the
bit positions match (not even the recalc software bit), and the result
would be more complicated than its current form.

~Andrew
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 46a1aac..f0d0bd3 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1726,6 +1726,10 @@ static void svm_do_nested_pgfault(struct vcpu *v,
         /* inject #VMEXIT(NPF) into guest. */
         nestedsvm_vmexit_defer(v, VMEXIT_NPF, pfec, gpa);
         return;
+    case 0:
+        /* If a recalculation page fault hasn't been handled - just retry. */
+        if ( pfec & PFEC_user_mode )
+            return;
     }

     /* Everything else is an error. */
If a recalculation NPT fault hasn't been handled explicitly in
hvm_hap_nested_page_fault() then it's potentially safe to retry -
US bit has been re-instated in PTE and any real fault would be correctly
re-raised next time.

This covers a specific case of migration with vGPU assigned on AMD:
global log-dirty is enabled and causes immediate recalculation NPT
fault in MMIO area upon access. This type of fault isn't described
explicitly in hvm_hap_nested_page_fault (this isn't called on
EPT misconfig exit on Intel) which results in domain crash.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
 xen/arch/x86/hvm/svm/svm.c | 4 ++++
 1 file changed, 4 insertions(+)