Message ID: 1590712553-7298-1-git-send-email-igor.druzhinin@citrix.com
State: Superseded
Series: x86/svm: do not try to handle recalc NPT faults immediately
On Fri, May 29, 2020 at 01:35:53AM +0100, Igor Druzhinin wrote:
> A recalculation NPT fault doesn't always require additional handling
> in hvm_hap_nested_page_fault(), moreover in general case if there is no
> explicit handling done there - the fault is wrongly considered fatal.
>
> Instead of trying to be opportunistic - use safer approach and handle
> P2M recalculation in a separate NPT fault by attempting to retry after
> making the necessary adjustments. This is aligned with Intel behavior
> where there are separate VMEXITs for recalculation and EPT violations
> (faults) and only faults are handled in hvm_hap_nested_page_fault().
> Do it by also unifying do_recalc return code with Intel implementation
> where returning 1 means P2M was actually changed.

That seems like a good approach IMO.

Do you know whether this will make the code slower? (since there are
cases previously handled in a single vmexit that would take two
vmexits now)

> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access.
>
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> This is a safer alternative to:
> https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
> and more correct approach from my PoV.
> ---
>  xen/arch/x86/hvm/svm/svm.c | 5 +++--
>  xen/arch/x86/mm/p2m-pt.c   | 8 ++++++--
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 46a1aac..7f6f578 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
>          v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
>          rc = vmcb->exitinfo1 & PFEC_page_present
>              ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
> -        if ( rc >= 0 )
> +        if ( rc == 0 )
> +            /* If no recal adjustments were being made - handle this fault */
>              svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
> -        else
> +        else if ( rc < 0 )
>          {
>              printk(XENLOG_G_ERR
>                     "%pv: Error %d handling NPF (gpa=%08lx ec=%04lx)\n",
> diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
> index 5c05017..377565b 100644
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -340,7 +340,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>      unsigned long gfn_remainder = gfn;
>      unsigned int level = 4;
>      l1_pgentry_t *pent;
> -    int err = 0;
> +    int err = 0, rc = 0;
>
>      table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
>      while ( --level )
> @@ -402,6 +402,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>                  clear_recalc(l1, e);
>                  err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>                  ASSERT(!err);
> +
> +                rc = 1;
>              }
>          }
>          unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
> @@ -448,12 +450,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>          clear_recalc(l1, e);
>          err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>          ASSERT(!err);
> +
> +        rc = 1;
>      }
>
>   out:
>      unmap_domain_page(table);
>
> -    return err;
> +    return err ? err : rc;

Nit: you can use the elvis operator here: return err ?: rc;

Also I couldn't spot any caller that would have troubles with the
function now returning 1 in certain conditions, can you confirm the
callers have been audited?

Thanks, Roger.
On 29.05.2020 02:35, Igor Druzhinin wrote:
> A recalculation NPT fault doesn't always require additional handling
> in hvm_hap_nested_page_fault(), moreover in general case if there is no
> explicit handling done there - the fault is wrongly considered fatal.
>
> Instead of trying to be opportunistic - use safer approach and handle
> P2M recalculation in a separate NPT fault by attempting to retry after
> making the necessary adjustments. This is aligned with Intel behavior
> where there are separate VMEXITs for recalculation and EPT violations
> (faults) and only faults are handled in hvm_hap_nested_page_fault().
> Do it by also unifying do_recalc return code with Intel implementation
> where returning 1 means P2M was actually changed.
>
> This covers a specific case of migration with vGPU assigned on AMD:
> global log-dirty is enabled and causes immediate recalculation NPT
> fault in MMIO area upon access.

To be honest, from this last paragraph I still can't really derive
what goes wrong exactly why, before this change.

> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
> This is a safer alternative to:
> https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
> and more correct approach from my PoV.

Indeed - I was about to reply there, but then I thought I'd first
look at this patch, in case it was a replacement.

> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
>          v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
>          rc = vmcb->exitinfo1 & PFEC_page_present
>              ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
> -        if ( rc >= 0 )
> +        if ( rc == 0 )
> +            /* If no recal adjustments were being made - handle this fault */
>              svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
> -        else
> +        else if ( rc < 0 )

So from going through the code and judging by the comment in
finish_type_change() (which btw you will need to update, to avoid
it becoming stale) the >= here was there just in case, without
there actually being any case where a positive value would be
returned. If that's also the conclusion you've drawn, then I
think it would help mentioning this in the description.

It is also desirable to mention finish_type_change() not being
affected, as already dealing with the > 0 case.

> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -340,7 +340,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>      unsigned long gfn_remainder = gfn;
>      unsigned int level = 4;
>      l1_pgentry_t *pent;
> -    int err = 0;
> +    int err = 0, rc = 0;
>
>      table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
>      while ( --level )
> @@ -402,6 +402,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>                  clear_recalc(l1, e);
>                  err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>                  ASSERT(!err);
> +
> +                rc = 1;
>              }
>          }
>          unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
> @@ -448,12 +450,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>          clear_recalc(l1, e);
>          err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>          ASSERT(!err);
> +
> +        rc = 1;
>      }
>
>   out:
>      unmap_domain_page(table);
>
> -    return err;
> +    return err ? err : rc;

Typically we write this as "err ?: rc". I'd like to ask that "rc" also
be renamed, to something like "recalc_done", and then to become bool.

Jan
On 29/05/2020 15:33, Roger Pau Monné wrote:
> On Fri, May 29, 2020 at 01:35:53AM +0100, Igor Druzhinin wrote:
>> A recalculation NPT fault doesn't always require additional handling
>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>> explicit handling done there - the fault is wrongly considered fatal.
>>
>> Instead of trying to be opportunistic - use safer approach and handle
>> P2M recalculation in a separate NPT fault by attempting to retry after
>> making the necessary adjustments. This is aligned with Intel behavior
>> where there are separate VMEXITs for recalculation and EPT violations
>> (faults) and only faults are handled in hvm_hap_nested_page_fault().
>> Do it by also unifying do_recalc return code with Intel implementation
>> where returning 1 means P2M was actually changed.
>
> That seems like a good approach IMO.
>
> Do you know whether this will make the code slower? (since there are
> cases previously handled in a single vmexit that would take two
> vmexits now)

The only case I could think of is memory writes during migration - the
first fault would propagate P2M type recalculation while the second
actually logs a dirty page. The slowdown would be only during the live
phase obviously but should be marginal and in line with what we
currently have on Intel.

>> This covers a specific case of migration with vGPU assigned on AMD:
>> global log-dirty is enabled and causes immediate recalculation NPT
>> fault in MMIO area upon access.
>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
>
>> ---
>> This is a safer alternative to:
>> https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
>> and more correct approach from my PoV.
>> ---
>>  xen/arch/x86/hvm/svm/svm.c | 5 +++--
>>  xen/arch/x86/mm/p2m-pt.c   | 8 ++++++--
>>  2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>> index 46a1aac..7f6f578 100644
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
>>          v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
>>          rc = vmcb->exitinfo1 & PFEC_page_present
>>              ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
>> -        if ( rc >= 0 )
>> +        if ( rc == 0 )
>> +            /* If no recal adjustments were being made - handle this fault */
>>              svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
>> -        else
>> +        else if ( rc < 0 )
>>          {
>>              printk(XENLOG_G_ERR
>>                     "%pv: Error %d handling NPF (gpa=%08lx ec=%04lx)\n",
>> diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
>> index 5c05017..377565b 100644
>> --- a/xen/arch/x86/mm/p2m-pt.c
>> +++ b/xen/arch/x86/mm/p2m-pt.c
>> @@ -340,7 +340,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>      unsigned long gfn_remainder = gfn;
>>      unsigned int level = 4;
>>      l1_pgentry_t *pent;
>> -    int err = 0;
>> +    int err = 0, rc = 0;
>>
>>      table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
>>      while ( --level )
>> @@ -402,6 +402,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>                  clear_recalc(l1, e);
>>                  err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>>                  ASSERT(!err);
>> +
>> +                rc = 1;
>>              }
>>          }
>>          unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
>> @@ -448,12 +450,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>          clear_recalc(l1, e);
>>          err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>>          ASSERT(!err);
>> +
>> +        rc = 1;
>>      }
>>
>>   out:
>>      unmap_domain_page(table);
>>
>> -    return err;
>> +    return err ? err : rc;
>
> Nit: you can use the elvis operator here: return err ?: rc;
>
> Also I couldn't spot any caller that would have troubles with the
> function now returning 1 in certain conditions, can you confirm the
> callers have been audited?

Yes, I checked all the callers before making the change. That's actually
where I spotted the Intel side is doing exactly the same already.

Igor
On 29/05/2020 15:34, Jan Beulich wrote:
> On 29.05.2020 02:35, Igor Druzhinin wrote:
>> A recalculation NPT fault doesn't always require additional handling
>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>> explicit handling done there - the fault is wrongly considered fatal.
>>
>> Instead of trying to be opportunistic - use safer approach and handle
>> P2M recalculation in a separate NPT fault by attempting to retry after
>> making the necessary adjustments. This is aligned with Intel behavior
>> where there are separate VMEXITs for recalculation and EPT violations
>> (faults) and only faults are handled in hvm_hap_nested_page_fault().
>> Do it by also unifying do_recalc return code with Intel implementation
>> where returning 1 means P2M was actually changed.
>>
>> This covers a specific case of migration with vGPU assigned on AMD:
>> global log-dirty is enabled and causes immediate recalculation NPT
>> fault in MMIO area upon access.
>
> To be honest, from this last paragraph I still can't really derive
> what goes wrong exactly why, before this change.

I admit it might require some knowledge of how vGPU is implemented. I will try
to give more info in this paragraph.

>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> ---
>> This is a safer alternative to:
>> https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
>> and more correct approach from my PoV.
>
> Indeed - I was about to reply there, but then I thought I'd first
> look at this patch, in case it was a replacement.
>
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
>>          v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
>>          rc = vmcb->exitinfo1 & PFEC_page_present
>>              ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
>> -        if ( rc >= 0 )
>> +        if ( rc == 0 )
>> +            /* If no recal adjustments were being made - handle this fault */
>>              svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
>> -        else
>> +        else if ( rc < 0 )
>
> So from going through the code and judging by the comment in
> finish_type_change() (which btw you will need to update, to avoid
> it becoming stale) the >= here was there just in case, without
> there actually being any case where a positive value would be
> returned. If that's also the conclusion you've drawn, then I
> think it would help mentioning this in the description.

I re-read the comments in finish_type_change() and to me they look
pretty much in line with the now common interface between EPT and NPT
recalc calls. Ok, I will point out that I concluded there was no
additional intent of necessarily calling svm_do_nested_pgfault() after
recalc.

> It is also desirable to mention finish_type_change() not being
> affected, as already dealing with the > 0 case.

Will mention that as well.

>> --- a/xen/arch/x86/mm/p2m-pt.c
>> +++ b/xen/arch/x86/mm/p2m-pt.c
>> @@ -340,7 +340,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>      unsigned long gfn_remainder = gfn;
>>      unsigned int level = 4;
>>      l1_pgentry_t *pent;
>> -    int err = 0;
>> +    int err = 0, rc = 0;
>>
>>      table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
>>      while ( --level )
>> @@ -402,6 +402,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>                  clear_recalc(l1, e);
>>                  err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>>                  ASSERT(!err);
>> +
>> +                rc = 1;
>>              }
>>          }
>>          unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
>> @@ -448,12 +450,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>          clear_recalc(l1, e);
>>          err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>>          ASSERT(!err);
>> +
>> +        rc = 1;
>>      }
>>
>>   out:
>>      unmap_domain_page(table);
>>
>> -    return err;
>> +    return err ? err : rc;
>
> Typically we write this as "err ?: rc". I'd like to ask that "rc" also
> be renamed, to something like "recalc_done", and then to become bool.

Sure.

Igor
On 29/05/2020 16:17, Igor Druzhinin wrote:
> On 29/05/2020 15:34, Jan Beulich wrote:
>> On 29.05.2020 02:35, Igor Druzhinin wrote:
>>> A recalculation NPT fault doesn't always require additional handling
>>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>>> explicit handling done there - the fault is wrongly considered fatal.
>>>
>>> Instead of trying to be opportunistic - use safer approach and handle
>>> P2M recalculation in a separate NPT fault by attempting to retry after
>>> making the necessary adjustments. This is aligned with Intel behavior
>>> where there are separate VMEXITs for recalculation and EPT violations
>>> (faults) and only faults are handled in hvm_hap_nested_page_fault().
>>> Do it by also unifying do_recalc return code with Intel implementation
>>> where returning 1 means P2M was actually changed.
>>>
>>> This covers a specific case of migration with vGPU assigned on AMD:
>>> global log-dirty is enabled and causes immediate recalculation NPT
>>> fault in MMIO area upon access.
>>
>> To be honest, from this last paragraph I still can't really derive
>> what goes wrong exactly why, before this change.
>
> I admit it might require some knowledge of how vGPU is implemented. I will try
> to give more info in this paragraph.
>
>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>> ---
>>> This is a safer alternative to:
>>> https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
>>> and more correct approach from my PoV.
>>
>> Indeed - I was about to reply there, but then I thought I'd first
>> look at this patch, in case it was a replacement.
>>
>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>> @@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
>>>          v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
>>>          rc = vmcb->exitinfo1 & PFEC_page_present
>>>              ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
>>> -        if ( rc >= 0 )
>>> +        if ( rc == 0 )
>>> +            /* If no recal adjustments were being made - handle this fault */
>>>              svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
>>> -        else
>>> +        else if ( rc < 0 )
>>
>> So from going through the code and judging by the comment in
>> finish_type_change() (which btw you will need to update, to avoid
>> it becoming stale) the >= here was there just in case, without
>> there actually being any case where a positive value would be
>> returned. If that's also the conclusion you've drawn, then I
>> think it would help mentioning this in the description.
>
> I re-read the comments in finish_type_change() and to me they look
> pretty much in line with the now common interface between EPT and NPT
> recalc calls.

Sorry, upon close examination there is indeed a new case missed.
Thanks for pointing out.

Igor
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 46a1aac..7f6f578 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
         v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
         rc = vmcb->exitinfo1 & PFEC_page_present
             ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
-        if ( rc >= 0 )
+        if ( rc == 0 )
+            /* If no recal adjustments were being made - handle this fault */
             svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
-        else
+        else if ( rc < 0 )
         {
             printk(XENLOG_G_ERR
                    "%pv: Error %d handling NPF (gpa=%08lx ec=%04lx)\n",
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 5c05017..377565b 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -340,7 +340,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
     unsigned long gfn_remainder = gfn;
     unsigned int level = 4;
     l1_pgentry_t *pent;
-    int err = 0;
+    int err = 0, rc = 0;
 
     table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
     while ( --level )
@@ -402,6 +402,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
                 clear_recalc(l1, e);
                 err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
                 ASSERT(!err);
+
+                rc = 1;
             }
         }
         unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
@@ -448,12 +450,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
         clear_recalc(l1, e);
         err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
         ASSERT(!err);
+
+        rc = 1;
     }
 
  out:
     unmap_domain_page(table);
 
-    return err;
+    return err ? err : rc;
 }
 
 int p2m_pt_handle_deferred_changes(uint64_t gpa)
A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(), moreover in general case if there is no
explicit handling done there - the fault is wrongly considered fatal.

Instead of trying to be opportunistic - use safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying do_recalc return code with Intel implementation
where returning 1 means P2M was actually changed.

This covers a specific case of migration with vGPU assigned on AMD:
global log-dirty is enabled and causes immediate recalculation NPT
fault in MMIO area upon access.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
This is a safer alternative to:
https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
and more correct approach from my PoV.
---
 xen/arch/x86/hvm/svm/svm.c | 5 +++--
 xen/arch/x86/mm/p2m-pt.c   | 8 ++++++--
 2 files changed, 9 insertions(+), 4 deletions(-)