
[RFC] x86/p2m-pt: do type recalculations with p2m read lock

Message ID 20230403101449.93323-1-roger.pau@citrix.com (mailing list archive)
State New, archived
Series: [RFC] x86/p2m-pt: do type recalculations with p2m read lock

Commit Message

Roger Pau Monné April 3, 2023, 10:14 a.m. UTC
Global p2m type recalculations (as triggered by logdirty) can create
so much contention on the p2m lock that simple guest operations like
VCPUOP_set_singleshot_timer on guests with a high amount of vCPUs (32)
will cease to work in a timely manner, up to the point that Linux
kernel versions that still use the VCPU_SSHOTTMR_future flag with the
singleshot timer will cease to work:

[   82.779470] CE: xen increased min_delta_ns to 1000000 nsec
[   82.793075] CE: Reprogramming failure. Giving up
[   82.779470] CE: Reprogramming failure. Giving up
[   82.821864] CE: xen increased min_delta_ns to 506250 nsec
[   82.821864] CE: xen increased min_delta_ns to 759375 nsec
[   82.821864] CE: xen increased min_delta_ns to 1000000 nsec
[   82.821864] CE: Reprogramming failure. Giving up
[   82.856256] CE: Reprogramming failure. Giving up
[   84.566279] CE: Reprogramming failure. Giving up
[   84.649493] Freezing user space processes ...
[  130.604032] INFO: rcu_sched detected stalls on CPUs/tasks: { 14} (detected by 10, t=60002 jiffies, g=4006, c=4005, q=14130)
[  130.604032] Task dump for CPU 14:
[  130.604032] swapper/14      R  running task        0     0      1 0x00000000
[  130.604032] Call Trace:
[  130.604032]  [<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
[  130.604032]  [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
[  130.604032]  [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
[  130.604032]  [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
[  130.604032]  [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
[  130.604032]  [<ffffffff900000d5>] ? start_cpu+0x5/0x14
[  549.654536] INFO: rcu_sched detected stalls on CPUs/tasks: { 26} (detected by 24, t=60002 jiffies, g=6922, c=6921, q=7013)
[  549.655463] Task dump for CPU 26:
[  549.655463] swapper/26      R  running task        0     0      1 0x00000000
[  549.655463] Call Trace:
[  549.655463]  [<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
[  549.655463]  [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
[  549.655463]  [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
[  549.655463]  [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
[  549.655463]  [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
[  549.655463]  [<ffffffff900000d5>] ? start_cpu+0x5/0x14
[  821.888478] INFO: rcu_sched detected stalls on CPUs/tasks: { 26} (detected by 24, t=60002 jiffies, g=8499, c=8498, q=7664)
[  821.888596] Task dump for CPU 26:
[  821.888622] swapper/26      R  running task        0     0      1 0x00000000
[  821.888677] Call Trace:
[  821.888712]  [<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
[  821.888771]  [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
[  821.888818]  [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
[  821.888865]  [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
[  821.888917]  [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
[  821.888966]  [<ffffffff900000d5>] ? start_cpu+0x5/0x14

This is obviously undesirable.  One way to bodge the issue would be to
ignore VCPU_SSHOTTMR_future, but that's a deliberate breakage of the
hypercall ABI.

Instead lower the contention in the lock by doing the recalculation
with the lock in read mode.  This is safe because only the flags/type
are changed, there's no PTE mfn change in the AMD recalculation logic.
The Intel (EPT) case is likely more complicated, as superpage
splitting for diverging EMT values must be done with the p2m lock
taken in write mode.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I'm unsure whether such modification is fully safe:  I think changing
the flags/type should be fine: the PTE write is performed using
safwrite_p2m_entry() which must be atomic (as the guest is still
running and accessing the page tables).  I'm slightly worried about
all PTE readers not using atomic accesses to do so (ie: pointer
returned by p2m_find_entry() should be read atomically), and code
assuming that a gfn type cannot change while holding the p2m lock in
read mode.

Wanted to post early in case someone knows of any showstoppers with this
approach that make it a no-go, before I try to further evaluate
users.
---
 xen/arch/x86/mm/p2m-pt.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comments

Jan Beulich April 3, 2023, 12:39 p.m. UTC | #1
On 03.04.2023 12:14, Roger Pau Monne wrote:
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -486,9 +486,6 @@ static int cf_check do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>          p2m_type_t ot, nt;
>          unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
>  
> -        if ( !valid_recalc(l1, e) )
> -            P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
> -                      p2m->domain->domain_id, gfn, level);
>          ot = p2m_flags_to_type(l1e_get_flags(e));
>          nt = p2m_recalc_type_range(true, ot, p2m, gfn & mask, gfn | ~mask);
>          if ( nt != ot )

I'm afraid I neither understand why you make this change, nor why you
then leave the other use of valid_recalc() in place.

> @@ -538,9 +535,9 @@ int p2m_pt_handle_deferred_changes(uint64_t gpa)
>       */
>      ASSERT(!altp2m_active(current->domain));
>  
> -    p2m_lock(p2m);
> +    p2m_read_lock(p2m);
>      rc = do_recalc(p2m, PFN_DOWN(gpa));
> -    p2m_unlock(p2m);
> +    p2m_read_unlock(p2m);
>  
>      return rc;
>  }

How can this be safe, when do_recalc() involves p2m_next_level(), which
may install new (intermediate) page tables?

Jan
Roger Pau Monné April 3, 2023, 2:27 p.m. UTC | #2
On Mon, Apr 03, 2023 at 02:39:08PM +0200, Jan Beulich wrote:
> On 03.04.2023 12:14, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/mm/p2m-pt.c
> > +++ b/xen/arch/x86/mm/p2m-pt.c
> > @@ -486,9 +486,6 @@ static int cf_check do_recalc(struct p2m_domain *p2m, unsigned long gfn)
> >          p2m_type_t ot, nt;
> >          unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
> >  
> > -        if ( !valid_recalc(l1, e) )
> > -            P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
> > -                      p2m->domain->domain_id, gfn, level);
> >          ot = p2m_flags_to_type(l1e_get_flags(e));
> >          nt = p2m_recalc_type_range(true, ot, p2m, gfn & mask, gfn | ~mask);
> >          if ( nt != ot )
> 
> I'm afraid I neither understand why you make this change, nor why you
> then leave the other use of valid_recalc() in place.

The message can be bogus if we allow concurrent do_recalc(), and I
did miss the previous one.

I missed the one at the top.  Originally I wanted to send the RFC with
just changing the lock to read mode, but then I thought I might as
well fix that (now bogus) print message.

> > @@ -538,9 +535,9 @@ int p2m_pt_handle_deferred_changes(uint64_t gpa)
> >       */
> >      ASSERT(!altp2m_active(current->domain));
> >  
> > -    p2m_lock(p2m);
> > +    p2m_read_lock(p2m);
> >      rc = do_recalc(p2m, PFN_DOWN(gpa));
> > -    p2m_unlock(p2m);
> > +    p2m_read_unlock(p2m);
> >  
> >      return rc;
> >  }
> 
> How can this be safe, when do_recalc() involves p2m_next_level(), which
> may install new (intermediate) page tables?

Oh, great, didn't realize it was capable of doing so, it's more hidden
than in the EPT case.  Seems like this will only happen if a superpage
needs to be split because a lower order frame is being used as an
ioreq server page.

Do you think it would be safe to attempt to perform the recalc
with the read lock only and fallback to the write lock if there's a
need to call p2m_next_level()?

Do you agree it might be possible to do the recalc with just the read
lock if it's updating the PTE type / recalc flags only?

Thanks, Roger.
Jan Beulich April 3, 2023, 3:30 p.m. UTC | #3
On 03.04.2023 16:27, Roger Pau Monné wrote:
> On Mon, Apr 03, 2023 at 02:39:08PM +0200, Jan Beulich wrote:
>> On 03.04.2023 12:14, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/mm/p2m-pt.c
>>> +++ b/xen/arch/x86/mm/p2m-pt.c
>>> @@ -486,9 +486,6 @@ static int cf_check do_recalc(struct p2m_domain *p2m, unsigned long gfn)
>>>          p2m_type_t ot, nt;
>>>          unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
>>>  
>>> -        if ( !valid_recalc(l1, e) )
>>> -            P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
>>> -                      p2m->domain->domain_id, gfn, level);
>>>          ot = p2m_flags_to_type(l1e_get_flags(e));
>>>          nt = p2m_recalc_type_range(true, ot, p2m, gfn & mask, gfn | ~mask);
>>>          if ( nt != ot )
>>
>> I'm afraid I neither understand why you make this change, nor why you
>> then leave the other use of valid_recalc() in place.
> 
> The message can be bogus if we allow concurrent do_recalc(), and I
> did miss the previous one.
> 
> I missed the one at the top.  Originally I wanted to send the RFC with
> just changing the lock to read mode, but then I thought I might as
> well fix that (now bogus) print message.
> 
>>> @@ -538,9 +535,9 @@ int p2m_pt_handle_deferred_changes(uint64_t gpa)
>>>       */
>>>      ASSERT(!altp2m_active(current->domain));
>>>  
>>> -    p2m_lock(p2m);
>>> +    p2m_read_lock(p2m);
>>>      rc = do_recalc(p2m, PFN_DOWN(gpa));
>>> -    p2m_unlock(p2m);
>>> +    p2m_read_unlock(p2m);
>>>  
>>>      return rc;
>>>  }
>>
>> How can this be safe, when do_recalc() involves p2m_next_level(), which
>> may install new (intermediate) page tables?
> 
> Oh, great, didn't realize it was capable of doing so, it's more hidden
> than in the EPT case.  Seems like this will only happen if a superpage
> needs to be split because a lower order frame is being used as an
> ioreq server page.
> 
> Do you think it would be safe to attempt to perform the recalc
> with the read lock only and fallback to the write lock if there's a
> need to call p2m_next_level()?

Yes, that ought to be okay.

> Do you agree it might be possible to do the recalc with just the read
> lock if it's updating the PTE type / recalc flags only?

Technically this looks to be possible, yes. The question is whether we
do ourselves much good by introducing such a special case, permitting
a certain kind of write with the lock only held in read mode. At the
latest when we find a second (more or less similar) use case, things
are likely to become difficult.

Jan
Jan Beulich April 3, 2023, 3:32 p.m. UTC | #4
On 03.04.2023 12:14, Roger Pau Monne wrote:
> Global p2m type recalculations (as triggered by logdirty) can create
> so much contention on the p2m lock that simple guest operations like
> VCPUOP_set_singleshot_timer on guests with a high amount of vCPUs (32)
> will cease to work in a timely manner, up to the point that Linux
> kernel versions that still use the VCPU_SSHOTTMR_future flag with the
> singleshot timer will cease to work:
> 
> [...]
> 
> This is obviously undesirable.  One way to bodge the issue would be to
> ignore VCPU_SSHOTTMR_future, but that's a deliberate breakage of the
> hypercall ABI.
> 
> Instead lower the contention in the lock by doing the recalculation
> with the lock in read mode.  This is safe because only the flags/type
> are changed, there's no PTE mfn change in the AMD recalculation logic.
> The Intel (EPT) case is likely more complicated, as superpage
> splitting for diverging EMT values must be done with the p2m lock
> taken in write mode.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> I'm unsure whether such modification is fully safe:  I think changing
> the flags/type should be fine: the PTE write is performed using
> safwrite_p2m_entry() which must be atomic (as the guest is still
> running and accessing the page tables).  I'm slightly worried about
> all PTE readers not using atomic accesses to do so (ie: pointer
> returned by p2m_find_entry() should be read atomically), and code
> assuming that a gfn type cannot change while holding the p2m lock in
> read mode.

Coming back to this: Yes, I think reads (at least the ones in do_recalc()
which can now be done in parallel) will need to be tightened if this is a
road we want to follow.

Jan
Roger Pau Monné April 3, 2023, 3:38 p.m. UTC | #5
On Mon, Apr 03, 2023 at 05:32:39PM +0200, Jan Beulich wrote:
> On 03.04.2023 12:14, Roger Pau Monne wrote:
> > Global p2m type recalculations (as triggered by logdirty) can create
> > so much contention on the p2m lock that simple guest operations like
> > VCPUOP_set_singleshot_timer on guests with a high amount of vCPUs (32)
> > will cease to work in a timely manner, up to the point that Linux
> > kernel versions that still use the VCPU_SSHOTTMR_future flag with the
> > singleshot timer will cease to work:
> > 
> > [...]
> > 
> > This is obviously undesirable.  One way to bodge the issue would be to
> > ignore VCPU_SSHOTTMR_future, but that's a deliberate breakage of the
> > hypercall ABI.
> > 
> > Instead lower the contention in the lock by doing the recalculation
> > with the lock in read mode.  This is safe because only the flags/type
> > are changed, there's no PTE mfn change in the AMD recalculation logic.
> > The Intel (EPT) case is likely more complicated, as superpage
> > splitting for diverging EMT values must be done with the p2m lock
> > taken in write mode.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > I'm unsure whether such modification is fully safe:  I think changing
> > the flags/type should be fine: the PTE write is performed using
> > safwrite_p2m_entry() which must be atomic (as the guest is still
> > running and accessing the page tables).  I'm slightly worried about
> > all PTE readers not using atomic accesses to do so (ie: pointer
> > returned by p2m_find_entry() should be read atomically), and code
> > assuming that a gfn type cannot change while holding the p2m lock in
> > read mode.
> 
> Coming back to this: Yes, I think reads (at least the ones in do_recalc()
> which can now be done in parallel) will need to be tightened if this is a
> road we want to follow.

There are likely a lot of reads under the p2m read lock outside of
do_recalc() that will ideally need to be switched to use atomic
accesses also?

I'm open to suggestions to other ways to get this sorted.  And that's
a guest with 'just' 32 vCPUs, as we go up the contention on the p2m
lock during recalcs/misconfigs is going to increase massively.

Thanks, Roger.
Roger Pau Monné April 3, 2023, 3:42 p.m. UTC | #6
On Mon, Apr 03, 2023 at 05:30:37PM +0200, Jan Beulich wrote:
> On 03.04.2023 16:27, Roger Pau Monné wrote:
> > On Mon, Apr 03, 2023 at 02:39:08PM +0200, Jan Beulich wrote:
> >> On 03.04.2023 12:14, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/mm/p2m-pt.c
> >>> +++ b/xen/arch/x86/mm/p2m-pt.c
> >>> @@ -486,9 +486,6 @@ static int cf_check do_recalc(struct p2m_domain *p2m, unsigned long gfn)
> >>>          p2m_type_t ot, nt;
> >>>          unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
> >>>  
> >>> -        if ( !valid_recalc(l1, e) )
> >>> -            P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
> >>> -                      p2m->domain->domain_id, gfn, level);
> >>>          ot = p2m_flags_to_type(l1e_get_flags(e));
> >>>          nt = p2m_recalc_type_range(true, ot, p2m, gfn & mask, gfn | ~mask);
> >>>          if ( nt != ot )
> >>
> >> I'm afraid I neither understand why you make this change, nor why you
> >> then leave the other use of valid_recalc() in place.
> > 
> > The message can be bogus if we allow concurrent do_recalc(), and I
> > did miss the previous one.
> > 
> > I missed the one at the top.  Originally I wanted to send the RFC with
> > just changing the lock to read mode, but then I thought I might as
> > well fix that (now bogus) print message.
> > 
> >>> @@ -538,9 +535,9 @@ int p2m_pt_handle_deferred_changes(uint64_t gpa)
> >>>       */
> >>>      ASSERT(!altp2m_active(current->domain));
> >>>  
> >>> -    p2m_lock(p2m);
> >>> +    p2m_read_lock(p2m);
> >>>      rc = do_recalc(p2m, PFN_DOWN(gpa));
> >>> -    p2m_unlock(p2m);
> >>> +    p2m_read_unlock(p2m);
> >>>  
> >>>      return rc;
> >>>  }
> >>
> >> How can this be safe, when do_recalc() involves p2m_next_level(), which
> >> may install new (intermediate) page tables?
> > 
> > Oh, great, didn't realize it was capable of doing so, it's more hidden
> > than in the EPT case.  Seems like this will only happen if a superpage
> > needs to be split because a lower order frame is being used as an
> > ioreq server page.
> > 
> > Do you think it would be safe to attempt to perform the recalc
> > with the read lock only and fallback to the write lock if there's a
> > need to call p2m_next_level()?
> 
> Yes, that ought to be okay.
> 
> > Do you agree it might be possible to do the recalc with just the read
> > lock if it's updating the PTE type / recalc flags only?
> 
> Technically this looks to be possible, yes. The question is whether we
> do ourselves much good by introducing such a special case, permitting
> a certain kind of write with the lock only held in read mode. At the
> latest when we find a second (more or less similar) use case, things
> are likely to become difficult.

Yes, it's not very nice.  I'm open to suggestions of other ways to
remove some of the contention.  If we go down this route it would need
to be clearly documented.

Thanks, Roger.
Jan Beulich April 3, 2023, 4:07 p.m. UTC | #7
On 03.04.2023 17:38, Roger Pau Monné wrote:
> On Mon, Apr 03, 2023 at 05:32:39PM +0200, Jan Beulich wrote:
>> On 03.04.2023 12:14, Roger Pau Monne wrote:
>>> Global p2m type recalculations (as triggered by logdirty) can create
>>> so much contention on the p2m lock that simple guest operations like
>>> VCPUOP_set_singleshot_timer on guests with a high amount of vCPUs (32)
>>> will cease to work in a timely manner, up to the point that Linux
> >>> kernel versions that still use the VCPU_SSHOTTMR_future flag with the
>>> singleshot timer will cease to work:
>>>
> >>> [...]
>>>
>>> This is obviously undesirable.  One way to bodge the issue would be to
>>> ignore VCPU_SSHOTTMR_future, but that's a deliberate breakage of the
>>> hypercall ABI.
>>>
>>> Instead lower the contention in the lock by doing the recalculation
>>> with the lock in read mode.  This is safe because only the flags/type
>>> are changed, there's no PTE mfn change in the AMD recalculation logic.
>>> The Intel (EPT) case is likely more complicated, as superpage
> >>> splitting for diverging EMT values must be done with the p2m lock
> >>> taken in write mode.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> I'm unsure whether such modification is fully safe:  I think changing
>>> the flags/type should be fine: the PTE write is performed using
>>> safwrite_p2m_entry() which must be atomic (as the guest is still
>>> running and accessing the page tables).  I'm slightly worried about
>>> all PTE readers not using atomic accesses to do so (ie: pointer
> >>> returned by p2m_find_entry() should be read atomically), and code
>>> assuming that a gfn type cannot change while holding the p2m lock in
>>> read mode.
>>
>> Coming back to this: Yes, I think reads (at least the ones in do_recalc()
>> which can now be done in parallel) will need to be tightened if this is a
>> road we want to follow.
> 
> There are likely a lot of reads under the p2m read lock outside of
> do_recalc() that will ideally need to be switched to use atomic
> accesses also?

Possibly, perhaps even likely. I specifically said "at least". But ones
clearly on write-locked paths could probably be left alone.

> I'm open to suggestions to other ways to get this sorted.  And that's
> a guest with 'just' 32 vCPUs, as we go up the contention on the p2m
> lock during recalcs/misconfigs is going to increase massively.

I'm afraid I don't have any really good idea, but I'm wondering whether
trylock with (almost?) immediate exit back to guest might make this any
better. At least the guest could then take interrupts before the insn
is retried. Another thought in this direction would be to have a variant
of trylock which "senses" how contended the lock is, to spin if it's the
first one to wait, but exit (fail) otherwise.

Jan
Roger Pau Monné April 4, 2023, 9:49 a.m. UTC | #8
On Mon, Apr 03, 2023 at 06:07:26PM +0200, Jan Beulich wrote:
> On 03.04.2023 17:38, Roger Pau Monné wrote:
> > On Mon, Apr 03, 2023 at 05:32:39PM +0200, Jan Beulich wrote:
> >> On 03.04.2023 12:14, Roger Pau Monne wrote:
> >>> Global p2m type recalculations (as triggered by logdirty) can create
> >>> so much contention on the p2m lock that simple guest operations like
> >>> VCPUOP_set_singleshot_timer on guests with a high amount of vCPUs (32)
> >>> will cease to work in a timely manner, up to the point that Linux
> >>> kernel versions that sill use the VCPU_SSHOTTMR_future flag with the
> >>> singleshot timer will cease to work:
> >>>
> > >>> [...]
> >>>
> >>> This is obviously undesirable.  One way to bodge the issue would be to
> >>> ignore VCPU_SSHOTTMR_future, but that's a deliberate breakage of the
> >>> hypercall ABI.
> >>>
> >>> Instead lower the contention in the lock by doing the recalculation
> >>> with the lock in read mode.  This is safe because only the flags/type
> >>> are changed, there's no PTE mfn change in the AMD recalculation logic.
> >>> The Intel (EPT) case is likely more complicated, as superpage
> > >>> splitting for diverging EMT values must be done with the p2m lock
> > >>> taken in write mode.
> >>>
> >>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>> ---
> >>> I'm unsure whether such modification is fully safe:  I think changing
> >>> the flags/type should be fine: the PTE write is performed using
> >>> safwrite_p2m_entry() which must be atomic (as the guest is still
> >>> running and accessing the page tables).  I'm slightly worried about
> >>> all PTE readers not using atomic accesses to do so (ie: pointer
> > >>> returned by p2m_find_entry() should be read atomically), and code
> >>> assuming that a gfn type cannot change while holding the p2m lock in
> >>> read mode.
> >>
> >> Coming back to this: Yes, I think reads (at least the ones in do_recalc()
> >> which can now be done in parallel) will need to be tightened if this is a
> >> road we want to follow.
> > 
> > There are likely a lot of reads under the p2m read lock outside of
> > do_recalc() that will ideally need to be switched to use atomic
> > accesses also?
> 
> Possibly, perhaps even likely. I specifically said "at least". But ones
> clearly on write-locked paths could probably be left alone.
> 
> > I'm open to suggestions to other ways to get this sorted.  And that's
> > a guest with 'just' 32 vCPUs, as we go up the contention on the p2m
> > lock during recalcs/misconfigs is going to increase massively.
> 
> I'm afraid I don't have any really good idea, but I'm wondering whether
> trylock with (almost?) immediate exit back to guest might make this any
> better. At least the guest could then take interrupts before the insn
> is retried. Another thought in this direction would be to have a variant
> of trylock which "senses" how contended the lock is, to spin if it's the
> first one to wait, but exit (fail) otherwise.

Using trylock in the recalc path could starve those quite badly, as
readers can acquire the lock concurrently.  Also we would lose the
fairness.

Using trylock on VCPUOP_set_singleshot_timer (in order to fetch the
data from the guest provided pointer) would lead to the same situation
AFAICT, as guests using VCPU_SSHOTTMR_future will likely see the time
expired by the point the hypercall checks it.

One thing that I've noticed is that copy_from_guest for HVM
(__hvm_copy()) takes and releases the p2m lock in read mode at several
points.  It would likely be better if the whole GVA -> MFN translation
were resolved inside a single read-locked region, as that would
avoid stalls in the middle of the operation, but I don't think this
will solve the issue at hand.

Thanks, Roger.

Patch

diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index cd1af33b67..f145647f01 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -486,9 +486,6 @@  static int cf_check do_recalc(struct p2m_domain *p2m, unsigned long gfn)
         p2m_type_t ot, nt;
         unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
 
-        if ( !valid_recalc(l1, e) )
-            P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
-                      p2m->domain->domain_id, gfn, level);
         ot = p2m_flags_to_type(l1e_get_flags(e));
         nt = p2m_recalc_type_range(true, ot, p2m, gfn & mask, gfn | ~mask);
         if ( nt != ot )
@@ -538,9 +535,9 @@  int p2m_pt_handle_deferred_changes(uint64_t gpa)
      */
     ASSERT(!altp2m_active(current->domain));
 
-    p2m_lock(p2m);
+    p2m_read_lock(p2m);
     rc = do_recalc(p2m, PFN_DOWN(gpa));
-    p2m_unlock(p2m);
+    p2m_read_unlock(p2m);
 
     return rc;
 }