Message ID: 20210223023428.757694-1-volodymyr_babchuk@epam.com
Series: Preemption in hypervisor (ARM only)
On 23/02/2021 02:34, Volodymyr Babchuk wrote: > Hello community, Hi Volodymyr, Thank you for the proposal, I like the like of been able to preempt the vCPU thread. This would make easier to implement some of the device emulation in Xen (e.g. vGIC, SMMU). > > Subject of this cover letter is quite self-explanatory. This patch > series implements PoC for preemption in hypervisor mode. > > This is the sort of follow-up to recent discussion about latency > ([1]). > > Motivation > ========== > > It is well known that Xen is not preemptable. On other words, it is > impossible to switch vCPU contexts while running in hypervisor > mode. Only one place where scheduling decision can be made and one > vCPU can be replaced with another is the exit path from the hypervisor > mode. The one exception are Idle vCPUs, which never leaves the > hypervisor mode for obvious reasons. > > This leads to a number of problems. This list is not comprehensive. It > lists only things that I or my colleagues encountered personally. > > Long-running hypercalls. Due to nature of some hypercalls they can > execute for arbitrary long time. Mostly those are calls that deal with > long list of similar actions, like memory pages processing. To deal > with this issue Xen employs most horrific technique called "hypercall > continuation". I agree the code is not nice. However, it does serve another purpose than ... > When code that handles hypercall decides that it should > be preempted, it basically updates the hypercall parameters, and moves > guest PC one instruction back. This causes guest to re-execute the > hypercall with altered parameters, which will allow hypervisor to > continue hypercall execution later. ... just rescheduling the vCPU. It will also give the opportunity for the guest to handle interrupts. If you don't return to the guest, then risk to get an RCU sched stall on that the vCPU (some hypercalls can take really really long). > This approach itself have obvious > problems: code that executes hypercall is responsible for preemption, > preemption checks are infrequent (because they are costly by > themselves), hypercall execution state is stored in guest-controlled > area, we rely on guest's good will to continue the hypercall. Why is it a problem to rely on guest's good will? The hypercalls should be preempted at a boundary that is safe to continue. > All this > imposes restrictions on which hypercalls can be preempted, when they > can be preempted and how to write hypercall handlers. Also, it > requires very accurate coding and already led to at least one > vulnerability - XSA-318. Some hypercalls can not be preempted at all, > like the one mentioned in [1]. > > Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs, > which are supposed to run when the system is idle. If hypervisor needs > to execute own tasks that are required to run right now, it have no > other way than to execute them on current vCPU. But scheduler does not > know that hypervisor executes hypervisor task and accounts spent time > to a domain. This can lead to domain starvation. > > Also, absence of hypervisor threads leads to absence of high-level > synchronization primitives like mutexes, conditional variables, > completions, etc. This leads to two problems: we need to use spinlocks > everywhere and we have problems when porting device drivers from linux > kernel. > > Proposed solution > ================= > > It is quite obvious that to fix problems above we need to allow > preemption in hypervisor mode. 
I am not familiar with x86 side, but > for the ARM it was surprisingly easy to implement. Basically, vCPU > context in hypervisor mode is determined by its stack at general > purpose registers. And __context_switch() function perfectly switches > them when running in hypervisor mode. So there are no hard > restrictions, why it should be called only in leave_hypervisor() path. > > The obvious question is: when we should to try to preempt running > vCPU? And answer is: when there was an external event. This means > that we should try to preempt only when there was an interrupt request > where we are running in hypervisor mode. On ARM, in this case function > do_trap_irq() is called. Problem is that IRQ handler can be called > when vCPU is already in atomic state (holding spinlock, for > example). In this case we should try to preempt right after leaving > atomic state. This is basically all the idea behind this PoC. > > Now, about the series composition. > Patches > > sched: core: save IRQ state during locking > sched: rt: save IRQ state during locking > sched: credit2: save IRQ state during locking > preempt: use atomic_t to for preempt_count > arm: setup: disable preemption during startup > arm: context_switch: allow to run with IRQs already disabled > > prepare the groundwork for the rest of PoC. It appears that not all > code is ready to be executed in IRQ state, and schedule() now can be > called at end of do_trap_irq(), which technically is considered IRQ > handler state. Also, it is unwise to try preempt things when we are > still booting, so ween to enable atomic context during the boot > process. I am really surprised that this is the only changes necessary in Xen. For a first approach, we may want to be conservative when the preemption is happening as I am not convinced that all the places are safe to preempt. > > Patches > preempt: add try_preempt() function > sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] > arm: traps: try to preempt before leaving IRQ handler > > are basically the core of this PoC. try_preempt() function tries to > preempt vCPU when either called by IRQ handler and when leaving atomic > state. Scheduler now enters atomic state to ensure that it will not > preempt self. do_trap_irq() calls try_preempt() to initiate preemption. AFAICT, try_preempt() will deal with the rescheduling. But how about softirqs? Don't we want to handle them in try_preempt() as well? [...] > Conclusion > ========== > > My main intention is to begin discussion of hypervisor preemption. As > I showed, it is doable right away and provides some immediate > benefits. I do understand that proper implementation requires much > more efforts. But we are ready to do this work if community is > interested in it. > > Just to reiterate main benefits: > > 1. More controllable latency. On embedded systems customers care about > such things. Is the plan to only offer preemptible Xen? > > 2. We can get rid of hypercall continuations, which will results in > simpler and more secure code. I don't think you can get rid of it completely without risking the OS to receive RCU sched stall. So you would need to handle them hypercalls differently. > > 3. We can implement proper hypervisor threads, mutexes, completions > and so on. This will make scheduling more accurate, ease up linux > drivers porting and implementation of more complex features in the > hypervisor. 
> > > > [1] https://marc.info/?l=xen-devel&m=161049529916656&w=2 > > Volodymyr Babchuk (10): > sched: core: save IRQ state during locking > sched: rt: save IRQ state during locking > sched: credit2: save IRQ state during locking > preempt: use atomic_t to for preempt_count > preempt: add try_preempt() function > arm: setup: disable preemption during startup > sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] > arm: context_switch: allow to run with IRQs already disabled > arm: traps: try to preempt before leaving IRQ handler > [HACK] alloc pages: enable preemption early > > xen/arch/arm/domain.c | 18 ++++++++++----- > xen/arch/arm/setup.c | 4 ++++ > xen/arch/arm/traps.c | 7 ++++++ > xen/common/memory.c | 4 ++-- > xen/common/page_alloc.c | 21 ++--------------- > xen/common/preempt.c | 36 ++++++++++++++++++++++++++--- > xen/common/sched/core.c | 46 +++++++++++++++++++++++--------------- > xen/common/sched/credit2.c | 5 +++-- > xen/common/sched/rt.c | 10 +++++---- > xen/include/xen/preempt.h | 17 +++++++++----- > 10 files changed, 109 insertions(+), 59 deletions(-) > Cheers,
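For reference, the "hypercall continuation" pattern debated above looks roughly like the sketch below. The loop body, the argument encoding and process_one_extent() are illustrative placeholders; hypercall_preempt_check() and hypercall_create_continuation() are the real Xen primitives involved.

    /*
     * Sketch of the continuation pattern: a long-running operation
     * periodically polls for pending work and, instead of being switched
     * out in place, re-encodes its progress into the hypercall arguments.
     * The guest PC is wound back, so the guest re-issues the call and the
     * handler resumes from 'i'.
     */
    static long example_memory_op(unsigned long cmd, unsigned long start,
                                  unsigned long nr_extents)
    {
        unsigned long i;

        for ( i = start; i < nr_extents; i++ )
        {
            /* The check is not free, so it is done only every so often. */
            if ( i != start && !(i & 0x3f) && hypercall_preempt_check() )
                return hypercall_create_continuation(
                    __HYPERVISOR_memory_op, "ll", cmd, i);

            process_one_extent(cmd, i);   /* placeholder for the real work */
        }

        return 0;
    }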
Hi Julien, Julien Grall writes: > On 23/02/2021 02:34, Volodymyr Babchuk wrote: >> Hello community, > > Hi Volodymyr, > > Thank you for the proposal, I like the like of been able to preempt > the vCPU thread. This would make easier to implement some of the > device emulation in Xen (e.g. vGIC, SMMU). Yes, emulation is the other topic that I didn't mentioned. Also, it could lift some restrictions in OP-TEE mediator code as well. >> Subject of this cover letter is quite self-explanatory. This patch >> series implements PoC for preemption in hypervisor mode. >> This is the sort of follow-up to recent discussion about latency >> ([1]). >> Motivation >> ========== >> It is well known that Xen is not preemptable. On other words, it is >> impossible to switch vCPU contexts while running in hypervisor >> mode. Only one place where scheduling decision can be made and one >> vCPU can be replaced with another is the exit path from the hypervisor >> mode. The one exception are Idle vCPUs, which never leaves the >> hypervisor mode for obvious reasons. >> This leads to a number of problems. This list is not >> comprehensive. It >> lists only things that I or my colleagues encountered personally. >> Long-running hypercalls. Due to nature of some hypercalls they can >> execute for arbitrary long time. Mostly those are calls that deal with >> long list of similar actions, like memory pages processing. To deal >> with this issue Xen employs most horrific technique called "hypercall >> continuation". > > I agree the code is not nice. However, it does serve another purpose > than ... > >> When code that handles hypercall decides that it should >> be preempted, it basically updates the hypercall parameters, and moves >> guest PC one instruction back. This causes guest to re-execute the >> hypercall with altered parameters, which will allow hypervisor to >> continue hypercall execution later. > > ... just rescheduling the vCPU. It will also give the opportunity for > the guest to handle interrupts. > > If you don't return to the guest, then risk to get an RCU sched stall > on that the vCPU (some hypercalls can take really really long). Ah yes, you are right. I'd only wish that hypervisor saved context of hypercall on it's side... I have example of OP-TEE before my eyes. They have special return code "task was interrupted" and they use separate call "continue execution of interrupted task", which takes opaque context handle as a parameter. With this approach state of interrupted call never leaks to rest of the system. > >> This approach itself have obvious >> problems: code that executes hypercall is responsible for preemption, >> preemption checks are infrequent (because they are costly by >> themselves), hypercall execution state is stored in guest-controlled >> area, we rely on guest's good will to continue the hypercall. > > Why is it a problem to rely on guest's good will? The hypercalls > should be preempted at a boundary that is safe to continue. Yes, and it imposes restrictions on how to write hypercall handler. In other words, there are much more places in hypercall handler code where it can be preempted than where hypercall continuation can be used. For example, you can preempt hypercall that holds a mutex, but of course you can't create an continuation point in such place. >> All this >> imposes restrictions on which hypercalls can be preempted, when they >> can be preempted and how to write hypercall handlers. 
Also, it >> requires very accurate coding and already led to at least one >> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >> like the one mentioned in [1]. >> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle >> vCPUs, >> which are supposed to run when the system is idle. If hypervisor needs >> to execute own tasks that are required to run right now, it have no >> other way than to execute them on current vCPU. But scheduler does not >> know that hypervisor executes hypervisor task and accounts spent time >> to a domain. This can lead to domain starvation. >> Also, absence of hypervisor threads leads to absence of high-level >> synchronization primitives like mutexes, conditional variables, >> completions, etc. This leads to two problems: we need to use spinlocks >> everywhere and we have problems when porting device drivers from linux >> kernel. >> Proposed solution >> ================= >> It is quite obvious that to fix problems above we need to allow >> preemption in hypervisor mode. I am not familiar with x86 side, but >> for the ARM it was surprisingly easy to implement. Basically, vCPU >> context in hypervisor mode is determined by its stack at general >> purpose registers. And __context_switch() function perfectly switches >> them when running in hypervisor mode. So there are no hard >> restrictions, why it should be called only in leave_hypervisor() path. >> The obvious question is: when we should to try to preempt running >> vCPU? And answer is: when there was an external event. This means >> that we should try to preempt only when there was an interrupt request >> where we are running in hypervisor mode. On ARM, in this case function >> do_trap_irq() is called. Problem is that IRQ handler can be called >> when vCPU is already in atomic state (holding spinlock, for >> example). In this case we should try to preempt right after leaving >> atomic state. This is basically all the idea behind this PoC. >> Now, about the series composition. >> Patches >> sched: core: save IRQ state during locking >> sched: rt: save IRQ state during locking >> sched: credit2: save IRQ state during locking >> preempt: use atomic_t to for preempt_count >> arm: setup: disable preemption during startup >> arm: context_switch: allow to run with IRQs already disabled >> prepare the groundwork for the rest of PoC. It appears that not all >> code is ready to be executed in IRQ state, and schedule() now can be >> called at end of do_trap_irq(), which technically is considered IRQ >> handler state. Also, it is unwise to try preempt things when we are >> still booting, so ween to enable atomic context during the boot >> process. > > I am really surprised that this is the only changes necessary in > Xen. For a first approach, we may want to be conservative when the > preemption is happening as I am not convinced that all the places are > safe to preempt. > Well, I can't say that I ran extensive tests, but I played with this for some time and it seemed quite stable. Of course, I had some problems with RTDS... As I see it, Xen is already supports SMP, so all places where races are possible should already be covered by spinlocks or taken into account by some other means. Places which may not be safe to preempt are clustered around task management code itself: schedulers, xen entry/exit points, vcpu creation/destruction and such. For example, for sure we do not want to destroy vCPU which was preempted in hypervisor mode. I didn't covered this case, by the way. 
>> Patches >> preempt: add try_preempt() function >> sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] >> arm: traps: try to preempt before leaving IRQ handler >> are basically the core of this PoC. try_preempt() function tries to >> preempt vCPU when either called by IRQ handler and when leaving atomic >> state. Scheduler now enters atomic state to ensure that it will not >> preempt self. do_trap_irq() calls try_preempt() to initiate preemption. > > AFAICT, try_preempt() will deal with the rescheduling. But how about > softirqs? Don't we want to handle them in try_preempt() as well? Well, yes and no. We have the following softirqs: TIMER_SOFTIRQ - should be called, I believe RCU_SOFTIRQ - I'm not sure about this, but probably no SCHED_SLAVE_SOFTIRQ - no SCHEDULE_SOFTIRQ - no NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate thread, maybe? TASKLET_SOFTIRQ - should be moved to a separate thread So, looks like only timers should be handled for sure. > > [...] > >> Conclusion >> ========== >> My main intention is to begin discussion of hypervisor >> preemption. As >> I showed, it is doable right away and provides some immediate >> benefits. I do understand that proper implementation requires much >> more efforts. But we are ready to do this work if community is >> interested in it. >> Just to reiterate main benefits: >> 1. More controllable latency. On embedded systems customers care >> about >> such things. > > Is the plan to only offer preemptible Xen? > Sorry, didn't get the question. >> 2. We can get rid of hypercall continuations, which will results in >> simpler and more secure code. > > I don't think you can get rid of it completely without risking the OS > to receive RCU sched stall. So you would need to handle them > hypercalls differently. Agree. I believe that continuation context should reside in hypervisor. Those changes are not connected to preemption per se and can be implemented separately. But we can discuss them there, of course. [...]
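To make the OP-TEE-style alternative mentioned above concrete, a hypervisor-side continuation could be shaped roughly as below. Almost every identifier here is hypothetical (only hypercall_preempt_check() and current are real); the point is only that the interrupted-call state stays out of guest-controlled memory.

    /*
     * Hypothetical sketch of an OP-TEE-style interface: the hypervisor
     * keeps the interrupted state on its own side and reports only an
     * opaque handle, which the guest passes to a separate "continue
     * interrupted call" hypercall.
     */
    struct hcall_ctx {
        uint64_t handle;           /* opaque token reported to the guest */
        unsigned long progress;    /* hypervisor-private resume point */
        /* ... whatever per-call state is needed ... */
    };

    long do_example_op(unsigned long nr_items)
    {
        struct hcall_ctx *ctx = get_or_alloc_ctx(current);    /* hypothetical */

        for ( ; ctx->progress < nr_items; ctx->progress++ )
        {
            process_item(ctx->progress);                      /* hypothetical */

            if ( hypercall_preempt_check() )
            {
                /* Report "interrupted" plus ctx->handle to the guest; it
                 * resumes through a dedicated "continue" hypercall. */
                set_interrupted_result(ctx->handle);          /* hypothetical */
                return -EAGAIN;      /* placeholder "interrupted" code */
            }
        }

        free_ctx(ctx);                                        /* hypothetical */
        return 0;
    }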
On 23/02/2021 12:06, Volodymyr Babchuk wrote: > > Hi Julien, Hi Volodymyr, > Julien Grall writes: >> On 23/02/2021 02:34, Volodymyr Babchuk wrote: >> ... just rescheduling the vCPU. It will also give the opportunity for >> the guest to handle interrupts. >> >> If you don't return to the guest, then risk to get an RCU sched stall >> on that the vCPU (some hypercalls can take really really long). > > Ah yes, you are right. I'd only wish that hypervisor saved context of > hypercall on it's side... > > I have example of OP-TEE before my eyes. They have special return code > "task was interrupted" and they use separate call "continue execution of > interrupted task", which takes opaque context handle as a > parameter. With this approach state of interrupted call never leaks to > rest of the system. Feel free to suggest a new approach for the hypercals. >> >>> This approach itself have obvious >>> problems: code that executes hypercall is responsible for preemption, >>> preemption checks are infrequent (because they are costly by >>> themselves), hypercall execution state is stored in guest-controlled >>> area, we rely on guest's good will to continue the hypercall. >> >> Why is it a problem to rely on guest's good will? The hypercalls >> should be preempted at a boundary that is safe to continue. > > Yes, and it imposes restrictions on how to write hypercall > handler. > In other words, there are much more places in hypercall handler code > where it can be preempted than where hypercall continuation can be > used. For example, you can preempt hypercall that holds a mutex, but of > course you can't create an continuation point in such place. I disagree, you can create continuation point in such place. Although it will be more complex because you have to make sure you break the work in a restartable place. I would also like to point out that preemption also have some drawbacks. With RT in mind, you have to deal with priority inversion (e.g. a lower priority vCPU held a mutex that is required by an higher priority). Outside of RT, you have to be careful where mutex are held. In your earlier answer, you suggested to held mutex for the memory allocation. If you do that, then it means a domain A can block allocation for domain B as it helds the mutex. This can lead to quite serious problem if domain A cannot run (because it exhausted its credit) for a long time. > >>> All this >>> imposes restrictions on which hypercalls can be preempted, when they >>> can be preempted and how to write hypercall handlers. Also, it >>> requires very accurate coding and already led to at least one >>> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >>> like the one mentioned in [1]. >>> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle >>> vCPUs, >>> which are supposed to run when the system is idle. If hypervisor needs >>> to execute own tasks that are required to run right now, it have no >>> other way than to execute them on current vCPU. But scheduler does not >>> know that hypervisor executes hypervisor task and accounts spent time >>> to a domain. This can lead to domain starvation. >>> Also, absence of hypervisor threads leads to absence of high-level >>> synchronization primitives like mutexes, conditional variables, >>> completions, etc. This leads to two problems: we need to use spinlocks >>> everywhere and we have problems when porting device drivers from linux >>> kernel. 
>>> Proposed solution >>> ================= >>> It is quite obvious that to fix problems above we need to allow >>> preemption in hypervisor mode. I am not familiar with x86 side, but >>> for the ARM it was surprisingly easy to implement. Basically, vCPU >>> context in hypervisor mode is determined by its stack at general >>> purpose registers. And __context_switch() function perfectly switches >>> them when running in hypervisor mode. So there are no hard >>> restrictions, why it should be called only in leave_hypervisor() path. >>> The obvious question is: when we should to try to preempt running >>> vCPU? And answer is: when there was an external event. This means >>> that we should try to preempt only when there was an interrupt request >>> where we are running in hypervisor mode. On ARM, in this case function >>> do_trap_irq() is called. Problem is that IRQ handler can be called >>> when vCPU is already in atomic state (holding spinlock, for >>> example). In this case we should try to preempt right after leaving >>> atomic state. This is basically all the idea behind this PoC. >>> Now, about the series composition. >>> Patches >>> sched: core: save IRQ state during locking >>> sched: rt: save IRQ state during locking >>> sched: credit2: save IRQ state during locking >>> preempt: use atomic_t to for preempt_count >>> arm: setup: disable preemption during startup >>> arm: context_switch: allow to run with IRQs already disabled >>> prepare the groundwork for the rest of PoC. It appears that not all >>> code is ready to be executed in IRQ state, and schedule() now can be >>> called at end of do_trap_irq(), which technically is considered IRQ >>> handler state. Also, it is unwise to try preempt things when we are >>> still booting, so ween to enable atomic context during the boot >>> process. >> >> I am really surprised that this is the only changes necessary in >> Xen. For a first approach, we may want to be conservative when the >> preemption is happening as I am not convinced that all the places are >> safe to preempt. >> > > Well, I can't say that I ran extensive tests, but I played with this for > some time and it seemed quite stable. Of course, I had some problems > with RTDS... > > As I see it, Xen is already supports SMP, so all places where races are > possible should already be covered by spinlocks or taken into account by > some other means. That's correct for shared resources. I am more worried for any hypercalls that expected to run more or less continuously (e.g not taking into account interrupt) on the same pCPU. > > Places which may not be safe to preempt are clustered around task > management code itself: schedulers, xen entry/exit points, vcpu > creation/destruction and such. > > For example, for sure we do not want to destroy vCPU which was preempted > in hypervisor mode. I didn't covered this case, by the way. > >>> Patches >>> preempt: add try_preempt() function >>> sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] >>> arm: traps: try to preempt before leaving IRQ handler >>> are basically the core of this PoC. try_preempt() function tries to >>> preempt vCPU when either called by IRQ handler and when leaving atomic >>> state. Scheduler now enters atomic state to ensure that it will not >>> preempt self. do_trap_irq() calls try_preempt() to initiate preemption. >> >> AFAICT, try_preempt() will deal with the rescheduling. But how about >> softirqs? Don't we want to handle them in try_preempt() as well? > > Well, yes and no. 
We have the following softirqs: > > TIMER_SOFTIRQ - should be called, I believe > RCU_SOFTIRQ - I'm not sure about this, but probably no When would you call RCU callback then? > SCHED_SLAVE_SOFTIRQ - no > SCHEDULE_SOFTIRQ - no > NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate > thread, maybe? > TASKLET_SOFTIRQ - should be moved to a separate thread > > So, looks like only timers should be handled for sure. > >> >> [...] >> >>> Conclusion >>> ========== >>> My main intention is to begin discussion of hypervisor >>> preemption. As >>> I showed, it is doable right away and provides some immediate >>> benefits. I do understand that proper implementation requires much >>> more efforts. But we are ready to do this work if community is >>> interested in it. >>> Just to reiterate main benefits: >>> 1. More controllable latency. On embedded systems customers care >>> about >>> such things. >> >> Is the plan to only offer preemptible Xen? >> > > Sorry, didn't get the question. What's your plan for the preemption support? Will an admin be able to configure Xen to be either preemptible or not? Cheers,
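Since the patches themselves are not quoted in this thread, here is a hypothetical reconstruction of the try_preempt()/preempt_count mechanism discussed above, assuming Linux-style atomic and per-CPU helpers; sched_preempt() and the exact bookkeeping are assumptions, not the actual PoC code.

    /*
     * Hypothetical sketch: an atomic per-CPU preemption counter plus a
     * try_preempt() invoked from the IRQ exit path and when leaving an
     * atomic section.
     */
    DEFINE_PER_CPU(atomic_t, hyp_preempt_count);
    DEFINE_PER_CPU(bool, hyp_preempt_pending);

    void try_preempt(void)
    {
        /* Never context-switch inside an atomic section (or during boot). */
        if ( atomic_read(&this_cpu(hyp_preempt_count)) )
        {
            this_cpu(hyp_preempt_pending) = true;  /* retried on enable */
            return;
        }

        this_cpu(hyp_preempt_pending) = false;
        /*
         * Pick the next vCPU and switch to it via __context_switch() even
         * though we are still in hypervisor mode -- the core idea of the
         * PoC.
         */
        sched_preempt();                           /* assumed scheduling hook */
    }

    #define hyp_preempt_enable() do {                              \
        if ( atomic_dec_and_test(&this_cpu(hyp_preempt_count)) &&  \
             this_cpu(hyp_preempt_pending) )                       \
            try_preempt();                                         \
    } while ( 0 )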
On 23/02/2021 02:34, Volodymyr Babchuk wrote: > Hello community, > > Subject of this cover letter is quite self-explanatory. This patch > series implements PoC for preemption in hypervisor mode. > > This is the sort of follow-up to recent discussion about latency > ([1]). > > Motivation > ========== > > It is well known that Xen is not preemptable. On other words, it is > impossible to switch vCPU contexts while running in hypervisor > mode. Only one place where scheduling decision can be made and one > vCPU can be replaced with another is the exit path from the hypervisor > mode. The one exception are Idle vCPUs, which never leaves the > hypervisor mode for obvious reasons. > > This leads to a number of problems. This list is not comprehensive. It > lists only things that I or my colleagues encountered personally. > > Long-running hypercalls. Due to nature of some hypercalls they can > execute for arbitrary long time. Mostly those are calls that deal with > long list of similar actions, like memory pages processing. To deal > with this issue Xen employs most horrific technique called "hypercall > continuation". When code that handles hypercall decides that it should > be preempted, it basically updates the hypercall parameters, and moves > guest PC one instruction back. This causes guest to re-execute the > hypercall with altered parameters, which will allow hypervisor to > continue hypercall execution later. This approach itself have obvious > problems: code that executes hypercall is responsible for preemption, > preemption checks are infrequent (because they are costly by > themselves), hypercall execution state is stored in guest-controlled > area, we rely on guest's good will to continue the hypercall. All this > imposes restrictions on which hypercalls can be preempted, when they > can be preempted and how to write hypercall handlers. Also, it > requires very accurate coding and already led to at least one > vulnerability - XSA-318. Some hypercalls can not be preempted at all, > like the one mentioned in [1]. > > Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs, > which are supposed to run when the system is idle. If hypervisor needs > to execute own tasks that are required to run right now, it have no > other way than to execute them on current vCPU. But scheduler does not > know that hypervisor executes hypervisor task and accounts spent time > to a domain. This can lead to domain starvation. > > Also, absence of hypervisor threads leads to absence of high-level > synchronization primitives like mutexes, conditional variables, > completions, etc. This leads to two problems: we need to use spinlocks > everywhere and we have problems when porting device drivers from linux > kernel. You cannot reenter a guest, even to deliver interrupts, if pre-empted at an arbitrary point in a hypercall. State needs unwinding suitably. Xen's non-preemptible-ness is designed to specifically force you to not implement long-running hypercalls which would interfere with timely interrupt handling in the general case. Hypervisor/virt properties are different to both a kernel-only-RTOS, and regular usespace. This was why I gave you some specific extra scenarios to do latency testing with, so you could make a fair comparison of "extra overhead caused by Xen" separate from "overhead due to fundamental design constraints of using virt". 
Preemption like this will make some benchmarks look better, but it also introduces the ability to create fundamental problems, like preventing any interrupt delivery into a VM for seconds of wallclock time while each vcpu happens to be in a long-running hypercall.

If you want timely interrupt handling, you either need to partition your workloads by the long-running-ness of their hypercalls, or not have long-running hypercalls.

I remain unconvinced that preemption is a sensible fix to the problem you're trying to solve.

~Andrew
Hi Julien, Julien Grall writes: > On 23/02/2021 12:06, Volodymyr Babchuk wrote: >> Hi Julien, > > Hi Volodymyr, > >> Julien Grall writes: >>> On 23/02/2021 02:34, Volodymyr Babchuk wrote: >>> ... just rescheduling the vCPU. It will also give the opportunity for >>> the guest to handle interrupts. >>> >>> If you don't return to the guest, then risk to get an RCU sched stall >>> on that the vCPU (some hypercalls can take really really long). >> Ah yes, you are right. I'd only wish that hypervisor saved context >> of >> hypercall on it's side... >> I have example of OP-TEE before my eyes. They have special return >> code >> "task was interrupted" and they use separate call "continue execution of >> interrupted task", which takes opaque context handle as a >> parameter. With this approach state of interrupted call never leaks to > rest of the system. > > Feel free to suggest a new approach for the hypercals. > I believe, I suggested it right above. There are some corner cases, that should be addressed, of course. >>> >>>> This approach itself have obvious >>>> problems: code that executes hypercall is responsible for preemption, >>>> preemption checks are infrequent (because they are costly by >>>> themselves), hypercall execution state is stored in guest-controlled >>>> area, we rely on guest's good will to continue the hypercall. >>> >>> Why is it a problem to rely on guest's good will? The hypercalls >>> should be preempted at a boundary that is safe to continue. >> Yes, and it imposes restrictions on how to write hypercall >> handler. >> In other words, there are much more places in hypercall handler code >> where it can be preempted than where hypercall continuation can be >> used. For example, you can preempt hypercall that holds a mutex, but of >> course you can't create an continuation point in such place. > > I disagree, you can create continuation point in such place. Although > it will be more complex because you have to make sure you break the > work in a restartable place. Maybe there is some misunderstanding. You can't create hypercall continuation point in a place where you are holding mutex. Because, there is absolutely not guarantee that guest will restart the hypercall. But you can preempt vCPU while holding mutex, because xen owns scheduler and it can guarantee that vCPU will be scheduled eventually to continue the work and release mutex. > I would also like to point out that preemption also have some drawbacks. > With RT in mind, you have to deal with priority inversion (e.g. a > lower priority vCPU held a mutex that is required by an higher > priority). Of course. This is not as simple as "just call scheduler when we want to". > Outside of RT, you have to be careful where mutex are held. In your > earlier answer, you suggested to held mutex for the memory > allocation. If you do that, then it means a domain A can block > allocation for domain B as it helds the mutex. As long as we do not exit to a EL1 with mutex being held, domain A can't block anything. Of course, we have to deal with priority inversion, but malicious domain can't cause DoS. > This can lead to quite serious problem if domain A cannot run (because > it exhausted its credit) for a long time. > I believe, this problem is related to a priority inversion problem and they should be addressed together. >> >>>> All this >>>> imposes restrictions on which hypercalls can be preempted, when they >>>> can be preempted and how to write hypercall handlers. 
Also, it >>>> requires very accurate coding and already led to at least one >>>> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >>>> like the one mentioned in [1]. >>>> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle >>>> vCPUs, >>>> which are supposed to run when the system is idle. If hypervisor needs >>>> to execute own tasks that are required to run right now, it have no >>>> other way than to execute them on current vCPU. But scheduler does not >>>> know that hypervisor executes hypervisor task and accounts spent time >>>> to a domain. This can lead to domain starvation. >>>> Also, absence of hypervisor threads leads to absence of high-level >>>> synchronization primitives like mutexes, conditional variables, >>>> completions, etc. This leads to two problems: we need to use spinlocks >>>> everywhere and we have problems when porting device drivers from linux >>>> kernel. >>>> Proposed solution >>>> ================= >>>> It is quite obvious that to fix problems above we need to allow >>>> preemption in hypervisor mode. I am not familiar with x86 side, but >>>> for the ARM it was surprisingly easy to implement. Basically, vCPU >>>> context in hypervisor mode is determined by its stack at general >>>> purpose registers. And __context_switch() function perfectly switches >>>> them when running in hypervisor mode. So there are no hard >>>> restrictions, why it should be called only in leave_hypervisor() path. >>>> The obvious question is: when we should to try to preempt running >>>> vCPU? And answer is: when there was an external event. This means >>>> that we should try to preempt only when there was an interrupt request >>>> where we are running in hypervisor mode. On ARM, in this case function >>>> do_trap_irq() is called. Problem is that IRQ handler can be called >>>> when vCPU is already in atomic state (holding spinlock, for >>>> example). In this case we should try to preempt right after leaving >>>> atomic state. This is basically all the idea behind this PoC. >>>> Now, about the series composition. >>>> Patches >>>> sched: core: save IRQ state during locking >>>> sched: rt: save IRQ state during locking >>>> sched: credit2: save IRQ state during locking >>>> preempt: use atomic_t to for preempt_count >>>> arm: setup: disable preemption during startup >>>> arm: context_switch: allow to run with IRQs already disabled >>>> prepare the groundwork for the rest of PoC. It appears that not all >>>> code is ready to be executed in IRQ state, and schedule() now can be >>>> called at end of do_trap_irq(), which technically is considered IRQ >>>> handler state. Also, it is unwise to try preempt things when we are >>>> still booting, so ween to enable atomic context during the boot >>>> process. >>> >>> I am really surprised that this is the only changes necessary in >>> Xen. For a first approach, we may want to be conservative when the >>> preemption is happening as I am not convinced that all the places are >>> safe to preempt. >>> >> Well, I can't say that I ran extensive tests, but I played with this >> for >> some time and it seemed quite stable. Of course, I had some problems >> with RTDS... >> As I see it, Xen is already supports SMP, so all places where races >> are >> possible should already be covered by spinlocks or taken into account by >> some other means. > That's correct for shared resources. I am more worried for any > hypercalls that expected to run more or less continuously (e.g not > taking into account interrupt) on the same pCPU. 
Are there many such hypercalls? They can disable preemption if they really need to run on the same pCPU. As I understand, they should be relatively fast, because they can't create continuations anyway. >> Places which may not be safe to preempt are clustered around task >> management code itself: schedulers, xen entry/exit points, vcpu >> creation/destruction and such. >> For example, for sure we do not want to destroy vCPU which was >> preempted >> in hypervisor mode. I didn't covered this case, by the way. >> >>>> Patches >>>> preempt: add try_preempt() function >>>> sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] >>>> arm: traps: try to preempt before leaving IRQ handler >>>> are basically the core of this PoC. try_preempt() function tries to >>>> preempt vCPU when either called by IRQ handler and when leaving atomic >>>> state. Scheduler now enters atomic state to ensure that it will not >>>> preempt self. do_trap_irq() calls try_preempt() to initiate preemption. >>> >>> AFAICT, try_preempt() will deal with the rescheduling. But how about >>> softirqs? Don't we want to handle them in try_preempt() as well? >> Well, yes and no. We have the following softirqs: >> TIMER_SOFTIRQ - should be called, I believe >> RCU_SOFTIRQ - I'm not sure about this, but probably no > > When would you call RCU callback then? > I'm not sure there. But I think, they should be called in the same place as always: while leaving hypervisor. But I'm not very familiar with RCU, so I may talk nonsense. >> SCHED_SLAVE_SOFTIRQ - no >> SCHEDULE_SOFTIRQ - no >> NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate >> thread, maybe? >> TASKLET_SOFTIRQ - should be moved to a separate thread > >> So, looks like only timers should be handled for sure. >> >>> >>> [...] >>> >>>> Conclusion >>>> ========== >>>> My main intention is to begin discussion of hypervisor >>>> preemption. As >>>> I showed, it is doable right away and provides some immediate >>>> benefits. I do understand that proper implementation requires much >>>> more efforts. But we are ready to do this work if community is >>>> interested in it. >>>> Just to reiterate main benefits: >>>> 1. More controllable latency. On embedded systems customers care >>>> about >>>> such things. >>> >>> Is the plan to only offer preemptible Xen? >>> >> Sorry, didn't get the question. > > What's your plan for the preemption support? Will an admin be able to > configure Xen to be either preemptible or not? Honestly, it would be much easier to enable it unconditionally. But I understand, that this is not feasible. So, I'm looking at a build-time option.
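A hedged illustration of the split sketched above: consume only the softirqs that are wanted on the preemption path and leave the rest to the usual leave-hypervisor path or to dedicated threads. The handler call is a made-up name, since Xen's real TIMER_SOFTIRQ action is a static function registered via open_softirq().

    /* Hypothetical helper for the preemption path. */
    static void preempt_path_softirqs(void)
    {
        unsigned int cpu = smp_processor_id();

        if ( test_and_clear_bit(TIMER_SOFTIRQ, &softirq_pending(cpu)) )
            run_timer_softirq();    /* made-up name for the timer action */

        /* RCU_SOFTIRQ, TASKLET_SOFTIRQ, SCHED*_SOFTIRQ: untouched here. */
    }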
On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> wrote: > > > Hi Julien, > > Julien Grall writes: > > > On 23/02/2021 12:06, Volodymyr Babchuk wrote: > >> Hi Julien, > > > > Hi Volodymyr, > > > >> Julien Grall writes: > >>> On 23/02/2021 02:34, Volodymyr Babchuk wrote: > >>> ... just rescheduling the vCPU. It will also give the opportunity for > >>> the guest to handle interrupts. > >>> > >>> If you don't return to the guest, then risk to get an RCU sched stall > >>> on that the vCPU (some hypercalls can take really really long). > >> Ah yes, you are right. I'd only wish that hypervisor saved context > >> of > >> hypercall on it's side... > >> I have example of OP-TEE before my eyes. They have special return > >> code > >> "task was interrupted" and they use separate call "continue execution of > >> interrupted task", which takes opaque context handle as a > >> parameter. With this approach state of interrupted call never leaks to > rest of the system. > > > > Feel free to suggest a new approach for the hypercals. > > > > I believe, I suggested it right above. There are some corner cases, that > should be addressed, of course. If we wanted a clean break, then possibly yes. But I meant one that doesn't break all the existing users and doesn't put Xen at risk. I don't believe your approach fulfill it. > > >>> > >>>> This approach itself have obvious > >>>> problems: code that executes hypercall is responsible for preemption, > >>>> preemption checks are infrequent (because they are costly by > >>>> themselves), hypercall execution state is stored in guest-controlled > >>>> area, we rely on guest's good will to continue the hypercall. > >>> > >>> Why is it a problem to rely on guest's good will? The hypercalls > >>> should be preempted at a boundary that is safe to continue. > >> Yes, and it imposes restrictions on how to write hypercall > >> handler. > >> In other words, there are much more places in hypercall handler code > >> where it can be preempted than where hypercall continuation can be > >> used. For example, you can preempt hypercall that holds a mutex, but of > >> course you can't create an continuation point in such place. > > > > I disagree, you can create continuation point in such place. Although > > it will be more complex because you have to make sure you break the > > work in a restartable place. > > Maybe there is some misunderstanding. You can't create hypercall > continuation point in a place where you are holding mutex. Because, > there is absolutely not guarantee that guest will restart the > hypercall. I don't think we are disagreeing here. My point is you should rarely need to hold a mutex for a long period, so you could break your work in smaller chunk. In which cases, you can use hypercall continuation. > > But you can preempt vCPU while holding mutex, because xen owns scheduler > and it can guarantee that vCPU will be scheduled eventually to continue > the work and release mutex. The problem is the "eventually". If you are accounting the time spent in the hypervisor to the vCPU A, then there is a possibility that it has exhausted its time slice. In which case, your vCPU A may be sleeping for a while with a mutex held. If another vCPU B needs the mutex, it will either have to wait potentially for a long time or we need to force vCPU A to run on borrowed time. > > > I would also like to point out that preemption also have some drawbacks. > > With RT in mind, you have to deal with priority inversion (e.g. 
a > > lower priority vCPU held a mutex that is required by an higher > > priority). > > Of course. This is not as simple as "just call scheduler when we want > to". Your e-mail made it sounds like it was easy to add preemption in Xen. ;) > > > Outside of RT, you have to be careful where mutex are held. In your > > earlier answer, you suggested to held mutex for the memory > > allocation. If you do that, then it means a domain A can block > > allocation for domain B as it helds the mutex. > > As long as we do not exit to a EL1 with mutex being held, domain A can't > block anything. Of course, we have to deal with priority inversion, but > malicious domain can't cause DoS. It is not really a priority inversion problem outside of RT because all the tasks will have the same priority. It is more a time accounting problem because each vCPU may have a different number of credits. > >>> I am really surprised that this is the only changes necessary in > >>> Xen. For a first approach, we may want to be conservative when the > >>> preemption is happening as I am not convinced that all the places are > >>> safe to preempt. > >>> > >> Well, I can't say that I ran extensive tests, but I played with this > >> for > >> some time and it seemed quite stable. Of course, I had some problems > >> with RTDS... > >> As I see it, Xen is already supports SMP, so all places where races > >> are > >> possible should already be covered by spinlocks or taken into account by > >> some other means. > > That's correct for shared resources. I am more worried for any > > hypercalls that expected to run more or less continuously (e.g not > > taking into account interrupt) on the same pCPU. > > Are there many such hypercalls? They can disable preemption if they > really need to run on the same pCPU. As I understand, they should be > relatively fast, because they can't create continuations anyway. Well, I never tried to make Xen preemptible... My comment is based on the fact that the use preempt_{enable, disable}() was mostly done on a best effort basis. The usual suspects are anything using this_cpu() or interacting with the per-CPU HW registers. From a quick look here a few things (only looked at Arm): * map_domain_page() in particular on arm32 because this is using per-CPU page-tables * guest_atomics_* as this uses this_cpu() * virt_to_mfn() in particular the failure path * Incorrect use (or missing) rcu locking. (Hopefully Juergen's recent work in the RCU mitigate the risk) I can provide guidance, but you will have to go through the code and check what's happening. Cheers,
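The common shape of the hazards listed above can be shown with a small made-up example (example_state is not a real Xen variable): any code that caches this_cpu() state, or otherwise relies on staying on the same pCPU, has to be bracketed once arbitrary preemption is allowed.

    struct example_state {
        unsigned long counter;
    };
    DEFINE_PER_CPU(struct example_state, example_state);   /* illustration only */

    void touch_per_cpu_state(void)
    {
        struct example_state *st;

        preempt_disable();           /* pin ourselves to this pCPU */

        st = &this_cpu(example_state);
        st->counter++;               /* must finish on the same pCPU: being
                                      * migrated here would update the wrong
                                      * pCPU's copy */

        preempt_enable();            /* preemption may happen again */
    }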
Hi Andrew, Andrew Cooper writes: > On 23/02/2021 02:34, Volodymyr Babchuk wrote: >> Hello community, >> >> Subject of this cover letter is quite self-explanatory. This patch >> series implements PoC for preemption in hypervisor mode. >> >> This is the sort of follow-up to recent discussion about latency >> ([1]). >> >> Motivation >> ========== >> >> It is well known that Xen is not preemptable. On other words, it is >> impossible to switch vCPU contexts while running in hypervisor >> mode. Only one place where scheduling decision can be made and one >> vCPU can be replaced with another is the exit path from the hypervisor >> mode. The one exception are Idle vCPUs, which never leaves the >> hypervisor mode for obvious reasons. >> >> This leads to a number of problems. This list is not comprehensive. It >> lists only things that I or my colleagues encountered personally. >> >> Long-running hypercalls. Due to nature of some hypercalls they can >> execute for arbitrary long time. Mostly those are calls that deal with >> long list of similar actions, like memory pages processing. To deal >> with this issue Xen employs most horrific technique called "hypercall >> continuation". When code that handles hypercall decides that it should >> be preempted, it basically updates the hypercall parameters, and moves >> guest PC one instruction back. This causes guest to re-execute the >> hypercall with altered parameters, which will allow hypervisor to >> continue hypercall execution later. This approach itself have obvious >> problems: code that executes hypercall is responsible for preemption, >> preemption checks are infrequent (because they are costly by >> themselves), hypercall execution state is stored in guest-controlled >> area, we rely on guest's good will to continue the hypercall. All this >> imposes restrictions on which hypercalls can be preempted, when they >> can be preempted and how to write hypercall handlers. Also, it >> requires very accurate coding and already led to at least one >> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >> like the one mentioned in [1]. >> >> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs, >> which are supposed to run when the system is idle. If hypervisor needs >> to execute own tasks that are required to run right now, it have no >> other way than to execute them on current vCPU. But scheduler does not >> know that hypervisor executes hypervisor task and accounts spent time >> to a domain. This can lead to domain starvation. >> >> Also, absence of hypervisor threads leads to absence of high-level >> synchronization primitives like mutexes, conditional variables, >> completions, etc. This leads to two problems: we need to use spinlocks >> everywhere and we have problems when porting device drivers from linux >> kernel. > > You cannot reenter a guest, even to deliver interrupts, if pre-empted at > an arbitrary point in a hypercall. State needs unwinding suitably. > Yes, Julien pointed this to me already. So, looks like hypercall continuations are still needed. > Xen's non-preemptible-ness is designed to specifically force you to not > implement long-running hypercalls which would interfere with timely > interrupt handling in the general case. What if long-running hypercalls are still required? There are other options, like async calls, for example. > Hypervisor/virt properties are different to both a kernel-only-RTOS, and > regular usespace. 
This was why I gave you some specific extra scenarios > to do latency testing with, so you could make a fair comparison of > "extra overhead caused by Xen" separate from "overhead due to > fundamental design constraints of using virt". I can't see any fundamental constraints there. I see how virtualization architecture can influence context switch time: how many actions you need to switch one vCPU to another. I have in mind low level things there: reprogram MMU to use another set of tables, reprogram your interrupt controller, timer, etc. Of course, you can't get latency lower that context switch time. This is the only fundamental constraint I can see. But all other things are debatable. As for latency testing, I'm not interested in absolute times per se. I already determined that time needed to switch vCPU context on my machine is about 9us. It is fine for me. I am interested in a (semi-)guaranteed time of reaction. And Xen is doing quite well in most cases. But there are other cases, when long-lasting hypercalls cause spikes in time of reaction. > Preemption like this will make some benchmarks look better, but it also > introduces the ability to create fundamental problems, like preventing > any interrupt delivery into a VM for seconds of wallclock time while > each vcpu happens to be in a long-running hypercall. > > If you want timely interrupt handling, you either need to partition your > workloads by the long-running-ness of their hypercalls, or not have > long-running hypercalls. ... or do long-running tasks asynchronously. I believe, for most domctls and sysctls there is no need to hold calling vCPU in hypervisor mode at all. > I remain unconvinced that preemption is an sensible fix to the problem > you're trying to solve. Well, this is the purpose of this little experiment. I want to discuss different approaches and to estimate amount of required efforts. By the way, from x86 point of view, how hard to switch vCPU context while it is running in hypervisor mode?
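As an entirely hypothetical shape for the "do long-running tasks asynchronously" idea above: the calling vCPU is released immediately and the work is carried out by a hypervisor-owned context, with the toolstack collecting the result later. None of these helpers or fields exist in Xen; only struct xen_domctl and current are real.

    /* Hypothetical sketch -- not an existing Xen interface. */
    long do_example_domctl_async(struct xen_domctl *op)
    {
        uint64_t token = alloc_async_token(current->domain);    /* hypothetical */

        /* Hand the bulk of the work to a hypervisor thread/tasklet. */
        queue_hypervisor_work(async_domctl_worker, op, token);  /* hypothetical */

        /*
         * Return right away; the toolstack later polls (or receives an
         * event-channel notification) with the token to fetch the result.
         */
        op->u.async_token = token;                               /* hypothetical field */
        return 0;
    }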
Julien Grall writes:

> On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk
> <Volodymyr_Babchuk@epam.com> wrote:
>>
>> Hi Julien,
>>
>> Julien Grall writes:
>>
>>> On 23/02/2021 12:06, Volodymyr Babchuk wrote:
>>>> Hi Julien,
>>>
>>> Hi Volodymyr,
>>>
>>>> Julien Grall writes:
>>>>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>>>>> ... just rescheduling the vCPU. It will also give the opportunity for
>>>>> the guest to handle interrupts.
>>>>>
>>>>> If you don't return to the guest, then risk to get an RCU sched stall
>>>>> on that the vCPU (some hypercalls can take really really long).
>>>>
>>>> Ah yes, you are right. I'd only wish that hypervisor saved context of
>>>> hypercall on its side...
>>>>
>>>> I have example of OP-TEE before my eyes. They have special return code
>>>> "task was interrupted" and they use separate call "continue execution of
>>>> interrupted task", which takes opaque context handle as a parameter.
>>>> With this approach state of interrupted call never leaks to rest of the
>>>> system.
>>>
>>> Feel free to suggest a new approach for the hypercalls.
>>
>> I believe, I suggested it right above. There are some corner cases, that
>> should be addressed, of course.
>
> If we wanted a clean break, then possibly yes. But I meant one that doesn't
> break all the existing users and doesn't put Xen at risk.
>
> I don't believe your approach fulfill it.

Of course, we can't touch any hypercalls that are part of the stable ABI.
But if I got this right, domctls and sysctls are not stable, so one can
change their behavior quite drastically between major releases.

>>>>>> This approach itself have obvious
>>>>>> problems: code that executes hypercall is responsible for preemption,
>>>>>> preemption checks are infrequent (because they are costly by
>>>>>> themselves), hypercall execution state is stored in guest-controlled
>>>>>> area, we rely on guest's good will to continue the hypercall.
>>>>>
>>>>> Why is it a problem to rely on guest's good will? The hypercalls
>>>>> should be preempted at a boundary that is safe to continue.
>>>>
>>>> Yes, and it imposes restrictions on how to write hypercall handler.
>>>> In other words, there are much more places in hypercall handler code
>>>> where it can be preempted than where hypercall continuation can be
>>>> used. For example, you can preempt hypercall that holds a mutex, but of
>>>> course you can't create a continuation point in such place.
>>>
>>> I disagree, you can create continuation point in such place. Although
>>> it will be more complex because you have to make sure you break the
>>> work in a restartable place.
>>
>> Maybe there is some misunderstanding. You can't create hypercall
>> continuation point in a place where you are holding mutex. Because,
>> there is absolutely no guarantee that guest will restart the
>> hypercall.
>
> I don't think we are disagreeing here. My point is you should rarely
> need to hold a mutex for a long period, so you could break your work
> in smaller chunk. In which cases, you can use hypercall continuation.

Let's put it this way: generally, you can hold a mutex for much longer than
you can hold a spinlock, and nothing catastrophic will happen if you are
preempted while holding a mutex. It is better to avoid that, of course.

>> But you can preempt vCPU while holding mutex, because xen owns scheduler
>> and it can guarantee that vCPU will be scheduled eventually to continue
>> the work and release mutex.
>
> The problem is the "eventually". If you are accounting the time spent
> in the hypervisor to the vCPU A, then there is a possibility that it
> has exhausted its time slice. In which case, your vCPU A may be
> sleeping for a while with a mutex held.
>
> If another vCPU B needs the mutex, it will either have to wait
> potentially for a long time or we need to force vCPU A to run on
> borrowed time.

Yes, of course.

>>> I would also like to point out that preemption also have some drawbacks.
>>> With RT in mind, you have to deal with priority inversion (e.g. a
>>> lower priority vCPU held a mutex that is required by an higher
>>> priority).
>>
>> Of course. This is not as simple as "just call scheduler when we want
>> to".
>
> Your e-mail made it sounds like it was easy to add preemption in
> Xen. ;)

I'm sorry for that :) Actually, there is a lot of work to do. It appeared
to me that "current" needs to be reworked, preempt_enable/disable needs to
be reworked, and per-cpu variables also should be reworked. And this is
just to ensure consistency of already existing code. And I am not
mentioning x86 support there...

>>> Outside of RT, you have to be careful where mutex are held. In your
>>> earlier answer, you suggested to held mutex for the memory
>>> allocation. If you do that, then it means a domain A can block
>>> allocation for domain B as it helds the mutex.
>>
>> As long as we do not exit to a EL1 with mutex being held, domain A can't
>> block anything. Of course, we have to deal with priority inversion, but
>> malicious domain can't cause DoS.
>
> It is not really a priority inversion problem outside of RT because
> all the tasks will have the same priority. It is more a time
> accounting problem because each vCPU may have a different number of
> credits.

Speaking of that, RTDS does not use the concept of priority at all, and
neither, of course, does ARINC653.

>>>>> I am really surprised that this is the only changes necessary in
>>>>> Xen. For a first approach, we may want to be conservative when the
>>>>> preemption is happening as I am not convinced that all the places are
>>>>> safe to preempt.
>>>>
>>>> Well, I can't say that I ran extensive tests, but I played with this for
>>>> some time and it seemed quite stable. Of course, I had some problems
>>>> with RTDS...
>>>>
>>>> As I see it, Xen is already supports SMP, so all places where races are
>>>> possible should already be covered by spinlocks or taken into account by
>>>> some other means.
>>>
>>> That's correct for shared resources. I am more worried for any
>>> hypercalls that expected to run more or less continuously (e.g not
>>> taking into account interrupt) on the same pCPU.
>>
>> Are there many such hypercalls? They can disable preemption if they
>> really need to run on the same pCPU. As I understand, they should be
>> relatively fast, because they can't create continuations anyway.
>
> Well, I never tried to make Xen preemptible... My comment is based on
> the fact that the use preempt_{enable, disable}() was mostly done on a
> best effort basis.
>
> The usual suspects are anything using this_cpu() or interacting with
> the per-CPU HW registers.
>
> From a quick look here a few things (only looked at Arm):
>  * map_domain_page() in particular on arm32 because this is using
>    per-CPU page-tables
>  * guest_atomics_* as this uses this_cpu()
>  * virt_to_mfn() in particular the failure path
>  * Incorrect use (or missing) rcu locking. (Hopefully Juergen's
>    recent work in the RCU mitigate the risk)
>
> I can provide guidance, but you will have to go through the code and
> check what's happening.

Thank you for the list. Of course, I will need to go through all of the
code. I have already had a bunch of problems with per_cpu variables...
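
To illustrate the kind of hazard in the list above: the common Xen pattern
of touching a per-CPU variable through this_cpu() silently assumes that the
code cannot be moved to another pCPU in the middle of the access. The
fragment below is only a sketch with made-up names (emulation_count,
account_emulation()); it is not existing Xen code, but it shows why such
sequences would need an explicit preempt_disable()/preempt_enable() pair
once hypervisor code becomes preemptible.

    #include <xen/percpu.h>
    #include <xen/preempt.h>

    /* Hypothetical per-CPU counter, for illustration only. */
    static DEFINE_PER_CPU(unsigned long, emulation_count);

    void account_emulation(void)
    {
        /*
         * Without this pair, a preemption point between reading and
         * writing the per-CPU slot could migrate the vCPU to another
         * pCPU, so the increment would land in (or race with) the
         * wrong pCPU's copy of the variable.
         */
        preempt_disable();
        this_cpu(emulation_count)++;
        preempt_enable();
    }

Auditing Xen for every such implicit assumption is essentially the work
item described above.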
On 24/02/2021 23:58, Volodymyr Babchuk wrote:
> And I am not mentioning x86 support there...
x86 uses per-pCPU stacks, not per-vCPU stacks.
Transcribing from an old thread which happened in private as part of an
XSA discussion, concerning the implications of trying to change this.
~Andrew
-----8<-----
Here is a partial list off the top of my head of the practical problems
you're going to have to solve.
Introduction of new SpectreRSB vulnerable gadgets. I'm really close to
being able to drop RSB stuffing and recover some performance in Xen.
CPL0 entrypoints need updating across schedule. SYSCALL entry would
need to become a stub per vcpu, rather than the current stub per pcpu.
This requires reintroducing a writeable mapping to the TSS (doable) and
a shadow stack switch of active stacks (This corner case is so broken it
looks to be a blocker for CET-SS support in Linux, and is resulting in
some conversation about tweaking Shstk's in future processors).
All per-cpu variables stop working. You'd need to rewrite Xen to use
%gs for TLS which will have churn in the PV logic, and introduce the x86
architectural corner cases of running with an invalid %gs. Xen has been
saved from a large number of privilege escalation vulnerabilities in
common with Linux and Windows by the fact that we don't use %gs, so
anyone trying to do this is going to have to come up with some concrete
way of proving that the corner cases are covered.
Hi Andrew,

Andrew Cooper writes:

> On 24/02/2021 23:58, Volodymyr Babchuk wrote:
>> And I am not mentioning x86 support there...
>
> x86 uses per-pCPU stacks, not per-vCPU stacks.
>
> Transcribing from an old thread which happened in private as part of an
> XSA discussion, concerning the implications of trying to change this.
>
> ~Andrew
>
> -----8<-----
>
> Here is a partial list off the top of my head of the practical problems
> you're going to have to solve.
>
> Introduction of new SpectreRSB vulnerable gadgets. I'm really close to
> being able to drop RSB stuffing and recover some performance in Xen.
>
> CPL0 entrypoints need updating across schedule. SYSCALL entry would
> need to become a stub per vcpu, rather than the current stub per pcpu.
> This requires reintroducing a writeable mapping to the TSS (doable) and
> a shadow stack switch of active stacks (This corner case is so broken it
> looks to be a blocker for CET-SS support in Linux, and is resulting in
> some conversation about tweaking Shstk's in future processors).
>
> All per-cpu variables stop working. You'd need to rewrite Xen to use
> %gs for TLS which will have churn in the PV logic, and introduce the x86
> architectural corner cases of running with an invalid %gs. Xen has been
> saved from a large number of privilege escalation vulnerabilities in
> common with Linux and Windows by the fact that we don't use %gs, so
> anyone trying to do this is going to have to come up with some concrete
> way of proving that the corner cases are covered.

Thank you. This is exactly what I needed. I am not a big specialist in
x86, but from what you said, I can see that there is no easy way to switch
contexts while in hypervisor mode.

Then I want to return to the task domain idea, which you mentioned in the
other thread. If I got it right, it would allow us to:

1. Implement asynchronous hypercalls for cases when there is no reason
   to hold the calling vCPU in the hypervisor for the whole call duration.

2. Improve time accounting, as tasklets can be scheduled to run in this
   task domain.

I skimmed through the ML archives, but didn't find any discussion about it.
As I see it, its implementation would be close to the idle domain
implementation, but a little different.
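
Purely as an illustration of the asynchronous-hypercall shape being
discussed here, a minimal sketch follows. This is not an existing Xen
interface: start_async_op(), do_heavy_work() and async_op_tasklet are
invented names, and while tasklet_init()/tasklet_schedule() do exist in
Xen, their exact signatures should be checked against xen/tasklet.h before
taking this literally.

    #include <xen/errno.h>
    #include <xen/sched.h>
    #include <xen/tasklet.h>

    static struct tasklet async_op_tasklet;

    /* The long-running part runs here, off the calling vCPU's context. */
    static void do_heavy_work(void *data)
    {
        struct domain *d = data;

        /* e.g. walk a long list of pages belonging to domain d ... */
        (void)d;
    }

    /* Hypothetical entry point of an "async" domctl-style operation. */
    static long start_async_op(struct domain *d)
    {
        tasklet_init(&async_op_tasklet, do_heavy_work, d);
        tasklet_schedule(&async_op_tasklet);

        /* Report "queued, not completed" back to the toolstack. */
        return -EINPROGRESS;
    }

A real interface would of course need per-operation state, a way for the
caller to poll or be notified about completion, and a decision about which
vCPU or domain the deferred work is accounted to, which is exactly where
the proposed task domain would come in.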
> On Feb 24, 2021, at 11:37 PM, Volodymyr Babchuk <volodymyr_babchuk@epam.com> wrote:
>
>> Hypervisor/virt properties are different to both a kernel-only-RTOS, and
>> regular userspace. This was why I gave you some specific extra scenarios
>> to do latency testing with, so you could make a fair comparison of
>> "extra overhead caused by Xen" separate from "overhead due to
>> fundamental design constraints of using virt".
>
> I can't see any fundamental constraints there. I see how virtualization
> architecture can influence context switch time: how many actions you
> need to switch one vCPU to another. I have in mind low level things
> there: reprogram MMU to use another set of tables, reprogram your
> interrupt controller, timer, etc. Of course, you can't get latency lower
> than context switch time. This is the only fundamental constraint I can
> see.

Well, suppose you have two domains, A and B, both of which control hardware
with hard real-time requirements. And suppose that A has just started
handling a latency-sensitive interrupt when a latency-sensitive interrupt
comes in for B. You might well preempt A and let B run for a full
timeslice, causing A's interrupt handler to be delayed by a significant
amount. Preventing that sort of thing would be a much more tricky issue to
get right.

>> If you want timely interrupt handling, you either need to partition your
>> workloads by the long-running-ness of their hypercalls, or not have
>> long-running hypercalls.
>
> ... or do long-running tasks asynchronously. I believe, for most
> domctls and sysctls there is no need to hold calling vCPU in hypervisor
> mode at all.
>
>> I remain unconvinced that preemption is a sensible fix to the problem
>> you're trying to solve.
>
> Well, this is the purpose of this little experiment. I want to discuss
> different approaches and to estimate amount of required efforts. By the
> way, from x86 point of view, how hard to switch vCPU context while it is
> running in hypervisor mode?

I'm not necessarily opposed to introducing preemption, but the more we ask
about things, the more complex things begin to look. The idea of
introducing an async framework to deal with long-running hypercalls is a
huge engineering and design effort, not just for Xen, but for all future
callers of the interface.

The claim in the cover letter was that "[s]ome hypercalls can not be
preempted at all". I looked at the reference, and it looks like you're
referring to this:

"I brooded over ways to make [alloc_domheap_pages()] preemptible. But it
is a) located deep in call chain and b) used not only by hypercalls. So I
can't see an easy way to make it preemptible."

Let's assume for the sake of argument that preventing delays due to
alloc_domheap_pages() would require significant rearchitecting of the code.
And let's even assume that there are 2-3 other such knotty issues making
for unacceptably long hypercalls. Will identifying and tracking down those
issues really be more effort than introducing preemption, introducing async
operations, and all the other things we've been talking about?

One thing that might be interesting is to add some sort of metrics
(disabled in Kconfig by default); e.g.:

1. On entry to a hypercall, take a timestamp.

2. On every hypercall_preempt() call, take another timestamp and see how
   much time has passed without a preempt, and reset the timestamp count;
   also do a check on exit of the hypercall.

We could start by trying to do stats and figuring out which hypercalls go
the longest without preemption, as a way to guide the optimization efforts.
Then, as we get that number down, we could add ASSERT()s that the time is
never longer than a certain amount, and add runs like that to osstest to
make sure there are no regressions introduced.

I agree that hypercall continuations are complex; and you're right that the
fact that the hypercall continuation may never be called limits where
preemption can happen. But making the entire hypervisor preemption-friendly
is also quite complex in its own way; it's not immediately obvious to me
from this thread that hypervisor preemption is less complex.

 -George
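
A minimal sketch of the metric described above, just to make the idea
concrete. This is not existing Xen code: the hook names
hypercall_timing_enter() and hypercall_timing_checkpoint() are invented,
and in a real patch the checkpoint would be wired into the existing
preemption check (hypercall_preempt_check() in today's tree) and into the
hypercall exit path, presumably behind the suggested Kconfig option.

    #include <xen/percpu.h>
    #include <xen/time.h>

    /* Start of the current "no preemption happened yet" segment. */
    static DEFINE_PER_CPU(s_time_t, hcall_segment_start);
    /* Worst observed gap between two preemption opportunities. */
    static DEFINE_PER_CPU(s_time_t, hcall_worst_gap);

    /* Call on hypercall entry. */
    void hypercall_timing_enter(void)
    {
        this_cpu(hcall_segment_start) = NOW();
    }

    /* Call from the preemption check and again on hypercall exit. */
    void hypercall_timing_checkpoint(void)
    {
        s_time_t gap = NOW() - this_cpu(hcall_segment_start);

        if ( gap > this_cpu(hcall_worst_gap) )
            this_cpu(hcall_worst_gap) = gap;

        /* Reset the segment, as suggested in step 2 above. */
        this_cpu(hcall_segment_start) = NOW();
    }

Per-pCPU state is enough here only because a hypercall currently cannot be
preempted and migrated mid-flight; with full preemption the counters would
have to live in struct vcpu instead. Once the worst-case numbers come down,
the same checkpoint is where an ASSERT() against a fixed budget could live.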
Hi,

Volodymyr Babchuk writes:

> Hi Andrew,
>
> Andrew Cooper writes:
>
>> On 24/02/2021 23:58, Volodymyr Babchuk wrote:
>>> And I am not mentioning x86 support there...
>>
>> x86 uses per-pCPU stacks, not per-vCPU stacks.
>>
>> Transcribing from an old thread which happened in private as part of an
>> XSA discussion, concerning the implications of trying to change this.
>>
>> ~Andrew
>>
>> -----8<-----
>>
>> Here is a partial list off the top of my head of the practical problems
>> you're going to have to solve.
>>
>> Introduction of new SpectreRSB vulnerable gadgets. I'm really close to
>> being able to drop RSB stuffing and recover some performance in Xen.
>>
>> CPL0 entrypoints need updating across schedule. SYSCALL entry would
>> need to become a stub per vcpu, rather than the current stub per pcpu.
>> This requires reintroducing a writeable mapping to the TSS (doable) and
>> a shadow stack switch of active stacks (This corner case is so broken it
>> looks to be a blocker for CET-SS support in Linux, and is resulting in
>> some conversation about tweaking Shstk's in future processors).
>>
>> All per-cpu variables stop working. You'd need to rewrite Xen to use
>> %gs for TLS which will have churn in the PV logic, and introduce the x86
>> architectural corner cases of running with an invalid %gs. Xen has been
>> saved from a large number of privilege escalation vulnerabilities in
>> common with Linux and Windows by the fact that we don't use %gs, so
>> anyone trying to do this is going to have to come up with some concrete
>> way of proving that the corner cases are covered.
>
> Thank you. This is exactly what I needed. I am not a big specialist in
> x86, but from what you said, I can see that there is no easy way to switch
> contexts while in hypervisor mode.
>
> Then I want to return to the task domain idea, which you mentioned in the
> other thread. If I got it right, it would allow us to:
>
> 1. Implement asynchronous hypercalls for cases when there is no reason
>    to hold the calling vCPU in the hypervisor for the whole call duration.

Okay, I was too overexcited there. I mean, surely it is possible to
implement async hypercalls, but there is no immediate profit in this: such
a hypercall can't be preempted anyway. On an SMP system you can offload the
hypercall to another core, but that's basically all.

> I skimmed through the ML archives, but didn't find any discussion about it.

Maybe you can give me a hint on how to find it?

> As I see it, its implementation would be close to the idle domain
> implementation, but a little different.