Message ID: 20210223023428.757694-1-volodymyr_babchuk@epam.com
Series: Preemption in hypervisor (ARM only)
On 23/02/2021 02:34, Volodymyr Babchuk wrote: > Hello community, Hi Volodymyr, Thank you for the proposal, I like the like of been able to preempt the vCPU thread. This would make easier to implement some of the device emulation in Xen (e.g. vGIC, SMMU). > > Subject of this cover letter is quite self-explanatory. This patch > series implements PoC for preemption in hypervisor mode. > > This is the sort of follow-up to recent discussion about latency > ([1]). > > Motivation > ========== > > It is well known that Xen is not preemptable. On other words, it is > impossible to switch vCPU contexts while running in hypervisor > mode. Only one place where scheduling decision can be made and one > vCPU can be replaced with another is the exit path from the hypervisor > mode. The one exception are Idle vCPUs, which never leaves the > hypervisor mode for obvious reasons. > > This leads to a number of problems. This list is not comprehensive. It > lists only things that I or my colleagues encountered personally. > > Long-running hypercalls. Due to nature of some hypercalls they can > execute for arbitrary long time. Mostly those are calls that deal with > long list of similar actions, like memory pages processing. To deal > with this issue Xen employs most horrific technique called "hypercall > continuation". I agree the code is not nice. However, it does serve another purpose than ... > When code that handles hypercall decides that it should > be preempted, it basically updates the hypercall parameters, and moves > guest PC one instruction back. This causes guest to re-execute the > hypercall with altered parameters, which will allow hypervisor to > continue hypercall execution later. ... just rescheduling the vCPU. It will also give the opportunity for the guest to handle interrupts. If you don't return to the guest, then risk to get an RCU sched stall on that the vCPU (some hypercalls can take really really long). > This approach itself have obvious > problems: code that executes hypercall is responsible for preemption, > preemption checks are infrequent (because they are costly by > themselves), hypercall execution state is stored in guest-controlled > area, we rely on guest's good will to continue the hypercall. Why is it a problem to rely on guest's good will? The hypercalls should be preempted at a boundary that is safe to continue. > All this > imposes restrictions on which hypercalls can be preempted, when they > can be preempted and how to write hypercall handlers. Also, it > requires very accurate coding and already led to at least one > vulnerability - XSA-318. Some hypercalls can not be preempted at all, > like the one mentioned in [1]. > > Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs, > which are supposed to run when the system is idle. If hypervisor needs > to execute own tasks that are required to run right now, it have no > other way than to execute them on current vCPU. But scheduler does not > know that hypervisor executes hypervisor task and accounts spent time > to a domain. This can lead to domain starvation. > > Also, absence of hypervisor threads leads to absence of high-level > synchronization primitives like mutexes, conditional variables, > completions, etc. This leads to two problems: we need to use spinlocks > everywhere and we have problems when porting device drivers from linux > kernel. > > Proposed solution > ================= > > It is quite obvious that to fix problems above we need to allow > preemption in hypervisor mode. 
I am not familiar with x86 side, but > for the ARM it was surprisingly easy to implement. Basically, vCPU > context in hypervisor mode is determined by its stack at general > purpose registers. And __context_switch() function perfectly switches > them when running in hypervisor mode. So there are no hard > restrictions, why it should be called only in leave_hypervisor() path. > > The obvious question is: when we should to try to preempt running > vCPU? And answer is: when there was an external event. This means > that we should try to preempt only when there was an interrupt request > where we are running in hypervisor mode. On ARM, in this case function > do_trap_irq() is called. Problem is that IRQ handler can be called > when vCPU is already in atomic state (holding spinlock, for > example). In this case we should try to preempt right after leaving > atomic state. This is basically all the idea behind this PoC. > > Now, about the series composition. > Patches > > sched: core: save IRQ state during locking > sched: rt: save IRQ state during locking > sched: credit2: save IRQ state during locking > preempt: use atomic_t to for preempt_count > arm: setup: disable preemption during startup > arm: context_switch: allow to run with IRQs already disabled > > prepare the groundwork for the rest of PoC. It appears that not all > code is ready to be executed in IRQ state, and schedule() now can be > called at end of do_trap_irq(), which technically is considered IRQ > handler state. Also, it is unwise to try preempt things when we are > still booting, so ween to enable atomic context during the boot > process. I am really surprised that this is the only changes necessary in Xen. For a first approach, we may want to be conservative when the preemption is happening as I am not convinced that all the places are safe to preempt. > > Patches > preempt: add try_preempt() function > sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] > arm: traps: try to preempt before leaving IRQ handler > > are basically the core of this PoC. try_preempt() function tries to > preempt vCPU when either called by IRQ handler and when leaving atomic > state. Scheduler now enters atomic state to ensure that it will not > preempt self. do_trap_irq() calls try_preempt() to initiate preemption. AFAICT, try_preempt() will deal with the rescheduling. But how about softirqs? Don't we want to handle them in try_preempt() as well? [...] > Conclusion > ========== > > My main intention is to begin discussion of hypervisor preemption. As > I showed, it is doable right away and provides some immediate > benefits. I do understand that proper implementation requires much > more efforts. But we are ready to do this work if community is > interested in it. > > Just to reiterate main benefits: > > 1. More controllable latency. On embedded systems customers care about > such things. Is the plan to only offer preemptible Xen? > > 2. We can get rid of hypercall continuations, which will results in > simpler and more secure code. I don't think you can get rid of it completely without risking the OS to receive RCU sched stall. So you would need to handle them hypercalls differently. > > 3. We can implement proper hypervisor threads, mutexes, completions > and so on. This will make scheduling more accurate, ease up linux > drivers porting and implementation of more complex features in the > hypervisor. 
> > > > [1] https://marc.info/?l=xen-devel&m=161049529916656&w=2 > > Volodymyr Babchuk (10): > sched: core: save IRQ state during locking > sched: rt: save IRQ state during locking > sched: credit2: save IRQ state during locking > preempt: use atomic_t to for preempt_count > preempt: add try_preempt() function > arm: setup: disable preemption during startup > sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] > arm: context_switch: allow to run with IRQs already disabled > arm: traps: try to preempt before leaving IRQ handler > [HACK] alloc pages: enable preemption early > > xen/arch/arm/domain.c | 18 ++++++++++----- > xen/arch/arm/setup.c | 4 ++++ > xen/arch/arm/traps.c | 7 ++++++ > xen/common/memory.c | 4 ++-- > xen/common/page_alloc.c | 21 ++--------------- > xen/common/preempt.c | 36 ++++++++++++++++++++++++++--- > xen/common/sched/core.c | 46 +++++++++++++++++++++++--------------- > xen/common/sched/credit2.c | 5 +++-- > xen/common/sched/rt.c | 10 +++++---- > xen/include/xen/preempt.h | 17 +++++++++----- > 10 files changed, 109 insertions(+), 59 deletions(-) > Cheers,
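For reference, the "hypercall continuation" pattern debated above looks roughly like the sketch below. The loop body, the argument encoding and process_one_extent() are illustrative placeholders; hypercall_preempt_check() and hypercall_create_continuation() are the real Xen primitives involved.

    /*
     * Sketch of the continuation pattern: a long-running operation
     * periodically polls for pending work and, instead of being switched
     * out in place, re-encodes its progress into the hypercall arguments.
     * The guest PC is wound back, so the guest re-issues the call and the
     * handler resumes from 'i'.
     */
    static long example_memory_op(unsigned long cmd, unsigned long start,
                                  unsigned long nr_extents)
    {
        unsigned long i;

        for ( i = start; i < nr_extents; i++ )
        {
            /* The check is not free, so it is done only every so often. */
            if ( i != start && !(i & 0x3f) && hypercall_preempt_check() )
                return hypercall_create_continuation(
                    __HYPERVISOR_memory_op, "ll", cmd, i);

            process_one_extent(cmd, i);   /* placeholder for the real work */
        }

        return 0;
    }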
Hi Julien, Julien Grall writes: > On 23/02/2021 02:34, Volodymyr Babchuk wrote: >> Hello community, > > Hi Volodymyr, > > Thank you for the proposal, I like the like of been able to preempt > the vCPU thread. This would make easier to implement some of the > device emulation in Xen (e.g. vGIC, SMMU). Yes, emulation is the other topic that I didn't mentioned. Also, it could lift some restrictions in OP-TEE mediator code as well. >> Subject of this cover letter is quite self-explanatory. This patch >> series implements PoC for preemption in hypervisor mode. >> This is the sort of follow-up to recent discussion about latency >> ([1]). >> Motivation >> ========== >> It is well known that Xen is not preemptable. On other words, it is >> impossible to switch vCPU contexts while running in hypervisor >> mode. Only one place where scheduling decision can be made and one >> vCPU can be replaced with another is the exit path from the hypervisor >> mode. The one exception are Idle vCPUs, which never leaves the >> hypervisor mode for obvious reasons. >> This leads to a number of problems. This list is not >> comprehensive. It >> lists only things that I or my colleagues encountered personally. >> Long-running hypercalls. Due to nature of some hypercalls they can >> execute for arbitrary long time. Mostly those are calls that deal with >> long list of similar actions, like memory pages processing. To deal >> with this issue Xen employs most horrific technique called "hypercall >> continuation". > > I agree the code is not nice. However, it does serve another purpose > than ... > >> When code that handles hypercall decides that it should >> be preempted, it basically updates the hypercall parameters, and moves >> guest PC one instruction back. This causes guest to re-execute the >> hypercall with altered parameters, which will allow hypervisor to >> continue hypercall execution later. > > ... just rescheduling the vCPU. It will also give the opportunity for > the guest to handle interrupts. > > If you don't return to the guest, then risk to get an RCU sched stall > on that the vCPU (some hypercalls can take really really long). Ah yes, you are right. I'd only wish that hypervisor saved context of hypercall on it's side... I have example of OP-TEE before my eyes. They have special return code "task was interrupted" and they use separate call "continue execution of interrupted task", which takes opaque context handle as a parameter. With this approach state of interrupted call never leaks to rest of the system. > >> This approach itself have obvious >> problems: code that executes hypercall is responsible for preemption, >> preemption checks are infrequent (because they are costly by >> themselves), hypercall execution state is stored in guest-controlled >> area, we rely on guest's good will to continue the hypercall. > > Why is it a problem to rely on guest's good will? The hypercalls > should be preempted at a boundary that is safe to continue. Yes, and it imposes restrictions on how to write hypercall handler. In other words, there are much more places in hypercall handler code where it can be preempted than where hypercall continuation can be used. For example, you can preempt hypercall that holds a mutex, but of course you can't create an continuation point in such place. >> All this >> imposes restrictions on which hypercalls can be preempted, when they >> can be preempted and how to write hypercall handlers. 
Also, it >> requires very accurate coding and already led to at least one >> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >> like the one mentioned in [1]. >> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle >> vCPUs, >> which are supposed to run when the system is idle. If hypervisor needs >> to execute own tasks that are required to run right now, it have no >> other way than to execute them on current vCPU. But scheduler does not >> know that hypervisor executes hypervisor task and accounts spent time >> to a domain. This can lead to domain starvation. >> Also, absence of hypervisor threads leads to absence of high-level >> synchronization primitives like mutexes, conditional variables, >> completions, etc. This leads to two problems: we need to use spinlocks >> everywhere and we have problems when porting device drivers from linux >> kernel. >> Proposed solution >> ================= >> It is quite obvious that to fix problems above we need to allow >> preemption in hypervisor mode. I am not familiar with x86 side, but >> for the ARM it was surprisingly easy to implement. Basically, vCPU >> context in hypervisor mode is determined by its stack at general >> purpose registers. And __context_switch() function perfectly switches >> them when running in hypervisor mode. So there are no hard >> restrictions, why it should be called only in leave_hypervisor() path. >> The obvious question is: when we should to try to preempt running >> vCPU? And answer is: when there was an external event. This means >> that we should try to preempt only when there was an interrupt request >> where we are running in hypervisor mode. On ARM, in this case function >> do_trap_irq() is called. Problem is that IRQ handler can be called >> when vCPU is already in atomic state (holding spinlock, for >> example). In this case we should try to preempt right after leaving >> atomic state. This is basically all the idea behind this PoC. >> Now, about the series composition. >> Patches >> sched: core: save IRQ state during locking >> sched: rt: save IRQ state during locking >> sched: credit2: save IRQ state during locking >> preempt: use atomic_t to for preempt_count >> arm: setup: disable preemption during startup >> arm: context_switch: allow to run with IRQs already disabled >> prepare the groundwork for the rest of PoC. It appears that not all >> code is ready to be executed in IRQ state, and schedule() now can be >> called at end of do_trap_irq(), which technically is considered IRQ >> handler state. Also, it is unwise to try preempt things when we are >> still booting, so ween to enable atomic context during the boot >> process. > > I am really surprised that this is the only changes necessary in > Xen. For a first approach, we may want to be conservative when the > preemption is happening as I am not convinced that all the places are > safe to preempt. > Well, I can't say that I ran extensive tests, but I played with this for some time and it seemed quite stable. Of course, I had some problems with RTDS... As I see it, Xen is already supports SMP, so all places where races are possible should already be covered by spinlocks or taken into account by some other means. Places which may not be safe to preempt are clustered around task management code itself: schedulers, xen entry/exit points, vcpu creation/destruction and such. For example, for sure we do not want to destroy vCPU which was preempted in hypervisor mode. I didn't covered this case, by the way. 
>> Patches >> preempt: add try_preempt() function >> sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] >> arm: traps: try to preempt before leaving IRQ handler >> are basically the core of this PoC. try_preempt() function tries to >> preempt vCPU when either called by IRQ handler and when leaving atomic >> state. Scheduler now enters atomic state to ensure that it will not >> preempt self. do_trap_irq() calls try_preempt() to initiate preemption. > > AFAICT, try_preempt() will deal with the rescheduling. But how about > softirqs? Don't we want to handle them in try_preempt() as well? Well, yes and no. We have the following softirqs: TIMER_SOFTIRQ - should be called, I believe RCU_SOFTIRQ - I'm not sure about this, but probably no SCHED_SLAVE_SOFTIRQ - no SCHEDULE_SOFTIRQ - no NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate thread, maybe? TASKLET_SOFTIRQ - should be moved to a separate thread So, looks like only timers should be handled for sure. > > [...] > >> Conclusion >> ========== >> My main intention is to begin discussion of hypervisor >> preemption. As >> I showed, it is doable right away and provides some immediate >> benefits. I do understand that proper implementation requires much >> more efforts. But we are ready to do this work if community is >> interested in it. >> Just to reiterate main benefits: >> 1. More controllable latency. On embedded systems customers care >> about >> such things. > > Is the plan to only offer preemptible Xen? > Sorry, didn't get the question. >> 2. We can get rid of hypercall continuations, which will results in >> simpler and more secure code. > > I don't think you can get rid of it completely without risking the OS > to receive RCU sched stall. So you would need to handle them > hypercalls differently. Agree. I believe that continuation context should reside in hypervisor. Those changes are not connected to preemption per se and can be implemented separately. But we can discuss them there, of course. [...]
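To make the OP-TEE-style alternative mentioned above concrete, a hypervisor-side continuation could be shaped roughly as below. Almost every identifier here is hypothetical (only hypercall_preempt_check() and current are real); the point is only that the interrupted-call state stays out of guest-controlled memory.

    /*
     * Hypothetical sketch of an OP-TEE-style interface: the hypervisor
     * keeps the interrupted state on its own side and reports only an
     * opaque handle, which the guest passes to a separate "continue
     * interrupted call" hypercall.
     */
    struct hcall_ctx {
        uint64_t handle;           /* opaque token reported to the guest */
        unsigned long progress;    /* hypervisor-private resume point */
        /* ... whatever per-call state is needed ... */
    };

    long do_example_op(unsigned long nr_items)
    {
        struct hcall_ctx *ctx = get_or_alloc_ctx(current);    /* hypothetical */

        for ( ; ctx->progress < nr_items; ctx->progress++ )
        {
            process_item(ctx->progress);                      /* hypothetical */

            if ( hypercall_preempt_check() )
            {
                /* Report "interrupted" plus ctx->handle to the guest; it
                 * resumes through a dedicated "continue" hypercall. */
                set_interrupted_result(ctx->handle);          /* hypothetical */
                return -EAGAIN;      /* placeholder "interrupted" code */
            }
        }

        free_ctx(ctx);                                        /* hypothetical */
        return 0;
    }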
On 23/02/2021 12:06, Volodymyr Babchuk wrote: > > Hi Julien, Hi Volodymyr, > Julien Grall writes: >> On 23/02/2021 02:34, Volodymyr Babchuk wrote: >> ... just rescheduling the vCPU. It will also give the opportunity for >> the guest to handle interrupts. >> >> If you don't return to the guest, then risk to get an RCU sched stall >> on that the vCPU (some hypercalls can take really really long). > > Ah yes, you are right. I'd only wish that hypervisor saved context of > hypercall on it's side... > > I have example of OP-TEE before my eyes. They have special return code > "task was interrupted" and they use separate call "continue execution of > interrupted task", which takes opaque context handle as a > parameter. With this approach state of interrupted call never leaks to > rest of the system. Feel free to suggest a new approach for the hypercals. >> >>> This approach itself have obvious >>> problems: code that executes hypercall is responsible for preemption, >>> preemption checks are infrequent (because they are costly by >>> themselves), hypercall execution state is stored in guest-controlled >>> area, we rely on guest's good will to continue the hypercall. >> >> Why is it a problem to rely on guest's good will? The hypercalls >> should be preempted at a boundary that is safe to continue. > > Yes, and it imposes restrictions on how to write hypercall > handler. > In other words, there are much more places in hypercall handler code > where it can be preempted than where hypercall continuation can be > used. For example, you can preempt hypercall that holds a mutex, but of > course you can't create an continuation point in such place. I disagree, you can create continuation point in such place. Although it will be more complex because you have to make sure you break the work in a restartable place. I would also like to point out that preemption also have some drawbacks. With RT in mind, you have to deal with priority inversion (e.g. a lower priority vCPU held a mutex that is required by an higher priority). Outside of RT, you have to be careful where mutex are held. In your earlier answer, you suggested to held mutex for the memory allocation. If you do that, then it means a domain A can block allocation for domain B as it helds the mutex. This can lead to quite serious problem if domain A cannot run (because it exhausted its credit) for a long time. > >>> All this >>> imposes restrictions on which hypercalls can be preempted, when they >>> can be preempted and how to write hypercall handlers. Also, it >>> requires very accurate coding and already led to at least one >>> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >>> like the one mentioned in [1]. >>> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle >>> vCPUs, >>> which are supposed to run when the system is idle. If hypervisor needs >>> to execute own tasks that are required to run right now, it have no >>> other way than to execute them on current vCPU. But scheduler does not >>> know that hypervisor executes hypervisor task and accounts spent time >>> to a domain. This can lead to domain starvation. >>> Also, absence of hypervisor threads leads to absence of high-level >>> synchronization primitives like mutexes, conditional variables, >>> completions, etc. This leads to two problems: we need to use spinlocks >>> everywhere and we have problems when porting device drivers from linux >>> kernel. 
>>> Proposed solution >>> ================= >>> It is quite obvious that to fix problems above we need to allow >>> preemption in hypervisor mode. I am not familiar with x86 side, but >>> for the ARM it was surprisingly easy to implement. Basically, vCPU >>> context in hypervisor mode is determined by its stack at general >>> purpose registers. And __context_switch() function perfectly switches >>> them when running in hypervisor mode. So there are no hard >>> restrictions, why it should be called only in leave_hypervisor() path. >>> The obvious question is: when we should to try to preempt running >>> vCPU? And answer is: when there was an external event. This means >>> that we should try to preempt only when there was an interrupt request >>> where we are running in hypervisor mode. On ARM, in this case function >>> do_trap_irq() is called. Problem is that IRQ handler can be called >>> when vCPU is already in atomic state (holding spinlock, for >>> example). In this case we should try to preempt right after leaving >>> atomic state. This is basically all the idea behind this PoC. >>> Now, about the series composition. >>> Patches >>> sched: core: save IRQ state during locking >>> sched: rt: save IRQ state during locking >>> sched: credit2: save IRQ state during locking >>> preempt: use atomic_t to for preempt_count >>> arm: setup: disable preemption during startup >>> arm: context_switch: allow to run with IRQs already disabled >>> prepare the groundwork for the rest of PoC. It appears that not all >>> code is ready to be executed in IRQ state, and schedule() now can be >>> called at end of do_trap_irq(), which technically is considered IRQ >>> handler state. Also, it is unwise to try preempt things when we are >>> still booting, so ween to enable atomic context during the boot >>> process. >> >> I am really surprised that this is the only changes necessary in >> Xen. For a first approach, we may want to be conservative when the >> preemption is happening as I am not convinced that all the places are >> safe to preempt. >> > > Well, I can't say that I ran extensive tests, but I played with this for > some time and it seemed quite stable. Of course, I had some problems > with RTDS... > > As I see it, Xen is already supports SMP, so all places where races are > possible should already be covered by spinlocks or taken into account by > some other means. That's correct for shared resources. I am more worried for any hypercalls that expected to run more or less continuously (e.g not taking into account interrupt) on the same pCPU. > > Places which may not be safe to preempt are clustered around task > management code itself: schedulers, xen entry/exit points, vcpu > creation/destruction and such. > > For example, for sure we do not want to destroy vCPU which was preempted > in hypervisor mode. I didn't covered this case, by the way. > >>> Patches >>> preempt: add try_preempt() function >>> sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] >>> arm: traps: try to preempt before leaving IRQ handler >>> are basically the core of this PoC. try_preempt() function tries to >>> preempt vCPU when either called by IRQ handler and when leaving atomic >>> state. Scheduler now enters atomic state to ensure that it will not >>> preempt self. do_trap_irq() calls try_preempt() to initiate preemption. >> >> AFAICT, try_preempt() will deal with the rescheduling. But how about >> softirqs? Don't we want to handle them in try_preempt() as well? > > Well, yes and no. 
We have the following softirqs: > > TIMER_SOFTIRQ - should be called, I believe > RCU_SOFTIRQ - I'm not sure about this, but probably no When would you call RCU callback then? > SCHED_SLAVE_SOFTIRQ - no > SCHEDULE_SOFTIRQ - no > NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate > thread, maybe? > TASKLET_SOFTIRQ - should be moved to a separate thread > > So, looks like only timers should be handled for sure. > >> >> [...] >> >>> Conclusion >>> ========== >>> My main intention is to begin discussion of hypervisor >>> preemption. As >>> I showed, it is doable right away and provides some immediate >>> benefits. I do understand that proper implementation requires much >>> more efforts. But we are ready to do this work if community is >>> interested in it. >>> Just to reiterate main benefits: >>> 1. More controllable latency. On embedded systems customers care >>> about >>> such things. >> >> Is the plan to only offer preemptible Xen? >> > > Sorry, didn't get the question. What's your plan for the preemption support? Will an admin be able to configure Xen to be either preemptible or not? Cheers,
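Since the patches themselves are not quoted in this thread, here is a hypothetical reconstruction of the try_preempt()/preempt_count mechanism discussed above, assuming Linux-style atomic and per-CPU helpers; sched_preempt() and the exact bookkeeping are assumptions, not the actual PoC code.

    /*
     * Hypothetical sketch: an atomic per-CPU preemption counter plus a
     * try_preempt() invoked from the IRQ exit path and when leaving an
     * atomic section.
     */
    DEFINE_PER_CPU(atomic_t, hyp_preempt_count);
    DEFINE_PER_CPU(bool, hyp_preempt_pending);

    void try_preempt(void)
    {
        /* Never context-switch inside an atomic section (or during boot). */
        if ( atomic_read(&this_cpu(hyp_preempt_count)) )
        {
            this_cpu(hyp_preempt_pending) = true;  /* retried on enable */
            return;
        }

        this_cpu(hyp_preempt_pending) = false;
        /*
         * Pick the next vCPU and switch to it via __context_switch() even
         * though we are still in hypervisor mode -- the core idea of the
         * PoC.
         */
        sched_preempt();                           /* assumed scheduling hook */
    }

    #define hyp_preempt_enable() do {                              \
        if ( atomic_dec_and_test(&this_cpu(hyp_preempt_count)) &&  \
             this_cpu(hyp_preempt_pending) )                       \
            try_preempt();                                         \
    } while ( 0 )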
On 23/02/2021 02:34, Volodymyr Babchuk wrote: > Hello community, > > Subject of this cover letter is quite self-explanatory. This patch > series implements PoC for preemption in hypervisor mode. > > This is the sort of follow-up to recent discussion about latency > ([1]). > > Motivation > ========== > > It is well known that Xen is not preemptable. On other words, it is > impossible to switch vCPU contexts while running in hypervisor > mode. Only one place where scheduling decision can be made and one > vCPU can be replaced with another is the exit path from the hypervisor > mode. The one exception are Idle vCPUs, which never leaves the > hypervisor mode for obvious reasons. > > This leads to a number of problems. This list is not comprehensive. It > lists only things that I or my colleagues encountered personally. > > Long-running hypercalls. Due to nature of some hypercalls they can > execute for arbitrary long time. Mostly those are calls that deal with > long list of similar actions, like memory pages processing. To deal > with this issue Xen employs most horrific technique called "hypercall > continuation". When code that handles hypercall decides that it should > be preempted, it basically updates the hypercall parameters, and moves > guest PC one instruction back. This causes guest to re-execute the > hypercall with altered parameters, which will allow hypervisor to > continue hypercall execution later. This approach itself have obvious > problems: code that executes hypercall is responsible for preemption, > preemption checks are infrequent (because they are costly by > themselves), hypercall execution state is stored in guest-controlled > area, we rely on guest's good will to continue the hypercall. All this > imposes restrictions on which hypercalls can be preempted, when they > can be preempted and how to write hypercall handlers. Also, it > requires very accurate coding and already led to at least one > vulnerability - XSA-318. Some hypercalls can not be preempted at all, > like the one mentioned in [1]. > > Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs, > which are supposed to run when the system is idle. If hypervisor needs > to execute own tasks that are required to run right now, it have no > other way than to execute them on current vCPU. But scheduler does not > know that hypervisor executes hypervisor task and accounts spent time > to a domain. This can lead to domain starvation. > > Also, absence of hypervisor threads leads to absence of high-level > synchronization primitives like mutexes, conditional variables, > completions, etc. This leads to two problems: we need to use spinlocks > everywhere and we have problems when porting device drivers from linux > kernel. You cannot reenter a guest, even to deliver interrupts, if pre-empted at an arbitrary point in a hypercall. State needs unwinding suitably. Xen's non-preemptible-ness is designed to specifically force you to not implement long-running hypercalls which would interfere with timely interrupt handling in the general case. Hypervisor/virt properties are different to both a kernel-only-RTOS, and regular usespace. This was why I gave you some specific extra scenarios to do latency testing with, so you could make a fair comparison of "extra overhead caused by Xen" separate from "overhead due to fundamental design constraints of using virt". 
Preemption like this will make some benchmarks look better, but it also introduces the ability to create fundamental problems, like preventing any interrupt delivery into a VM for seconds of wallclock time while each vcpu happens to be in a long-running hypercall.

If you want timely interrupt handling, you either need to partition your workloads by the long-running-ness of their hypercalls, or not have long-running hypercalls.

I remain unconvinced that preemption is a sensible fix to the problem you're trying to solve.

~Andrew
Hi Julien, Julien Grall writes: > On 23/02/2021 12:06, Volodymyr Babchuk wrote: >> Hi Julien, > > Hi Volodymyr, > >> Julien Grall writes: >>> On 23/02/2021 02:34, Volodymyr Babchuk wrote: >>> ... just rescheduling the vCPU. It will also give the opportunity for >>> the guest to handle interrupts. >>> >>> If you don't return to the guest, then risk to get an RCU sched stall >>> on that the vCPU (some hypercalls can take really really long). >> Ah yes, you are right. I'd only wish that hypervisor saved context >> of >> hypercall on it's side... >> I have example of OP-TEE before my eyes. They have special return >> code >> "task was interrupted" and they use separate call "continue execution of >> interrupted task", which takes opaque context handle as a >> parameter. With this approach state of interrupted call never leaks to > rest of the system. > > Feel free to suggest a new approach for the hypercals. > I believe, I suggested it right above. There are some corner cases, that should be addressed, of course. >>> >>>> This approach itself have obvious >>>> problems: code that executes hypercall is responsible for preemption, >>>> preemption checks are infrequent (because they are costly by >>>> themselves), hypercall execution state is stored in guest-controlled >>>> area, we rely on guest's good will to continue the hypercall. >>> >>> Why is it a problem to rely on guest's good will? The hypercalls >>> should be preempted at a boundary that is safe to continue. >> Yes, and it imposes restrictions on how to write hypercall >> handler. >> In other words, there are much more places in hypercall handler code >> where it can be preempted than where hypercall continuation can be >> used. For example, you can preempt hypercall that holds a mutex, but of >> course you can't create an continuation point in such place. > > I disagree, you can create continuation point in such place. Although > it will be more complex because you have to make sure you break the > work in a restartable place. Maybe there is some misunderstanding. You can't create hypercall continuation point in a place where you are holding mutex. Because, there is absolutely not guarantee that guest will restart the hypercall. But you can preempt vCPU while holding mutex, because xen owns scheduler and it can guarantee that vCPU will be scheduled eventually to continue the work and release mutex. > I would also like to point out that preemption also have some drawbacks. > With RT in mind, you have to deal with priority inversion (e.g. a > lower priority vCPU held a mutex that is required by an higher > priority). Of course. This is not as simple as "just call scheduler when we want to". > Outside of RT, you have to be careful where mutex are held. In your > earlier answer, you suggested to held mutex for the memory > allocation. If you do that, then it means a domain A can block > allocation for domain B as it helds the mutex. As long as we do not exit to a EL1 with mutex being held, domain A can't block anything. Of course, we have to deal with priority inversion, but malicious domain can't cause DoS. > This can lead to quite serious problem if domain A cannot run (because > it exhausted its credit) for a long time. > I believe, this problem is related to a priority inversion problem and they should be addressed together. >> >>>> All this >>>> imposes restrictions on which hypercalls can be preempted, when they >>>> can be preempted and how to write hypercall handlers. 
Also, it >>>> requires very accurate coding and already led to at least one >>>> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >>>> like the one mentioned in [1]. >>>> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle >>>> vCPUs, >>>> which are supposed to run when the system is idle. If hypervisor needs >>>> to execute own tasks that are required to run right now, it have no >>>> other way than to execute them on current vCPU. But scheduler does not >>>> know that hypervisor executes hypervisor task and accounts spent time >>>> to a domain. This can lead to domain starvation. >>>> Also, absence of hypervisor threads leads to absence of high-level >>>> synchronization primitives like mutexes, conditional variables, >>>> completions, etc. This leads to two problems: we need to use spinlocks >>>> everywhere and we have problems when porting device drivers from linux >>>> kernel. >>>> Proposed solution >>>> ================= >>>> It is quite obvious that to fix problems above we need to allow >>>> preemption in hypervisor mode. I am not familiar with x86 side, but >>>> for the ARM it was surprisingly easy to implement. Basically, vCPU >>>> context in hypervisor mode is determined by its stack at general >>>> purpose registers. And __context_switch() function perfectly switches >>>> them when running in hypervisor mode. So there are no hard >>>> restrictions, why it should be called only in leave_hypervisor() path. >>>> The obvious question is: when we should to try to preempt running >>>> vCPU? And answer is: when there was an external event. This means >>>> that we should try to preempt only when there was an interrupt request >>>> where we are running in hypervisor mode. On ARM, in this case function >>>> do_trap_irq() is called. Problem is that IRQ handler can be called >>>> when vCPU is already in atomic state (holding spinlock, for >>>> example). In this case we should try to preempt right after leaving >>>> atomic state. This is basically all the idea behind this PoC. >>>> Now, about the series composition. >>>> Patches >>>> sched: core: save IRQ state during locking >>>> sched: rt: save IRQ state during locking >>>> sched: credit2: save IRQ state during locking >>>> preempt: use atomic_t to for preempt_count >>>> arm: setup: disable preemption during startup >>>> arm: context_switch: allow to run with IRQs already disabled >>>> prepare the groundwork for the rest of PoC. It appears that not all >>>> code is ready to be executed in IRQ state, and schedule() now can be >>>> called at end of do_trap_irq(), which technically is considered IRQ >>>> handler state. Also, it is unwise to try preempt things when we are >>>> still booting, so ween to enable atomic context during the boot >>>> process. >>> >>> I am really surprised that this is the only changes necessary in >>> Xen. For a first approach, we may want to be conservative when the >>> preemption is happening as I am not convinced that all the places are >>> safe to preempt. >>> >> Well, I can't say that I ran extensive tests, but I played with this >> for >> some time and it seemed quite stable. Of course, I had some problems >> with RTDS... >> As I see it, Xen is already supports SMP, so all places where races >> are >> possible should already be covered by spinlocks or taken into account by >> some other means. > That's correct for shared resources. I am more worried for any > hypercalls that expected to run more or less continuously (e.g not > taking into account interrupt) on the same pCPU. 
Are there many such hypercalls? They can disable preemption if they really need to run on the same pCPU. As I understand, they should be relatively fast, because they can't create continuations anyway. >> Places which may not be safe to preempt are clustered around task >> management code itself: schedulers, xen entry/exit points, vcpu >> creation/destruction and such. >> For example, for sure we do not want to destroy vCPU which was >> preempted >> in hypervisor mode. I didn't covered this case, by the way. >> >>>> Patches >>>> preempt: add try_preempt() function >>>> sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!] >>>> arm: traps: try to preempt before leaving IRQ handler >>>> are basically the core of this PoC. try_preempt() function tries to >>>> preempt vCPU when either called by IRQ handler and when leaving atomic >>>> state. Scheduler now enters atomic state to ensure that it will not >>>> preempt self. do_trap_irq() calls try_preempt() to initiate preemption. >>> >>> AFAICT, try_preempt() will deal with the rescheduling. But how about >>> softirqs? Don't we want to handle them in try_preempt() as well? >> Well, yes and no. We have the following softirqs: >> TIMER_SOFTIRQ - should be called, I believe >> RCU_SOFTIRQ - I'm not sure about this, but probably no > > When would you call RCU callback then? > I'm not sure there. But I think, they should be called in the same place as always: while leaving hypervisor. But I'm not very familiar with RCU, so I may talk nonsense. >> SCHED_SLAVE_SOFTIRQ - no >> SCHEDULE_SOFTIRQ - no >> NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate >> thread, maybe? >> TASKLET_SOFTIRQ - should be moved to a separate thread > >> So, looks like only timers should be handled for sure. >> >>> >>> [...] >>> >>>> Conclusion >>>> ========== >>>> My main intention is to begin discussion of hypervisor >>>> preemption. As >>>> I showed, it is doable right away and provides some immediate >>>> benefits. I do understand that proper implementation requires much >>>> more efforts. But we are ready to do this work if community is >>>> interested in it. >>>> Just to reiterate main benefits: >>>> 1. More controllable latency. On embedded systems customers care >>>> about >>>> such things. >>> >>> Is the plan to only offer preemptible Xen? >>> >> Sorry, didn't get the question. > > What's your plan for the preemption support? Will an admin be able to > configure Xen to be either preemptible or not? Honestly, it would be much easier to enable it unconditionally. But I understand, that this is not feasible. So, I'm looking at a build-time option.
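A hedged illustration of the split sketched above: consume only the softirqs that are wanted on the preemption path and leave the rest to the usual leave-hypervisor path or to dedicated threads. The handler call is a made-up name, since Xen's real TIMER_SOFTIRQ action is a static function registered via open_softirq().

    /* Hypothetical helper for the preemption path. */
    static void preempt_path_softirqs(void)
    {
        unsigned int cpu = smp_processor_id();

        if ( test_and_clear_bit(TIMER_SOFTIRQ, &softirq_pending(cpu)) )
            run_timer_softirq();    /* made-up name for the timer action */

        /* RCU_SOFTIRQ, TASKLET_SOFTIRQ, SCHED*_SOFTIRQ: untouched here. */
    }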
On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> wrote: > > > Hi Julien, > > Julien Grall writes: > > > On 23/02/2021 12:06, Volodymyr Babchuk wrote: > >> Hi Julien, > > > > Hi Volodymyr, > > > >> Julien Grall writes: > >>> On 23/02/2021 02:34, Volodymyr Babchuk wrote: > >>> ... just rescheduling the vCPU. It will also give the opportunity for > >>> the guest to handle interrupts. > >>> > >>> If you don't return to the guest, then risk to get an RCU sched stall > >>> on that the vCPU (some hypercalls can take really really long). > >> Ah yes, you are right. I'd only wish that hypervisor saved context > >> of > >> hypercall on it's side... > >> I have example of OP-TEE before my eyes. They have special return > >> code > >> "task was interrupted" and they use separate call "continue execution of > >> interrupted task", which takes opaque context handle as a > >> parameter. With this approach state of interrupted call never leaks to > rest of the system. > > > > Feel free to suggest a new approach for the hypercals. > > > > I believe, I suggested it right above. There are some corner cases, that > should be addressed, of course. If we wanted a clean break, then possibly yes. But I meant one that doesn't break all the existing users and doesn't put Xen at risk. I don't believe your approach fulfill it. > > >>> > >>>> This approach itself have obvious > >>>> problems: code that executes hypercall is responsible for preemption, > >>>> preemption checks are infrequent (because they are costly by > >>>> themselves), hypercall execution state is stored in guest-controlled > >>>> area, we rely on guest's good will to continue the hypercall. > >>> > >>> Why is it a problem to rely on guest's good will? The hypercalls > >>> should be preempted at a boundary that is safe to continue. > >> Yes, and it imposes restrictions on how to write hypercall > >> handler. > >> In other words, there are much more places in hypercall handler code > >> where it can be preempted than where hypercall continuation can be > >> used. For example, you can preempt hypercall that holds a mutex, but of > >> course you can't create an continuation point in such place. > > > > I disagree, you can create continuation point in such place. Although > > it will be more complex because you have to make sure you break the > > work in a restartable place. > > Maybe there is some misunderstanding. You can't create hypercall > continuation point in a place where you are holding mutex. Because, > there is absolutely not guarantee that guest will restart the > hypercall. I don't think we are disagreeing here. My point is you should rarely need to hold a mutex for a long period, so you could break your work in smaller chunk. In which cases, you can use hypercall continuation. > > But you can preempt vCPU while holding mutex, because xen owns scheduler > and it can guarantee that vCPU will be scheduled eventually to continue > the work and release mutex. The problem is the "eventually". If you are accounting the time spent in the hypervisor to the vCPU A, then there is a possibility that it has exhausted its time slice. In which case, your vCPU A may be sleeping for a while with a mutex held. If another vCPU B needs the mutex, it will either have to wait potentially for a long time or we need to force vCPU A to run on borrowed time. > > > I would also like to point out that preemption also have some drawbacks. > > With RT in mind, you have to deal with priority inversion (e.g. 
a > > lower priority vCPU held a mutex that is required by an higher > > priority). > > Of course. This is not as simple as "just call scheduler when we want > to". Your e-mail made it sounds like it was easy to add preemption in Xen. ;) > > > Outside of RT, you have to be careful where mutex are held. In your > > earlier answer, you suggested to held mutex for the memory > > allocation. If you do that, then it means a domain A can block > > allocation for domain B as it helds the mutex. > > As long as we do not exit to a EL1 with mutex being held, domain A can't > block anything. Of course, we have to deal with priority inversion, but > malicious domain can't cause DoS. It is not really a priority inversion problem outside of RT because all the tasks will have the same priority. It is more a time accounting problem because each vCPU may have a different number of credits. > >>> I am really surprised that this is the only changes necessary in > >>> Xen. For a first approach, we may want to be conservative when the > >>> preemption is happening as I am not convinced that all the places are > >>> safe to preempt. > >>> > >> Well, I can't say that I ran extensive tests, but I played with this > >> for > >> some time and it seemed quite stable. Of course, I had some problems > >> with RTDS... > >> As I see it, Xen is already supports SMP, so all places where races > >> are > >> possible should already be covered by spinlocks or taken into account by > >> some other means. > > That's correct for shared resources. I am more worried for any > > hypercalls that expected to run more or less continuously (e.g not > > taking into account interrupt) on the same pCPU. > > Are there many such hypercalls? They can disable preemption if they > really need to run on the same pCPU. As I understand, they should be > relatively fast, because they can't create continuations anyway. Well, I never tried to make Xen preemptible... My comment is based on the fact that the use preempt_{enable, disable}() was mostly done on a best effort basis. The usual suspects are anything using this_cpu() or interacting with the per-CPU HW registers. From a quick look here a few things (only looked at Arm): * map_domain_page() in particular on arm32 because this is using per-CPU page-tables * guest_atomics_* as this uses this_cpu() * virt_to_mfn() in particular the failure path * Incorrect use (or missing) rcu locking. (Hopefully Juergen's recent work in the RCU mitigate the risk) I can provide guidance, but you will have to go through the code and check what's happening. Cheers,
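The common shape of the hazards listed above can be shown with a small made-up example (example_state is not a real Xen variable): any code that caches this_cpu() state, or otherwise relies on staying on the same pCPU, has to be bracketed once arbitrary preemption is allowed.

    struct example_state {
        unsigned long counter;
    };
    DEFINE_PER_CPU(struct example_state, example_state);   /* illustration only */

    void touch_per_cpu_state(void)
    {
        struct example_state *st;

        preempt_disable();           /* pin ourselves to this pCPU */

        st = &this_cpu(example_state);
        st->counter++;               /* must finish on the same pCPU: being
                                      * migrated here would update the wrong
                                      * pCPU's copy */

        preempt_enable();            /* preemption may happen again */
    }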
Hi Andrew, Andrew Cooper writes: > On 23/02/2021 02:34, Volodymyr Babchuk wrote: >> Hello community, >> >> Subject of this cover letter is quite self-explanatory. This patch >> series implements PoC for preemption in hypervisor mode. >> >> This is the sort of follow-up to recent discussion about latency >> ([1]). >> >> Motivation >> ========== >> >> It is well known that Xen is not preemptable. On other words, it is >> impossible to switch vCPU contexts while running in hypervisor >> mode. Only one place where scheduling decision can be made and one >> vCPU can be replaced with another is the exit path from the hypervisor >> mode. The one exception are Idle vCPUs, which never leaves the >> hypervisor mode for obvious reasons. >> >> This leads to a number of problems. This list is not comprehensive. It >> lists only things that I or my colleagues encountered personally. >> >> Long-running hypercalls. Due to nature of some hypercalls they can >> execute for arbitrary long time. Mostly those are calls that deal with >> long list of similar actions, like memory pages processing. To deal >> with this issue Xen employs most horrific technique called "hypercall >> continuation". When code that handles hypercall decides that it should >> be preempted, it basically updates the hypercall parameters, and moves >> guest PC one instruction back. This causes guest to re-execute the >> hypercall with altered parameters, which will allow hypervisor to >> continue hypercall execution later. This approach itself have obvious >> problems: code that executes hypercall is responsible for preemption, >> preemption checks are infrequent (because they are costly by >> themselves), hypercall execution state is stored in guest-controlled >> area, we rely on guest's good will to continue the hypercall. All this >> imposes restrictions on which hypercalls can be preempted, when they >> can be preempted and how to write hypercall handlers. Also, it >> requires very accurate coding and already led to at least one >> vulnerability - XSA-318. Some hypercalls can not be preempted at all, >> like the one mentioned in [1]. >> >> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs, >> which are supposed to run when the system is idle. If hypervisor needs >> to execute own tasks that are required to run right now, it have no >> other way than to execute them on current vCPU. But scheduler does not >> know that hypervisor executes hypervisor task and accounts spent time >> to a domain. This can lead to domain starvation. >> >> Also, absence of hypervisor threads leads to absence of high-level >> synchronization primitives like mutexes, conditional variables, >> completions, etc. This leads to two problems: we need to use spinlocks >> everywhere and we have problems when porting device drivers from linux >> kernel. > > You cannot reenter a guest, even to deliver interrupts, if pre-empted at > an arbitrary point in a hypercall. State needs unwinding suitably. > Yes, Julien pointed this to me already. So, looks like hypercall continuations are still needed. > Xen's non-preemptible-ness is designed to specifically force you to not > implement long-running hypercalls which would interfere with timely > interrupt handling in the general case. What if long-running hypercalls are still required? There are other options, like async calls, for example. > Hypervisor/virt properties are different to both a kernel-only-RTOS, and > regular usespace. 
This was why I gave you some specific extra scenarios > to do latency testing with, so you could make a fair comparison of > "extra overhead caused by Xen" separate from "overhead due to > fundamental design constraints of using virt". I can't see any fundamental constraints there. I see how virtualization architecture can influence context switch time: how many actions you need to switch one vCPU to another. I have in mind low level things there: reprogram MMU to use another set of tables, reprogram your interrupt controller, timer, etc. Of course, you can't get latency lower that context switch time. This is the only fundamental constraint I can see. But all other things are debatable. As for latency testing, I'm not interested in absolute times per se. I already determined that time needed to switch vCPU context on my machine is about 9us. It is fine for me. I am interested in a (semi-)guaranteed time of reaction. And Xen is doing quite well in most cases. But there are other cases, when long-lasting hypercalls cause spikes in time of reaction. > Preemption like this will make some benchmarks look better, but it also > introduces the ability to create fundamental problems, like preventing > any interrupt delivery into a VM for seconds of wallclock time while > each vcpu happens to be in a long-running hypercall. > > If you want timely interrupt handling, you either need to partition your > workloads by the long-running-ness of their hypercalls, or not have > long-running hypercalls. ... or do long-running tasks asynchronously. I believe, for most domctls and sysctls there is no need to hold calling vCPU in hypervisor mode at all. > I remain unconvinced that preemption is an sensible fix to the problem > you're trying to solve. Well, this is the purpose of this little experiment. I want to discuss different approaches and to estimate amount of required efforts. By the way, from x86 point of view, how hard to switch vCPU context while it is running in hypervisor mode?
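As an entirely hypothetical shape for the "do long-running tasks asynchronously" idea above: the calling vCPU is released immediately and the work is carried out by a hypervisor-owned context, with the toolstack collecting the result later. None of these helpers or fields exist in Xen; only struct xen_domctl and current are real.

    /* Hypothetical sketch -- not an existing Xen interface. */
    long do_example_domctl_async(struct xen_domctl *op)
    {
        uint64_t token = alloc_async_token(current->domain);    /* hypothetical */

        /* Hand the bulk of the work to a hypervisor thread/tasklet. */
        queue_hypervisor_work(async_domctl_worker, op, token);  /* hypothetical */

        /*
         * Return right away; the toolstack later polls (or receives an
         * event-channel notification) with the token to fetch the result.
         */
        op->u.async_token = token;                               /* hypothetical field */
        return 0;
    }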
Julien Grall writes:

> On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk
> <Volodymyr_Babchuk@epam.com> wrote:
>>
>> Hi Julien,
>>
>> Julien Grall writes:
>>
>>> On 23/02/2021 12:06, Volodymyr Babchuk wrote:
>>>> Hi Julien,
>>>
>>> Hi Volodymyr,
>>>
>>>> Julien Grall writes:
>>>>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>>>>> ... just rescheduling the vCPU. It will also give the opportunity for
>>>>> the guest to handle interrupts.
>>>>>
>>>>> If you don't return to the guest, then risk to get an RCU sched stall
>>>>> on that the vCPU (some hypercalls can take really really long).
>>>>
>>>> Ah yes, you are right. I'd only wish that hypervisor saved context of
>>>> hypercall on its side...
>>>>
>>>> I have example of OP-TEE before my eyes. They have special return code
>>>> "task was interrupted" and they use separate call "continue execution of
>>>> interrupted task", which takes opaque context handle as a parameter.
>>>> With this approach state of interrupted call never leaks to rest of the
>>>> system.
>>>
>>> Feel free to suggest a new approach for the hypercalls.
>>
>> I believe, I suggested it right above. There are some corner cases, that
>> should be addressed, of course.
>
> If we wanted a clean break, then possibly yes. But I meant one that doesn't
> break all the existing users and doesn't put Xen at risk.
>
> I don't believe your approach fulfill it.

Of course, we can't touch any hypercalls that are part of the stable ABI.
But if I got this right, domctls and sysctls are not stable, so one can
change their behavior quite drastically between major releases.

>>>>>> This approach itself have obvious
>>>>>> problems: code that executes hypercall is responsible for preemption,
>>>>>> preemption checks are infrequent (because they are costly by
>>>>>> themselves), hypercall execution state is stored in guest-controlled
>>>>>> area, we rely on guest's good will to continue the hypercall.
>>>>>
>>>>> Why is it a problem to rely on guest's good will? The hypercalls
>>>>> should be preempted at a boundary that is safe to continue.
>>>>
>>>> Yes, and it imposes restrictions on how to write hypercall handler.
>>>> In other words, there are much more places in hypercall handler code
>>>> where it can be preempted than where hypercall continuation can be
>>>> used. For example, you can preempt hypercall that holds a mutex, but of
>>>> course you can't create a continuation point in such place.
>>>
>>> I disagree, you can create continuation point in such place. Although
>>> it will be more complex because you have to make sure you break the
>>> work in a restartable place.
>>
>> Maybe there is some misunderstanding. You can't create hypercall
>> continuation point in a place where you are holding mutex. Because,
>> there is absolutely no guarantee that guest will restart the
>> hypercall.
>
> I don't think we are disagreeing here. My point is you should rarely
> need to hold a mutex for a long period, so you could break your work
> in smaller chunk. In which cases, you can use hypercall continuation.

Let's put it this way: generally, you can hold a mutex for much longer than
you can hold a spinlock, and nothing catastrophic will happen if you are
preempted while holding a mutex. It is better to avoid that, of course.

>> But you can preempt vCPU while holding mutex, because xen owns scheduler
>> and it can guarantee that vCPU will be scheduled eventually to continue
>> the work and release mutex.
>
> The problem is the "eventually". If you are accounting the time spent
> in the hypervisor to the vCPU A, then there is a possibility that it
> has exhausted its time slice. In which case, your vCPU A may be
> sleeping for a while with a mutex held.
>
> If another vCPU B needs the mutex, it will either have to wait
> potentially for a long time or we need to force vCPU A to run on
> borrowed time.

Yes, of course.

>>> I would also like to point out that preemption also have some drawbacks.
>>> With RT in mind, you have to deal with priority inversion (e.g. a
>>> lower priority vCPU held a mutex that is required by an higher
>>> priority).
>>
>> Of course. This is not as simple as "just call scheduler when we want
>> to".
>
> Your e-mail made it sounds like it was easy to add preemption in
> Xen. ;)

I'm sorry for that :) Actually, there is a lot of work to do. It appeared
to me that "current" needs to be reworked, preempt_enable/disable needs to
be reworked, and per-cpu variables also should be reworked. And this is
just to ensure consistency of already existing code. And I am not
mentioning x86 support there...

>>> Outside of RT, you have to be careful where mutex are held. In your
>>> earlier answer, you suggested to held mutex for the memory
>>> allocation. If you do that, then it means a domain A can block
>>> allocation for domain B as it helds the mutex.
>>
>> As long as we do not exit to a EL1 with mutex being held, domain A can't
>> block anything. Of course, we have to deal with priority inversion, but
>> malicious domain can't cause DoS.
>
> It is not really a priority inversion problem outside of RT because
> all the tasks will have the same priority. It is more a time
> accounting problem because each vCPU may have a different number of
> credits.

Speaking of that, RTDS does not use the concept of priority at all, and
neither, of course, does ARINC653.

>>>>> I am really surprised that this is the only changes necessary in
>>>>> Xen. For a first approach, we may want to be conservative when the
>>>>> preemption is happening as I am not convinced that all the places are
>>>>> safe to preempt.
>>>>
>>>> Well, I can't say that I ran extensive tests, but I played with this for
>>>> some time and it seemed quite stable. Of course, I had some problems
>>>> with RTDS...
>>>>
>>>> As I see it, Xen is already supports SMP, so all places where races are
>>>> possible should already be covered by spinlocks or taken into account by
>>>> some other means.
>>>
>>> That's correct for shared resources. I am more worried for any
>>> hypercalls that expected to run more or less continuously (e.g not
>>> taking into account interrupt) on the same pCPU.
>>
>> Are there many such hypercalls? They can disable preemption if they
>> really need to run on the same pCPU. As I understand, they should be
>> relatively fast, because they can't create continuations anyway.
>
> Well, I never tried to make Xen preemptible... My comment is based on
> the fact that the use preempt_{enable, disable}() was mostly done on a
> best effort basis.
>
> The usual suspects are anything using this_cpu() or interacting with
> the per-CPU HW registers.
>
> From a quick look here a few things (only looked at Arm):
>  * map_domain_page() in particular on arm32 because this is using
>    per-CPU page-tables
>  * guest_atomics_* as this uses this_cpu()
>  * virt_to_mfn() in particular the failure path
>  * Incorrect use (or missing) rcu locking. (Hopefully Juergen's
>    recent work in the RCU mitigate the risk)
>
> I can provide guidance, but you will have to go through the code and
> check what's happening.

Thank you for the list. Of course, I will need to go through all of the
code. I have already had a bunch of problems with per_cpu variables...
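
To illustrate the kind of hazard in the list above: the common Xen pattern
of touching a per-CPU variable through this_cpu() silently assumes that the
code cannot be moved to another pCPU in the middle of the access. The
fragment below is only a sketch with made-up names (emulation_count,
account_emulation()); it is not existing Xen code, but it shows why such
sequences would need an explicit preempt_disable()/preempt_enable() pair
once hypervisor code becomes preemptible.

    #include <xen/percpu.h>
    #include <xen/preempt.h>

    /* Hypothetical per-CPU counter, for illustration only. */
    static DEFINE_PER_CPU(unsigned long, emulation_count);

    void account_emulation(void)
    {
        /*
         * Without this pair, a preemption point between reading and
         * writing the per-CPU slot could migrate the vCPU to another
         * pCPU, so the increment would land in (or race with) the
         * wrong pCPU's copy of the variable.
         */
        preempt_disable();
        this_cpu(emulation_count)++;
        preempt_enable();
    }

Auditing Xen for every such implicit assumption is essentially the work
item described above.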
On 24/02/2021 23:58, Volodymyr Babchuk wrote:
> And I am not mentioning x86 support there...
x86 uses per-pCPU stacks, not per-vCPU stacks.
Transcribing from an old thread which happened in private as part of an
XSA discussion, concerning the implications of trying to change this.
~Andrew
-----8<-----
Here is a partial list off the top of my head of the practical problems
you're going to have to solve.
Introduction of new SpectreRSB vulnerable gadgets. I'm really close to
being able to drop RSB stuffing and recover some performance in Xen.
CPL0 entrypoints need updating across schedule. SYSCALL entry would
need to become a stub per vcpu, rather than the current stub per pcpu.
This requires reintroducing a writeable mapping to the TSS (doable) and
a shadow stack switch of active stacks (This corner case is so broken it
looks to be a blocker for CET-SS support in Linux, and is resulting in
some conversation about tweaking Shstk's in future processors).
All per-cpu variables stop working. You'd need to rewrite Xen to use
%gs for TLS which will have churn in the PV logic, and introduce the x86
architectural corner cases of running with an invalid %gs. Xen has been
saved from a large number of privilege escalation vulnerabilities in
common with Linux and Windows by the fact that we don't use %gs, so
anyone trying to do this is going to have to come up with some concrete
way of proving that the corner cases are covered.
Hi Andrew,

Andrew Cooper writes:

> On 24/02/2021 23:58, Volodymyr Babchuk wrote:
>> And I am not mentioning x86 support there...
>
> x86 uses per-pCPU stacks, not per-vCPU stacks.
>
> Transcribing from an old thread which happened in private as part of an
> XSA discussion, concerning the implications of trying to change this.
>
> ~Andrew
>
> -----8<-----
>
> Here is a partial list off the top of my head of the practical problems
> you're going to have to solve.
>
> Introduction of new SpectreRSB vulnerable gadgets. I'm really close to
> being able to drop RSB stuffing and recover some performance in Xen.
>
> CPL0 entrypoints need updating across schedule. SYSCALL entry would
> need to become a stub per vcpu, rather than the current stub per pcpu.
> This requires reintroducing a writeable mapping to the TSS (doable) and
> a shadow stack switch of active stacks (This corner case is so broken it
> looks to be a blocker for CET-SS support in Linux, and is resulting in
> some conversation about tweaking Shstk's in future processors).
>
> All per-cpu variables stop working. You'd need to rewrite Xen to use
> %gs for TLS which will have churn in the PV logic, and introduce the x86
> architectural corner cases of running with an invalid %gs. Xen has been
> saved from a large number of privilege escalation vulnerabilities in
> common with Linux and Windows by the fact that we don't use %gs, so
> anyone trying to do this is going to have to come up with some concrete
> way of proving that the corner cases are covered.

Thank you. This is exactly what I needed. I am not a big specialist in
x86, but from what you said, I can see that there is no easy way to switch
contexts while in hypervisor mode.

Then I want to return to the task domain idea, which you mentioned in the
other thread. If I got it right, it would allow us to:

1. Implement asynchronous hypercalls for cases when there is no reason
   to hold the calling vCPU in the hypervisor for the whole call duration.

2. Improve time accounting, as tasklets can be scheduled to run in this
   task domain.

I skimmed through the ML archives, but didn't find any discussion about it.
As I see it, its implementation would be close to the idle domain
implementation, but a little different.
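
Purely as an illustration of the asynchronous-hypercall shape being
discussed here, a minimal sketch follows. This is not an existing Xen
interface: start_async_op(), do_heavy_work() and async_op_tasklet are
invented names, and while tasklet_init()/tasklet_schedule() do exist in
Xen, their exact signatures should be checked against xen/tasklet.h before
taking this literally.

    #include <xen/errno.h>
    #include <xen/sched.h>
    #include <xen/tasklet.h>

    static struct tasklet async_op_tasklet;

    /* The long-running part runs here, off the calling vCPU's context. */
    static void do_heavy_work(void *data)
    {
        struct domain *d = data;

        /* e.g. walk a long list of pages belonging to domain d ... */
        (void)d;
    }

    /* Hypothetical entry point of an "async" domctl-style operation. */
    static long start_async_op(struct domain *d)
    {
        tasklet_init(&async_op_tasklet, do_heavy_work, d);
        tasklet_schedule(&async_op_tasklet);

        /* Report "queued, not completed" back to the toolstack. */
        return -EINPROGRESS;
    }

A real interface would of course need per-operation state, a way for the
caller to poll or be notified about completion, and a decision about which
vCPU or domain the deferred work is accounted to, which is exactly where
the proposed task domain would come in.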
> On Feb 24, 2021, at 11:37 PM, Volodymyr Babchuk <volodymyr_babchuk@epam.com> wrote:
>
>> Hypervisor/virt properties are different to both a kernel-only-RTOS, and
>> regular userspace. This was why I gave you some specific extra scenarios
>> to do latency testing with, so you could make a fair comparison of
>> "extra overhead caused by Xen" separate from "overhead due to
>> fundamental design constraints of using virt".
>
> I can't see any fundamental constraints there. I see how virtualization
> architecture can influence context switch time: how many actions you
> need to switch one vCPU to another. I have in mind low level things
> there: reprogram MMU to use another set of tables, reprogram your
> interrupt controller, timer, etc. Of course, you can't get latency lower
> than context switch time. This is the only fundamental constraint I can
> see.

Well, suppose you have two domains, A and B, both of which control hardware
with hard real-time requirements. And suppose that A has just started
handling a latency-sensitive interrupt when a latency-sensitive interrupt
comes in for B. You might well preempt A and let B run for a full
timeslice, causing A's interrupt handler to be delayed by a significant
amount. Preventing that sort of thing would be a much more tricky issue to
get right.

>> If you want timely interrupt handling, you either need to partition your
>> workloads by the long-running-ness of their hypercalls, or not have
>> long-running hypercalls.
>
> ... or do long-running tasks asynchronously. I believe, for most
> domctls and sysctls there is no need to hold calling vCPU in hypervisor
> mode at all.
>
>> I remain unconvinced that preemption is a sensible fix to the problem
>> you're trying to solve.
>
> Well, this is the purpose of this little experiment. I want to discuss
> different approaches and to estimate amount of required efforts. By the
> way, from x86 point of view, how hard to switch vCPU context while it is
> running in hypervisor mode?

I'm not necessarily opposed to introducing preemption, but the more we ask
about things, the more complex things begin to look. The idea of
introducing an async framework to deal with long-running hypercalls is a
huge engineering and design effort, not just for Xen, but for all future
callers of the interface.

The claim in the cover letter was that "[s]ome hypercalls can not be
preempted at all". I looked at the reference, and it looks like you're
referring to this:

"I brooded over ways to make [alloc_domheap_pages()] preemptible. But it
is a) located deep in call chain and b) used not only by hypercalls. So I
can't see an easy way to make it preemptible."

Let's assume for the sake of argument that preventing delays due to
alloc_domheap_pages() would require significant rearchitecting of the code.
And let's even assume that there are 2-3 other such knotty issues making
for unacceptably long hypercalls. Will identifying and tracking down those
issues really be more effort than introducing preemption, introducing async
operations, and all the other things we've been talking about?

One thing that might be interesting is to add some sort of metrics
(disabled in Kconfig by default); e.g.:

1. On entry to a hypercall, take a timestamp.

2. On every hypercall_preempt() call, take another timestamp and see how
   much time has passed without a preempt, and reset the timestamp count;
   also do a check on exit of the hypercall.

We could start by trying to do stats and figuring out which hypercalls go
the longest without preemption, as a way to guide the optimization efforts.
Then, as we get that number down, we could add ASSERT()s that the time is
never longer than a certain amount, and add runs like that to osstest to
make sure there are no regressions introduced.

I agree that hypercall continuations are complex; and you're right that the
fact that the hypercall continuation may never be called limits where
preemption can happen. But making the entire hypervisor preemption-friendly
is also quite complex in its own way; it's not immediately obvious to me
from this thread that hypervisor preemption is less complex.

 -George
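
A minimal sketch of the metric described above, just to make the idea
concrete. This is not existing Xen code: the hook names
hypercall_timing_enter() and hypercall_timing_checkpoint() are invented,
and in a real patch the checkpoint would be wired into the existing
preemption check (hypercall_preempt_check() in today's tree) and into the
hypercall exit path, presumably behind the suggested Kconfig option.

    #include <xen/percpu.h>
    #include <xen/time.h>

    /* Start of the current "no preemption happened yet" segment. */
    static DEFINE_PER_CPU(s_time_t, hcall_segment_start);
    /* Worst observed gap between two preemption opportunities. */
    static DEFINE_PER_CPU(s_time_t, hcall_worst_gap);

    /* Call on hypercall entry. */
    void hypercall_timing_enter(void)
    {
        this_cpu(hcall_segment_start) = NOW();
    }

    /* Call from the preemption check and again on hypercall exit. */
    void hypercall_timing_checkpoint(void)
    {
        s_time_t gap = NOW() - this_cpu(hcall_segment_start);

        if ( gap > this_cpu(hcall_worst_gap) )
            this_cpu(hcall_worst_gap) = gap;

        /* Reset the segment, as suggested in step 2 above. */
        this_cpu(hcall_segment_start) = NOW();
    }

Per-pCPU state is enough here only because a hypercall currently cannot be
preempted and migrated mid-flight; with full preemption the counters would
have to live in struct vcpu instead. Once the worst-case numbers come down,
the same checkpoint is where an ASSERT() against a fixed budget could live.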
Hi,

Volodymyr Babchuk writes:

> Hi Andrew,
>
> Andrew Cooper writes:
>
>> On 24/02/2021 23:58, Volodymyr Babchuk wrote:
>>> And I am not mentioning x86 support there...
>>
>> x86 uses per-pCPU stacks, not per-vCPU stacks.
>>
>> Transcribing from an old thread which happened in private as part of an
>> XSA discussion, concerning the implications of trying to change this.
>>
>> ~Andrew
>>
>> -----8<-----
>>
>> Here is a partial list off the top of my head of the practical problems
>> you're going to have to solve.
>>
>> Introduction of new SpectreRSB vulnerable gadgets. I'm really close to
>> being able to drop RSB stuffing and recover some performance in Xen.
>>
>> CPL0 entrypoints need updating across schedule. SYSCALL entry would
>> need to become a stub per vcpu, rather than the current stub per pcpu.
>> This requires reintroducing a writeable mapping to the TSS (doable) and
>> a shadow stack switch of active stacks (This corner case is so broken it
>> looks to be a blocker for CET-SS support in Linux, and is resulting in
>> some conversation about tweaking Shstk's in future processors).
>>
>> All per-cpu variables stop working. You'd need to rewrite Xen to use
>> %gs for TLS which will have churn in the PV logic, and introduce the x86
>> architectural corner cases of running with an invalid %gs. Xen has been
>> saved from a large number of privilege escalation vulnerabilities in
>> common with Linux and Windows by the fact that we don't use %gs, so
>> anyone trying to do this is going to have to come up with some concrete
>> way of proving that the corner cases are covered.
>
> Thank you. This is exactly what I needed. I am not a big specialist in
> x86, but from what you said, I can see that there is no easy way to switch
> contexts while in hypervisor mode.
>
> Then I want to return to the task domain idea, which you mentioned in the
> other thread. If I got it right, it would allow us to:
>
> 1. Implement asynchronous hypercalls for cases when there is no reason
>    to hold the calling vCPU in the hypervisor for the whole call duration.

Okay, I was too overexcited there. I mean, surely it is possible to
implement async hypercalls, but there is no immediate profit in this: such
a hypercall can't be preempted anyway. On an SMP system you can offload the
hypercall to another core, but that's basically all.

> I skimmed through the ML archives, but didn't find any discussion about it.

Maybe you can give me a hint on how to find it?

> As I see it, its implementation would be close to the idle domain
> implementation, but a little different.