Message ID | 158402065414.753.15785539969715690913.stgit@Palanthas (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | xen: credit2: fix vcpu starvation due to too few credits | expand |
On 12/03/2020 13:44, Dario Faggioli wrote: > There have been report of stalls of guest vCPUs, when Credit2 was used. > It seemed like these vCPUs were not getting scheduled for very long > time, even under light load conditions (e.g., during dom0 boot). > > Investigations led to the discovery that --although rarely-- it can > happen that a vCPU manages to run for very long timeslices. In Credit2, > this means that, when runtime accounting happens, the vCPU will lose a > large quantity of credits. This in turn may lead to the vCPU having less > credits than the idle vCPUs (-2^30). At this point, the scheduler will > pick the idle vCPU, instead of the ready to run vCPU, for a few > "epochs", which often times is enough for the guest kernel to think the > vCPU is not responding and crashing. > > An example of this situation is shown here. In fact, we can see d0v1 > sitting in the runqueue while all the CPUs are idle, as it has > -1254238270 credits, which is smaller than -2^30 = −1073741824: > > (XEN) Runqueue 0: > (XEN) ncpus = 28 > (XEN) cpus = 0-27 > (XEN) max_weight = 256 > (XEN) pick_bias = 22 > (XEN) instload = 1 > (XEN) aveload = 293391 (~111%) > (XEN) idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff > (XEN) tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000 > (XEN) fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff > [...] > (XEN) Runqueue 0: > (XEN) CPU[00] runq=0, sibling=00,..., core=00,... > (XEN) CPU[01] runq=0, sibling=00,..., core=00,... > [...] > (XEN) CPU[26] runq=0, sibling=00,..., core=00,... > (XEN) CPU[27] runq=0, sibling=00,..., core=00,... > (XEN) RUNQ: > (XEN) 0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%) > > We certainly don't want, under any circumstance, this to happen. > Therefore, let's use INT_MIN for the credits of the idle vCPU, in > Credit2, to be sure that no vCPU can get below that value. > > NOTE: investigations have been done about _how_ it is possible for a > vCPU to execute for so long that its credits becomes so low. While still > not completely clear, there are evidence that: > - it only happens very rarely > - it appears to be both machine and workload specific > - it does not look to be a Credit2 (e.g., as it happens when running > with Credit1 as well) issue, or a scheduler issue On what basis? Everything reported to xen-devel appears to suggests it is a credit2 problem. It doesn't manifest on versions of Xen before credit2 became the default, and switching back to credit1 appears to mitigate the problem. Certainly as far as XenServer is concerned, we haven't seen symptoms like this in a decade of running credit1. ~Andrew
> On Mar 12, 2020, at 1:55 PM, Andrew Cooper <Andrew.Cooper3@citrix.com> wrote: > > On 12/03/2020 13:44, Dario Faggioli wrote: >> There have been report of stalls of guest vCPUs, when Credit2 was used. >> It seemed like these vCPUs were not getting scheduled for very long >> time, even under light load conditions (e.g., during dom0 boot). >> >> Investigations led to the discovery that --although rarely-- it can >> happen that a vCPU manages to run for very long timeslices. In Credit2, >> this means that, when runtime accounting happens, the vCPU will lose a >> large quantity of credits. This in turn may lead to the vCPU having less >> credits than the idle vCPUs (-2^30). At this point, the scheduler will >> pick the idle vCPU, instead of the ready to run vCPU, for a few >> "epochs", which often times is enough for the guest kernel to think the >> vCPU is not responding and crashing. >> >> An example of this situation is shown here. In fact, we can see d0v1 >> sitting in the runqueue while all the CPUs are idle, as it has >> -1254238270 credits, which is smaller than -2^30 = −1073741824: >> >> (XEN) Runqueue 0: >> (XEN) ncpus = 28 >> (XEN) cpus = 0-27 >> (XEN) max_weight = 256 >> (XEN) pick_bias = 22 >> (XEN) instload = 1 >> (XEN) aveload = 293391 (~111%) >> (XEN) idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff >> (XEN) tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000 >> (XEN) fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff >> [...] >> (XEN) Runqueue 0: >> (XEN) CPU[00] runq=0, sibling=00,..., core=00,... >> (XEN) CPU[01] runq=0, sibling=00,..., core=00,... >> [...] >> (XEN) CPU[26] runq=0, sibling=00,..., core=00,... >> (XEN) CPU[27] runq=0, sibling=00,..., core=00,... >> (XEN) RUNQ: >> (XEN) 0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%) >> >> We certainly don't want, under any circumstance, this to happen. >> Therefore, let's use INT_MIN for the credits of the idle vCPU, in >> Credit2, to be sure that no vCPU can get below that value. >> >> NOTE: investigations have been done about _how_ it is possible for a >> vCPU to execute for so long that its credits becomes so low. While still >> not completely clear, there are evidence that: >> - it only happens very rarely >> - it appears to be both machine and workload specific >> - it does not look to be a Credit2 (e.g., as it happens when running >> with Credit1 as well) issue, or a scheduler issue > > On what basis? > > Everything reported to xen-devel appears to suggests it is a credit2 > problem. It doesn't manifest on versions of Xen before credit2 became > the default, and switching back to credit1 appears to mitigate the problem. > > Certainly as far as XenServer is concerned, we haven't seen symptoms > like this in a decade of running credit1. One reason could be because the symptoms are different. On credit1, credits and “priority” are separated; it’s not possible in credit1 for a vcpu to end up with a lower priority than the idle domain, and no matter how low the credits become, a vcpu will always end up with some “peers” at the same priority level, meaning it always has a chance at some cpu. What Dario is saying (if I understand him correctly) is that the *proximate* cause (allowing a vcpu to have an effective priority of less than idle) is certainly credit2-only; but the *deeper* cause (vcpus racking up massive amounts of negative credit) is not. -George
> On Mar 12, 2020, at 1:44 PM, Dario Faggioli <dfaggioli@suse.com> wrote: > > There have been report of stalls of guest vCPUs, when Credit2 was used. > It seemed like these vCPUs were not getting scheduled for very long > time, even under light load conditions (e.g., during dom0 boot). > > Investigations led to the discovery that --although rarely-- it can > happen that a vCPU manages to run for very long timeslices. In Credit2, > this means that, when runtime accounting happens, the vCPU will lose a > large quantity of credits. This in turn may lead to the vCPU having less > credits than the idle vCPUs (-2^30). At this point, the scheduler will > pick the idle vCPU, instead of the ready to run vCPU, for a few > "epochs", which often times is enough for the guest kernel to think the > vCPU is not responding and crashing. > > An example of this situation is shown here. In fact, we can see d0v1 > sitting in the runqueue while all the CPUs are idle, as it has > -1254238270 credits, which is smaller than -2^30 = −1073741824: > > (XEN) Runqueue 0: > (XEN) ncpus = 28 > (XEN) cpus = 0-27 > (XEN) max_weight = 256 > (XEN) pick_bias = 22 > (XEN) instload = 1 > (XEN) aveload = 293391 (~111%) > (XEN) idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff > (XEN) tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000 > (XEN) fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff > [...] > (XEN) Runqueue 0: > (XEN) CPU[00] runq=0, sibling=00,..., core=00,... > (XEN) CPU[01] runq=0, sibling=00,..., core=00,... > [...] > (XEN) CPU[26] runq=0, sibling=00,..., core=00,... > (XEN) CPU[27] runq=0, sibling=00,..., core=00,... > (XEN) RUNQ: > (XEN) 0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%) > > We certainly don't want, under any circumstance, this to happen. > Therefore, let's use INT_MIN for the credits of the idle vCPU, in > Credit2, to be sure that no vCPU can get below that value. > > NOTE: investigations have been done about _how_ it is possible for a > vCPU to execute for so long that its credits becomes so low. While still > not completely clear, there are evidence that: > - it only happens very rarely > - it appears to be both machine and workload specific > - it does not look to be a Credit2 (e.g., as it happens when running > with Credit1 as well) issue, or a scheduler issue > > This patch makes Credit2 more robust to events like this, whatever > the cause is, and should hence be backported (as far as possible). > > Signed-off-by: Dario Faggioli <dfaggioli@suse.com> > Reported-by: Glen <glenbarney@gmail.com> > Reported-by: Tomas Mozes <hydrapolic@gmail.com> Nit: The reported-by’s should be before the SoB (i.e., tags roughly in time order). I think this is a good change to make the algorithm more robust, so: Acked-by: George Dunlap <george.dunlap@citrix.com> But it seems like allowing a guest to rack up -2^63 credits is still a bad thing, and it would be nice to have some other backstop / reset mechanism. But I guess to have an effective mechanism of that sort we’d want to understand how it happened in the first place.
On Thu, 2020-03-12 at 13:55 +0000, Andrew Cooper wrote: > On 12/03/2020 13:44, Dario Faggioli wrote: > > > > NOTE: investigations have been done about _how_ it is possible for > > a > > vCPU to execute for so long that its credits becomes so low. While > > still > > not completely clear, there are evidence that: > > - it only happens very rarely > > - it appears to be both machine and workload specific > > - it does not look to be a Credit2 (e.g., as it happens when > > running > > with Credit1 as well) issue, or a scheduler issue > > On what basis? > On the basis that I have traced execution times of vCPUs, when running on both Credit1 and Credit2, on a system where I can sort-of reproduce the issue, and I see them being able to execute for ginormous timeslices, with both scheduler. That's what this paragraph above, plus similar ones in the cover letter ,were supposed to mean. If it is not clear enough, sorry. > Everything reported to xen-devel appears to suggests it is a credit2 > problem. It doesn't manifest on versions of Xen before credit2 > became > the default, and switching back to credit1 appears to mitigate the > problem. > Yes, because even if the issue is there in both cases, when you use Credit2 it manifests as vCPUs being starved (because of the "they end up having less priority than the idle vCPU" thing). When you use Credit1, this does not happen, because time accounting and prioritization are done differently there. Switching to Credit1, at least on the box where I could reproduce the problem, did not make the issue that the vCPUs run for way too long disappear. It's the symptoms that this issue cause that are different for the two schedulers. > Certainly as far as XenServer is concerned, we haven't seen symptoms > like this in a decade of running credit1. > I am only able to reproduce this issue on one box, and not really 100% deterministically. I have never noticed anything like this in the past few years, on Credit2. At least, nothing as severe as this (or we would have noticed, which is what, in fact, has happened!). With Credit1, even if the issue is there, you won't notice it in this form, se we really can't know. Regards
On Thu, 2020-03-12 at 14:40 +0000, George Dunlap wrote: > > On Mar 12, 2020, at 1:55 PM, Andrew Cooper < > > Andrew.Cooper3@citrix.com> wrote: > > > > Certainly as far as XenServer is concerned, we haven't seen > > symptoms > > like this in a decade of running credit1. > > One reason could be because the symptoms are different. On credit1, > credits and “priority” are separated; it’s not possible in credit1 > for a vcpu to end up with a lower priority than the idle domain, and > no matter how low the credits become, a vcpu will always end up with > some “peers” at the same priority level, meaning it always has a > chance at some cpu. > Indeed. Under many respects, Credit1 is basically a slightly more powerful variant of Round-Robin. The actual time vCPUs spend executing influences the scheduling decisions a lot less than in Credit2 (well, it influences the scheduling decisions a lot less than in any scheduler of any OS or hypervisor conceived and implemented in the last decade, or maybe even more). As a matter of fact, this property happens to have shield it from the nastiest effects of the issue at hand. > What Dario is saying (if I understand him correctly) is that the > *proximate* cause (allowing a vcpu to have an effective priority of > less than idle) is certainly credit2-only; but the *deeper* cause > (vcpus racking up massive amounts of negative credit) is not. > Yes, that's exactly what I meant. :-) Regards
On 12.03.2020 14:44, Dario Faggioli wrote: > --- a/xen/common/sched/credit2.c > +++ b/xen/common/sched/credit2.c > @@ -234,7 +234,7 @@ > * units does not consume credits, and it must be lower than whatever > * amount of credit 'regular' unit would end up with. > */ > -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) > +#define CSCHED2_IDLE_CREDIT INT_MIN The title saying "lower than", is "equal" actually fine? Looking at e.g. runq_insert() I'm getting the impression it's not. Looking at t2c_update() I'm also getting the impression that there's UB when the subtraction underflows. After all, if -1 << 30 wasn't small enough a value, I don't see why -1 << 31 would be. Jan
On 12.03.20 16:26, Jan Beulich wrote: > On 12.03.2020 14:44, Dario Faggioli wrote: >> --- a/xen/common/sched/credit2.c >> +++ b/xen/common/sched/credit2.c >> @@ -234,7 +234,7 @@ >> * units does not consume credits, and it must be lower than whatever >> * amount of credit 'regular' unit would end up with. >> */ >> -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) >> +#define CSCHED2_IDLE_CREDIT INT_MIN > > The title saying "lower than", is "equal" actually fine? Looking > at e.g. runq_insert() I'm getting the impression it's not. > > Looking at t2c_update() I'm also getting the impression that > there's UB when the subtraction underflows. After all, if > -1 << 30 wasn't small enough a value, I don't see why -1 << 31 > would be. Yes, I'd limit svc->credit in t2c_update() to CSCHED2_IDLE_CREDIT+1 Juergen
On Thu, 2020-03-12 at 16:26 +0100, Jan Beulich wrote: > On 12.03.2020 14:44, Dario Faggioli wrote: > > --- a/xen/common/sched/credit2.c > > +++ b/xen/common/sched/credit2.c > > @@ -234,7 +234,7 @@ > > * units does not consume credits, and it must be lower than > > whatever > > * amount of credit 'regular' unit would end up with. > > */ > > -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) > > +#define CSCHED2_IDLE_CREDIT INT_MIN > > The title saying "lower than", is "equal" actually fine? Looking > at e.g. runq_insert() I'm getting the impression it's not. > In Credit2, we don't have the idle vCPUs in the runqueue. So we will never compare non-idle with idle while inserting. But this is a good point, in general, and I think we may need to turn the ">" in a ">=" in runq_candidate(). > Looking at t2c_update() I'm also getting the impression that > there's UB when the subtraction underflows. After all, if > -1 << 30 wasn't small enough a value, I don't see why -1 << 31 > would be. > Mmm... not sure I am getting. Are you suggesting we should apply a cap to val? If yes, this looks like an issue independent from what the value of CSCHED2_IDLE_CREDIT is, but yeah, we can do that. Or am I missing something? Regards
On 12.03.2020 17:11, Dario Faggioli wrote: > On Thu, 2020-03-12 at 16:26 +0100, Jan Beulich wrote: >> Looking at t2c_update() I'm also getting the impression that >> there's UB when the subtraction underflows. After all, if >> -1 << 30 wasn't small enough a value, I don't see why -1 << 31 >> would be. >> > Mmm... not sure I am getting. Are you suggesting we should apply a cap > to val? > > If yes, this looks like an issue independent from what the value of > CSCHED2_IDLE_CREDIT is, but yeah, we can do that. Or am I missing > something? Well, at the very least UB needs to be avoided, i.e. we shouldn't subtract when the result is not representable in "credit"'s type. But even beyond that there may be a need for the subtraction to actually act as if it was saturating at INT_MIN. (Similarly additions may need to saturate at INT_MAX, if there's any possibility for those values to grow large.) Jan
On Thu, 2020-03-12 at 17:00 +0100, Jürgen Groß wrote: > On 12.03.20 16:26, Jan Beulich wrote: > > On 12.03.2020 14:44, Dario Faggioli wrote: > > > --- a/xen/common/sched/credit2.c > > > +++ b/xen/common/sched/credit2.c > > > @@ -234,7 +234,7 @@ > > > * units does not consume credits, and it must be lower than > > > whatever > > > * amount of credit 'regular' unit would end up with. > > > */ > > > -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) > > > +#define CSCHED2_IDLE_CREDIT INT_MIN > > > > The title saying "lower than", is "equal" actually fine? Looking > > at e.g. runq_insert() I'm getting the impression it's not. > > > > Looking at t2c_update() I'm also getting the impression that > > there's UB when the subtraction underflows. After all, if > > -1 << 30 wasn't small enough a value, I don't see why -1 << 31 > > would be. > > Yes, I'd limit svc->credit in t2c_update() to CSCHED2_IDLE_CREDIT+1 > Right, after having spoken with George as well, I guess I'll go for something like that (by imposing some boundaries to `val`, in that function, I think). Thanks and Regards
On Thu, 2020-03-12 at 14:45 +0000, George Dunlap wrote: > > On Mar 12, 2020, at 1:44 PM, Dario Faggioli <dfaggioli@suse.com> > > wrote: > > > > This patch makes Credit2 more robust to events like this, whatever > > the cause is, and should hence be backported (as far as possible). > > > > Signed-off-by: Dario Faggioli <dfaggioli@suse.com> > > Reported-by: Glen <glenbarney@gmail.com> > > Reported-by: Tomas Mozes <hydrapolic@gmail.com> > > Nit: The reported-by’s should be before the SoB (i.e., tags roughly > in time order). > Ah, right! :-( > I think this is a good change to make the algorithm more robust, so: > > Acked-by: George Dunlap <george.dunlap@citrix.com> > > But it seems like allowing a guest to rack up -2^63 credits is still > a bad thing, and it would be nice to have some other backstop / reset > mechanism. > I agree. FWIW, this is way it took me a while to get to the bottom of this. I was assuming it was *entirely* a Credit2 specific issue (caused by, e.g., something like I found and fixed with patch 2). When I noticed that things were not exactly like that, I also realized that we at least need to prevent --under any circumstance-- that idle vCPUs are preferred over "regular" vCPUs, and even if I did consider approaches like "compacting" the credits dynamic, I went straight for the INT_MIN approach. Considering how this thread is going, I guess we should actually push this further, and limit the "credit swing". Thanks and Regards
diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c index c7241944a8..5c0ab9cd05 100644 --- a/xen/common/sched/credit2.c +++ b/xen/common/sched/credit2.c @@ -234,7 +234,7 @@ * units does not consume credits, and it must be lower than whatever * amount of credit 'regular' unit would end up with. */ -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) +#define CSCHED2_IDLE_CREDIT INT_MIN /* * Carryover: How much "extra" credit may be carried over after * a reset.
There have been report of stalls of guest vCPUs, when Credit2 was used. It seemed like these vCPUs were not getting scheduled for very long time, even under light load conditions (e.g., during dom0 boot). Investigations led to the discovery that --although rarely-- it can happen that a vCPU manages to run for very long timeslices. In Credit2, this means that, when runtime accounting happens, the vCPU will lose a large quantity of credits. This in turn may lead to the vCPU having less credits than the idle vCPUs (-2^30). At this point, the scheduler will pick the idle vCPU, instead of the ready to run vCPU, for a few "epochs", which often times is enough for the guest kernel to think the vCPU is not responding and crashing. An example of this situation is shown here. In fact, we can see d0v1 sitting in the runqueue while all the CPUs are idle, as it has -1254238270 credits, which is smaller than -2^30 = −1073741824: (XEN) Runqueue 0: (XEN) ncpus = 28 (XEN) cpus = 0-27 (XEN) max_weight = 256 (XEN) pick_bias = 22 (XEN) instload = 1 (XEN) aveload = 293391 (~111%) (XEN) idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff (XEN) tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000 (XEN) fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff [...] (XEN) Runqueue 0: (XEN) CPU[00] runq=0, sibling=00,..., core=00,... (XEN) CPU[01] runq=0, sibling=00,..., core=00,... [...] (XEN) CPU[26] runq=0, sibling=00,..., core=00,... (XEN) CPU[27] runq=0, sibling=00,..., core=00,... (XEN) RUNQ: (XEN) 0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%) We certainly don't want, under any circumstance, this to happen. Therefore, let's use INT_MIN for the credits of the idle vCPU, in Credit2, to be sure that no vCPU can get below that value. NOTE: investigations have been done about _how_ it is possible for a vCPU to execute for so long that its credits becomes so low. While still not completely clear, there are evidence that: - it only happens very rarely - it appears to be both machine and workload specific - it does not look to be a Credit2 (e.g., as it happens when running with Credit1 as well) issue, or a scheduler issue This patch makes Credit2 more robust to events like this, whatever the cause is, and should hence be backported (as far as possible). Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Reported-by: Glen <glenbarney@gmail.com> Reported-by: Tomas Mozes <hydrapolic@gmail.com> --- Cc: George Dunlap <george.dunlap@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Charles Arnold <carnold@suse.com> Cc: Sarah Newman <srn@prgmr.com> --- I will provide the backports myself, at least for 4.13 and 4.12.x (and feel free to ask for more). --- For Sarah, looking back at the various threads, I am not quite sure whether you also experienced the issue and reported it. If yes, I'm happy to add a "Reported-by:" line about you too (or, if this is fine to go in, for this to be done while committing, if possible). --- xen/common/sched/credit2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)