
[0/2] xen: credit2: fix vcpu starvation due to too few credits

Message ID 158402056376.753.7091379488590272336.stgit@Palanthas (mailing list archive)
Series xen: credit2: fix vcpu starvation due to too few credits

Message

Dario Faggioli March 12, 2020, 1:44 p.m. UTC
Hello everyone,

There have been reports of a Credit2 issue due to which vCPUs were
being starved, to the point that the guest kernel would complain or
even crash.

See the following xen-users and xen-devel threads:
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
https://lists.xenproject.org/archives/html/xen-devel/2020-02/msg01158.html

I did some investigation and figured out that the vCPUs in question
are not scheduled for long time intervals because they somehow end up
being given an amount of credit which is less than the credit the idle
vCPU has.

An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = −1073741824:

    (XEN) Runqueue 0:
    (XEN)   ncpus              = 28
    (XEN)   cpus               = 0-27
    (XEN)   max_weight         = 256
    (XEN)   pick_bias          = 22
    (XEN)   instload           = 1
    (XEN)   aveload            = 293391 (~111%)
    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    [...]
    (XEN) Runqueue 0:
    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
    [...]
    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
    (XEN) RUNQ:
    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)

This happens because --although very rarely-- vCPUs are allowed to
execute for much longer than the scheduler would want them to.

For example, I have a trace showing that csched2_schedule() is invoked at
t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
fire at t=57979485083ns, i.e., 8738928ns in the future. That's because the
credit of snext is exactly those 8738928ns. Then, what I see is that the
next call to burn_credits(), coming from csched2_schedule() for the same
vCPU, happens at t=60083283617ns. That is *a lot* (2103798534ns) later than
we expected and asked for. Of course, that also means that delta is
2112537462ns, and therefore credits will sink to -2103798534!
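
To make the arithmetic above concrete, here is a tiny standalone sketch
of that kind of delta-based accounting (illustrative only, it is not the
actual burn_credits() code; the numbers are the ones from the trace, and
all variable names are made up):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        /* The vCPU is charged for the whole interval since it was last
         * accounted, so if the next accounting happens ~2.1s late, its
         * credit sinks by the full ~2.1s. */
        int64_t credit = 8738928;      /* ns of credit snext had left      */
        int64_t start  = 57970746155;  /* when csched2_schedule() ran (ns) */
        int64_t now    = 60083283617;  /* when the next accounting ran     */
        int64_t delta  = now - start;  /* 2112537462ns, far more than asked */

        credit -= delta;

        printf("delta  = %" PRId64 " ns\n", delta);
        printf("credit = %" PRId64 "\n", credit); /* -2103798534, below -2^30 */
        return 0;
    }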

Also, to the best of my current knowledge, this does not look
Credit2 related, as I've observed it when running with Credit1 as well.
I personally don't think it is scheduling related in general, but
I need to do more investigation to be sure about that (and/or to figure
out what the real root cause is).

The reason why Credit2 is affected much more than Credit1 is how
time accounting is done. Basically, time accounting in Credit1 is very
rudimentary, which is a very bad thing IMO, but it is also what
prevented this issue from causing severe stalls there.

One more thing is that Credit2 gives -2^30 credits to the idle vCPU, which
was considered to be low enough, and normally it is. But it's not a robust
choice, should an issue like the one we're discussing occur; and that is
exactly what is happening. :-) Therefore, I think we should lower the
credits of the idle vCPU to the minimum possible value, so that even under
unusual, weird or buggy situations like this one, we will never pick
idle instead of an actual vCPU that is ready to run.

This is what is done in the first patch of this series. It is a
robustness improvement and a fix (or at least the best way we can deal
with the issue within the scheduler) for the problem at hand. It therefore
should be backported.
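
As a toy illustration of why the minimum possible value is the more
robust sentinel, here is a standalone sketch (made-up constant names,
not the actual patch; the starved vCPU's credit is taken from the
runqueue dump above):

    #include <stdio.h>
    #include <limits.h>

    #define IDLE_CREDIT_OLD  (-(1 << 30))  /* roughly today's idle credit */
    #define IDLE_CREDIT_NEW  INT_MIN       /* the more robust sentinel    */

    int main(void)
    {
        int starved_credit = -1254238270;  /* d0v1 from the dump above */

        /* With -2^30 as the idle "credit", the starved vCPU loses the
         * comparison and idle gets picked; with INT_MIN, an actual vCPU
         * always wins, no matter how negative its credit has become. */
        printf("old sentinel: pick %s\n",
               starved_credit > IDLE_CREDIT_OLD ? "vCPU" : "idle");
        printf("new sentinel: pick %s\n",
               starved_credit > IDLE_CREDIT_NEW ? "vCPU" : "idle");
        return 0;
    }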

While looking into this, I also found out that there is an actual
bug in the Credit2 code. It is something I introduced myself with commit
5e4b4199667b9 ("xen: credit2: only reset credit on reset condition").
In fact, while it was and still is a good idea to avoid resetting
credits too often, the implementation of this was just wrong.

The fix for this bug is contained in patch 2, and it should also be
backported.

Note that patch 2 alone already mitigates the stall/starvation
issue quite substantially. Nevertheless, the proper fix for the issue
itself is making Credit2 more robust against similar problems, as done in
patch 1, while this other bug just happens to be something which
interacts with the symptoms.

This is to say that, although both patches will be backported, as both are
actual bugfixes, if there is the need to apply something "in an emergency"
to fix the starvation problem, applying only patch 1 is enough.

Thanks and Regards
---
Dario Faggioli (2):
      xen: credit2: avoid vCPUs to ever reach lower credits than idle
      xen: credit2: fix credit reset happening too few times

 tools/xentrace/formats     |    2 +-
 tools/xentrace/xenalyze.c  |    8 +++-----
 xen/common/sched/credit2.c |   32 ++++++++++++++------------------
 3 files changed, 18 insertions(+), 24 deletions(-)
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

Comments

Roger Pau Monné March 12, 2020, 3:08 p.m. UTC | #1
Thanks for looking into this, seems like a specially tricky issue to
tackle!

On Thu, Mar 12, 2020 at 02:44:07PM +0100, Dario Faggioli wrote:
[...]
> For example, I have a trace showing that csched2_schedule() is invoked at
> t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
> fire at t=57979485083ns, i.e., 8738928ns in future. That's because credit
> of snext is exactly that 8738928ns. Then, what I see is that the next
> call to burn_credits(), coming from csched2_schedule() for the same vCPU
> happens at t=60083283617ns. That is *a lot* (2103798534ns) later than
> when we expected and asked. Of course, that also means that delta is
> 2112537462ns, and therefore credits will sink to -2103798534!

Which timer does this hardware use? DYK if there's some relation
between the timer hardware used and the issue?

Thanks, Roger.
Jürgen Groß March 12, 2020, 3:51 p.m. UTC | #2
On 12.03.20 14:44, Dario Faggioli wrote:
> Hello everyone,
> 
> There have been reports of a Credit2 issue due to which vCPUs where
> being starved, to the point that guest kernel would complain or even
> crash.
> 
> See the following xen-users and xen-devel threads:
> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
> https://lists.xenproject.org/archives/html/xen-devel/2020-02/msg01158.html
> 
> I did some investigations, and figured out that the vCPUs in question
> are not scheduled for long time intervals because they somehow manage to
> be given an amount of credits which is less than the credit the idle
> vCPU has.
> 
> An example of this situation is shown here. In fact, we can see d0v1
> sitting in the runqueue while all the CPUs are idle, as it has
> -1254238270 credits, which is smaller than -2^30 = −1073741824:
> 
>      (XEN) Runqueue 0:
>      (XEN)   ncpus              = 28
>      (XEN)   cpus               = 0-27
>      (XEN)   max_weight         = 256
>      (XEN)   pick_bias          = 22
>      (XEN)   instload           = 1
>      (XEN)   aveload            = 293391 (~111%)
>      (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>      (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
>      (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>      [...]
>      (XEN) Runqueue 0:
>      (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
>      (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
>      [...]
>      (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
>      (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
>      (XEN) RUNQ:
>      (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)
> 
> This happens bacause --although very rarely-- vCPUs are allowed to
> execute for much more than the scheduler would want them to.
> 
> For example, I have a trace showing that csched2_schedule() is invoked at
> t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
> fire at t=57979485083ns, i.e., 8738928ns in future. That's because credit
> of snext is exactly that 8738928ns. Then, what I see is that the next
> call to burn_credits(), coming from csched2_schedule() for the same vCPU
> happens at t=60083283617ns. That is *a lot* (2103798534ns) later than
> when we expected and asked. Of course, that also means that delta is
> 2112537462ns, and therefore credits will sink to -2103798534!

Current ideas are:

- Could it be the vcpu is busy for a very long time in the hypervisor?
   So either fighting with another vcpu for a lock, doing a long
   running hypercall, ...

- The timer used is not reliable.

- The time base is not reliable (tsc or whatever is used for getting
   the time has jumped 2 seconds into the future).

- System management mode has kicked in.


Juergen
Andrew Cooper March 12, 2020, 4:27 p.m. UTC | #3
On 12/03/2020 15:51, Jürgen Groß wrote:
> On 12.03.20 14:44, Dario Faggioli wrote:
>> Hello everyone,
>>
>> There have been reports of a Credit2 issue due to which vCPUs where
>> being starved, to the point that guest kernel would complain or even
>> crash.
>>
>> See the following xen-users and xen-devel threads:
>> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
>>
>> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
>>
>> https://lists.xenproject.org/archives/html/xen-devel/2020-02/msg01158.html
>>
>>
>> I did some investigations, and figured out that the vCPUs in question
>> are not scheduled for long time intervals because they somehow manage to
>> be given an amount of credits which is less than the credit the idle
>> vCPU has.
>>
>> An example of this situation is shown here. In fact, we can see d0v1
>> sitting in the runqueue while all the CPUs are idle, as it has
>> -1254238270 credits, which is smaller than -2^30 = −1073741824:
>>
>>      (XEN) Runqueue 0:
>>      (XEN)   ncpus              = 28
>>      (XEN)   cpus               = 0-27
>>      (XEN)   max_weight         = 256
>>      (XEN)   pick_bias          = 22
>>      (XEN)   instload           = 1
>>      (XEN)   aveload            = 293391 (~111%)
>>      (XEN)   idlers:
>> 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>>      (XEN)   tickled:
>> 00,00000000,00000000,00000000,00000000,00000000,00000000
>>      (XEN)   fully idle cores:
>> 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>>      [...]
>>      (XEN) Runqueue 0:
>>      (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
>>      (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
>>      [...]
>>      (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
>>      (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
>>      (XEN) RUNQ:
>>      (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256]
>> load=262144 (~100%)
>>
>> This happens bacause --although very rarely-- vCPUs are allowed to
>> execute for much more than the scheduler would want them to.
>>
>> For example, I have a trace showing that csched2_schedule() is
>> invoked at
>> t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
>> fire at t=57979485083ns, i.e., 8738928ns in future. That's because
>> credit
>> of snext is exactly that 8738928ns. Then, what I see is that the next
>> call to burn_credits(), coming from csched2_schedule() for the same vCPU
>> happens at t=60083283617ns. That is *a lot* (2103798534ns) later than
>> when we expected and asked. Of course, that also means that delta is
>> 2112537462ns, and therefore credits will sink to -2103798534!
>
> Current ideas are:
>
> - Could it be the vcpu is busy for very long time in the hypervisor?
>   So either fighting with another vcpu for a lock, doing a long
>   running hypercall, ...

Using watchdog=2 might catch that.  (There is a counting issue which
I've not had time to fix yet, which makes the watchdog more fragile with
a smaller timeout, but 2 should be ok.)

>
> - The timer used is not reliable.
>
> - The time base is not reliable (tsc or whatever is used for getting
>   the time has jumped 2 seconds into the future).

Worth instrumenting the TSC rendezvous for unexpectedly large jumps?

>
> - System management mode has kicked in.

SMM handlers need to rendezvous to keep their secrets secret these days,
but I suppose this is always a possibility.

There are non-architectural SMI_COUNT MSRs (0x34 on Intel, can't
remember AMD off the top of my head) which can be used to see if any
have occurred, and this has proved useful in the past for debugging.
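
In case it helps, here is a minimal sketch of how such a check could be
wired into a throw-away debug hook inside Xen (dump_smi_count() is a
hypothetical helper; it assumes the usual rdmsr_safe()/printk()
facilities, and MSR 0x34 is Intel-specific and non-architectural, so
treat it as illustrative only):

    #include <xen/lib.h>
    #include <xen/smp.h>
    #include <xen/inttypes.h>
    #include <asm/msr.h>

    /* Sample the SMI count on the current CPU, so it can be compared
     * before and after a suspicious interval. */
    static void dump_smi_count(void)
    {
        uint64_t smi_count;

        if ( rdmsr_safe(0x34, smi_count) )
            printk("CPU%u: SMI count MSR not readable\n", smp_processor_id());
        else
            printk("CPU%u: SMI count = %" PRIu64 "\n",
                   smp_processor_id(), smi_count);
    }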

~Andrew
Dario Faggioli March 12, 2020, 5:02 p.m. UTC | #4
On Thu, 2020-03-12 at 16:08 +0100, Roger Pau Monné wrote:
> Thanks for looking into this, seems like a specially tricky issue to
> tackle!
> 
It was tricky indeed! :-)

> On Thu, Mar 12, 2020 at 02:44:07PM +0100, Dario Faggioli wrote:
> [...]
> > For example, I have a trace showing that csched2_schedule() is
> > invoked at
> > t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
> > fire at t=57979485083ns, i.e., 8738928ns in future. That's because
> > credit
> > of snext is exactly that 8738928ns. Then, what I see is that the
> > next
> > call to burn_credits(), coming from csched2_schedule() for the same
> > vCPU
> > happens at t=60083283617ns. That is *a lot* (2103798534ns) later
> > than
> > when we expected and asked. Of course, that also means that delta
> > is
> > 2112537462ns, and therefore credits will sink to -2103798534!
> 
> Which timer does this hardware use? DYK if there's some relation
> between the timer hardware used and the issue?
> 
Timers came to mind but I haven't checked yet.

FWIW, one thing I saw is that, without patches, my machine times out
around...

[    2.364819] NET: Registered protocol family 16
[    2.368018] xen:grant_table: Grant tables using version 1 layout
[    2.372033] Grant table initialized
[    2.377115] ACPI: bus type PCI registered
[    2.380011] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    2.384660] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    2.388033] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
[    2.499080] PCI: Using configuration type 1 for base access
[    2.516768] ACPI: Added _OSI(Module Device)
[    2.524006] ACPI: Added _OSI(Processor Device)
[    2.536004] ACPI: Added _OSI(3.0 _SCP Extensions)
[    2.544003] ACPI: Added _OSI(Processor Aggregator Device)
[    2.816022] ACPI: 4 ACPI AML tables successfully acquired and loaded
[    2.852011] xen: registering gsi 9 triggering 0 polarity 0
[    2.856021] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored

... here, during dom0 boot.

[    2.871615] ACPI: Dynamic OEM Table Load:
[    2.941945] ACPI: Interpreter enabled
[    2.952021] ACPI: (supports S0 S3 S4 S5)
[    2.960004] ACPI: Using IOAPIC for interrupt routing
[    2.972031] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    2.993032] ACPI: Enabled 6 GPEs in block 00 to 3F
[    3.042478] ACPI: PCI Root Bridge [UNC1] (domain 0000 [bus ff])
[    3.056010] acpi PNP0A03:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    3.079707] acpi PNP0A03:02: _OSC: platform does not support [SHPCHotplug LTR]
[    3.098999] acpi PNP0A03:02: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]

What do you mean by "Which timer does this hardware use"?

It's an Intel(R) Xeon(R) CPU E5-2697 and, when booted bare metal, this
is what I see:

# dmesg |grep -i time
[    0.000000] ACPI: PM-Timer IO Port: 0x408
[    0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 5188.26 BogoMIPS (lpj=10376536)
[    0.022602] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.062298] TSC deadline timer enabled
[    0.436174] RTC time: 16:56:42, date: 03/12/20
[    2.117580] RAPL PMU: API unit is 2^-32 Joules, 4 fixed counters, 655360 ms ovfl timer
[    2.123351] workingset: timestamp_bits=36 max_order=23 bucket_order=0
[    2.288144] intel_idle: lapic_timer_reliable_states 0xffffffff

Regards
Roger Pau Monné March 12, 2020, 5:59 p.m. UTC | #5
On Thu, Mar 12, 2020 at 06:02:03PM +0100, Dario Faggioli wrote:
> On Thu, 2020-03-12 at 16:08 +0100, Roger Pau Monné wrote:
> > Thanks for looking into this, seems like a specially tricky issue to
> > tackle!
> > 
> It was tricky indeed! :-)
> 
> > On Thu, Mar 12, 2020 at 02:44:07PM +0100, Dario Faggioli wrote:
> > [...]
> > > For example, I have a trace showing that csched2_schedule() is
> > > invoked at
> > > t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
> > > fire at t=57979485083ns, i.e., 8738928ns in future. That's because
> > > credit
> > > of snext is exactly that 8738928ns. Then, what I see is that the
> > > next
> > > call to burn_credits(), coming from csched2_schedule() for the same
> > > vCPU
> > > happens at t=60083283617ns. That is *a lot* (2103798534ns) later
> > > than
> > > when we expected and asked. Of course, that also means that delta
> > > is
> > > 2112537462ns, and therefore credits will sink to -2103798534!
> > 
> > Which timer does this hardware use? DYK if there's some relation
> > between the timer hardware used and the issue?
> > 
> Timers came to mind but I haven't checked yet.
> 
> FWIW, one thing I saw is that, without patches, my machine times out
> around...
> 
> [    2.364819] NET: Registered protocol family 16
> [    2.368018] xen:grant_table: Grant tables using version 1 layout
> [    2.372033] Grant table initialized
> [    2.377115] ACPI: bus type PCI registered
> [    2.380011] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> [    2.384660] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
> [    2.388033] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
> [    2.499080] PCI: Using configuration type 1 for base access
> [    2.516768] ACPI: Added _OSI(Module Device)
> [    2.524006] ACPI: Added _OSI(Processor Device)
> [    2.536004] ACPI: Added _OSI(3.0 _SCP Extensions)
> [    2.544003] ACPI: Added _OSI(Processor Aggregator Device)
> [    2.816022] ACPI: 4 ACPI AML tables successfully acquired and loaded
> [    2.852011] xen: registering gsi 9 triggering 0 polarity 0
> [    2.856021] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> 
> ... here, during dom0 boot.
> 
> [    2.871615] ACPI: Dynamic OEM Table Load:
> [    2.941945] ACPI: Interpreter enabled
> [    2.952021] ACPI: (supports S0 S3 S4 S5)
> [    2.960004] ACPI: Using IOAPIC for interrupt routing
> [    2.972031] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> [    2.993032] ACPI: Enabled 6 GPEs in block 00 to 3F
> [    3.042478] ACPI: PCI Root Bridge [UNC1] (domain 0000 [bus ff])
> [    3.056010] acpi PNP0A03:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
> [    3.079707] acpi PNP0A03:02: _OSC: platform does not support [SHPCHotplug LTR]
> [    3.098999] acpi PNP0A03:02: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
> 
> What do you mean with "Which timer does this hardware use" ?

Xen uses a hardware timer (HPET, PMTIMER or PIT IIRC) in order to get
interrupts at specified times, on my box I see for example:

(XEN) Platform timer is 23.999MHz HPET

You should also see something along those lines. I was wondering if
there was some relation between the timer in use and the delay in
timer interrupts that you are seeing.

Thanks, Roger.
Dario Faggioli March 13, 2020, 6:19 a.m. UTC | #6
On Thu, 2020-03-12 at 18:59 +0100, Roger Pau Monné wrote:
> On Thu, Mar 12, 2020 at 06:02:03PM +0100, Dario Faggioli wrote:
> > What do you mean with "Which timer does this hardware use" ?
> 
> Xen uses a hardware timer (HPET, PMTIMER or PIT IIRC) in order to get
> interrupts at specified times, on my box I see for example:
> 
> (XEN) Platform timer is 23.999MHz HPET
> 
> You should also see something along those lines. 
>
Right. When booting Xen, I see this, which looks fine to me (this is
currently Xen 4.12.2):

# xl dmesg |grep time
(XEN) [00018d6c323bf350] Platform timer is 14.318MHz HPET  <===========
(XEN) [   22.097647] TSC deadline timer enabled
(XEN) [   23.068865] mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) [   27.316536] mcheck_poll: Machine check polling timer started.

Regards
Dario Faggioli March 13, 2020, 7:26 a.m. UTC | #7
On Thu, 2020-03-12 at 16:27 +0000, Andrew Cooper wrote:
> On 12/03/2020 15:51, Jürgen Groß wrote:
> > - Could it be the vcpu is busy for very long time in the
> > hypervisor?
> >   So either fighting with another vcpu for a lock, doing a long
> >   running hypercall, ...
> 
> Using watchdog=2 might catch that.  (There is a counting issue which
> I've not had time to fix yet, which makes the watchdog more fragile
> with
> a smaller timeout, but 2 should be ok.)
> 
Done (actually, I used "watchdog=1 watchdog_timeout=2"), both with and
without these patches applied.

I don't see any watchdog firing.

While I was there, I tried 1 too. Nothing either.

> > - The timer used is not reliable.
> > 
> > - The time base is not reliable (tsc or whatever is used for
> > getting
> >   the time has jumped 2 seconds into the future).
> 
> Worth instrumenting the TSC rendezvous for unexpectedly large jumps?
> 
Yes, we should look into that.

> > - System management mode has kicked in.
> 
> There are non-architectural SMI_COUNT MSRs (0x34 on Intel, can't
> remember AMD off the top of my head) which can be used to see if any
> have occurred, and this has proved useful in the past for debugging.
> 
I'll have to look up how to check and dump these from where I need to.
Let's see if I can do that... hints, code pointers or anything else you
may have handy would be much appreciated. :-)

Thanks and Regards