Message ID | cb408368-077d-edb5-b4ad-f80086db48c1@invisiblethingslab.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | RFC: disable HPET legacy mode after timer check | expand |
On 11/04/2023 11:30 am, Simon Gaiser wrote: > Hi, > > I have been recently looking into getting S0ix working on Xen [1]. > > Thanks to a tip from Andrew I found that the HPET legacy mode was > preventing my test system from reaching a package C-state lower than PC7 > and thereby also preventing S0ix residency. > > For testing I simply modified check_timer() to disable it again after it > checked the timer irq: > > --- a/xen/arch/x86/io_apic.c > +++ b/xen/arch/x86/io_apic.c > @@ -1966,6 +1969,8 @@ static void __init check_timer(void) > > if ( timer_irq_works() ) > { > + hpet_disable_legacy_replacement_mode(); > local_irq_restore(flags); > return; > } > > > With this [2] I'm able to reach S0ix residency for some time and for short > periods the systems power consumption goes down to the same level as with > native Linux! Excellent progress! > It reaches low power states only for a fraction of the suspend to idle > time, so something still makes the CPU/chipset think it should leave the > low power mode, but that's another topic. Do you have any further info here? There are a range of possibilities, from excess timers in Xen (e.g. PV guests default to a 100Hz timer even though no guests actually want it AFAICT), or the 1s TSC rendezvous (which isn't actually needed on modern systems), all the way to the platform devices not entering d3hot. > > I tried to understand how all the timer code interacts with disabling > the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT > is not available (Which is available on my test system and indeed I > didn't run into obvious breakage). > > Is this (disabled PIT && !ARAT) a configuration that exists (and needs > to be supported)? > > Did I miss something else? (Very much possible, given that this is way > above my existing experience with X86 and Xen internals.) Xen's code is a mess and needs an overhaul. Right now, we're using the timer as "a source of interrupts" to try and check that we've got things set up suitably. But this doesn't need to be the PIT, or a timer at all - it just needs to be "an interrupt coming in from the platform". Furthermore, there will soon be client platforms with no PIT/HPET/etc at all. (HPET is known broken in <PC7, and not used these days anyway in order to get deeper sleep working), so this logic is going to explode on us yet again. AFAICT moving forwards, the expectation is to use TSCs as the clocksource, and the deadline timer for interrupts. ~Andrew
On Tue, Apr 11, 2023 at 12:20:13PM +0100, Andrew Cooper wrote: > On 11/04/2023 11:30 am, Simon Gaiser wrote: > > Hi, > > > > I have been recently looking into getting S0ix working on Xen [1]. > > > > Thanks to a tip from Andrew I found that the HPET legacy mode was > > preventing my test system from reaching a package C-state lower than PC7 > > and thereby also preventing S0ix residency. > > > > For testing I simply modified check_timer() to disable it again after it > > checked the timer irq: > > > > --- a/xen/arch/x86/io_apic.c > > +++ b/xen/arch/x86/io_apic.c > > @@ -1966,6 +1969,8 @@ static void __init check_timer(void) > > > > if ( timer_irq_works() ) > > { > > + hpet_disable_legacy_replacement_mode(); > > local_irq_restore(flags); > > return; > > } > > > > > > With this [2] I'm able to reach S0ix residency for some time and for short > > periods the systems power consumption goes down to the same level as with > > native Linux! > > Excellent progress! > > > It reaches low power states only for a fraction of the suspend to idle > > time, so something still makes the CPU/chipset think it should leave the > > low power mode, but that's another topic. > > Do you have any further info here? There are a range of possibilities, > from excess timers in Xen (e.g. PV guests default to a 100Hz timer even > though no guests actually want it AFAICT), or the 1s TSC rendezvous > (which isn't actually needed on modern systems), all the way to the > platform devices not entering d3hot. > > > > > I tried to understand how all the timer code interacts with disabling > > the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT > > is not available (Which is available on my test system and indeed I > > didn't run into obvious breakage). > > > > Is this (disabled PIT && !ARAT) a configuration that exists (and needs > > to be supported)? > > > > Did I miss something else? (Very much possible, given that this is way > > above my existing experience with X86 and Xen internals.) > > Xen's code is a mess and needs an overhaul. > > Right now, we're using the timer as "a source of interrupts" to try and > check that we've got things set up suitably. But this doesn't need to > be the PIT, or a timer at all - it just needs to be "an interrupt coming > in from the platform". I would even question whether that testing is useful overall. We test a single IO-APIC pin, which still leaves room for the rest of them to not be properly configured, and Xen might not be using the PIT timer at the end. IOW: I think it's fine to test that the timer is working, but forcing that to be routed through the IO-APIC is wrong. HPET for example can support FSB, which skips the IO-APIC and injects the vector directly into the local APIC. We do support the APIC deadline timer, so I'm confused as to why we also need to use the PIT or HPET. I don't think I fully understand the relation between the usage of PIT/HPET/APIC deadline timers on Xen. Roger.
On 12.04.2023 13:25, Roger Pau Monné wrote: > On Tue, Apr 11, 2023 at 12:20:13PM +0100, Andrew Cooper wrote: >> On 11/04/2023 11:30 am, Simon Gaiser wrote: >>> Hi, >>> >>> I have been recently looking into getting S0ix working on Xen [1]. >>> >>> Thanks to a tip from Andrew I found that the HPET legacy mode was >>> preventing my test system from reaching a package C-state lower than PC7 >>> and thereby also preventing S0ix residency. >>> >>> For testing I simply modified check_timer() to disable it again after it >>> checked the timer irq: >>> >>> --- a/xen/arch/x86/io_apic.c >>> +++ b/xen/arch/x86/io_apic.c >>> @@ -1966,6 +1969,8 @@ static void __init check_timer(void) >>> >>> if ( timer_irq_works() ) >>> { >>> + hpet_disable_legacy_replacement_mode(); >>> local_irq_restore(flags); >>> return; >>> } >>> >>> >>> With this [2] I'm able to reach S0ix residency for some time and for short >>> periods the systems power consumption goes down to the same level as with >>> native Linux! >> >> Excellent progress! >> >>> It reaches low power states only for a fraction of the suspend to idle >>> time, so something still makes the CPU/chipset think it should leave the >>> low power mode, but that's another topic. >> >> Do you have any further info here? There are a range of possibilities, >> from excess timers in Xen (e.g. PV guests default to a 100Hz timer even >> though no guests actually want it AFAICT), or the 1s TSC rendezvous >> (which isn't actually needed on modern systems), all the way to the >> platform devices not entering d3hot. >> >>> >>> I tried to understand how all the timer code interacts with disabling >>> the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT >>> is not available (Which is available on my test system and indeed I >>> didn't run into obvious breakage). >>> >>> Is this (disabled PIT && !ARAT) a configuration that exists (and needs >>> to be supported)? >>> >>> Did I miss something else? (Very much possible, given that this is way >>> above my existing experience with X86 and Xen internals.) >> >> Xen's code is a mess and needs an overhaul. >> >> Right now, we're using the timer as "a source of interrupts" to try and >> check that we've got things set up suitably. But this doesn't need to >> be the PIT, or a timer at all - it just needs to be "an interrupt coming >> in from the platform". > > I would even question whether that testing is useful overall. We test > a single IO-APIC pin, which still leaves room for the rest of them to > not be properly configured, and Xen might not be using the PIT timer at > the end. Testing one pin is sufficient for the intended purpose (proving that the delivery route platform -> IO-APIC -> LAPIC works), leaving aside firmware possibly configuring multiple IO-APICs inconsistently. Yet if there are multiple IO-APICs, I'm afraid we have no way of knowing how to trigger any of the pins of secondary ones. Even if we went to figure out what devices are connected to it, we'd then still have no (rudimentary) device drivers knowing how to interact with the devices. Jan
On Mon, Apr 17, 2023 at 10:49:38AM +0200, Jan Beulich wrote: > On 12.04.2023 13:25, Roger Pau Monné wrote: > > On Tue, Apr 11, 2023 at 12:20:13PM +0100, Andrew Cooper wrote: > >> On 11/04/2023 11:30 am, Simon Gaiser wrote: > >>> Hi, > >>> > >>> I have been recently looking into getting S0ix working on Xen [1]. > >>> > >>> Thanks to a tip from Andrew I found that the HPET legacy mode was > >>> preventing my test system from reaching a package C-state lower than PC7 > >>> and thereby also preventing S0ix residency. > >>> > >>> For testing I simply modified check_timer() to disable it again after it > >>> checked the timer irq: > >>> > >>> --- a/xen/arch/x86/io_apic.c > >>> +++ b/xen/arch/x86/io_apic.c > >>> @@ -1966,6 +1969,8 @@ static void __init check_timer(void) > >>> > >>> if ( timer_irq_works() ) > >>> { > >>> + hpet_disable_legacy_replacement_mode(); > >>> local_irq_restore(flags); > >>> return; > >>> } > >>> > >>> > >>> With this [2] I'm able to reach S0ix residency for some time and for short > >>> periods the systems power consumption goes down to the same level as with > >>> native Linux! > >> > >> Excellent progress! > >> > >>> It reaches low power states only for a fraction of the suspend to idle > >>> time, so something still makes the CPU/chipset think it should leave the > >>> low power mode, but that's another topic. > >> > >> Do you have any further info here? There are a range of possibilities, > >> from excess timers in Xen (e.g. PV guests default to a 100Hz timer even > >> though no guests actually want it AFAICT), or the 1s TSC rendezvous > >> (which isn't actually needed on modern systems), all the way to the > >> platform devices not entering d3hot. > >> > >>> > >>> I tried to understand how all the timer code interacts with disabling > >>> the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT > >>> is not available (Which is available on my test system and indeed I > >>> didn't run into obvious breakage). > >>> > >>> Is this (disabled PIT && !ARAT) a configuration that exists (and needs > >>> to be supported)? > >>> > >>> Did I miss something else? (Very much possible, given that this is way > >>> above my existing experience with X86 and Xen internals.) > >> > >> Xen's code is a mess and needs an overhaul. > >> > >> Right now, we're using the timer as "a source of interrupts" to try and > >> check that we've got things set up suitably. But this doesn't need to > >> be the PIT, or a timer at all - it just needs to be "an interrupt coming > >> in from the platform". > > > > I would even question whether that testing is useful overall. We test > > a single IO-APIC pin, which still leaves room for the rest of them to > > not be properly configured, and Xen might not be using the PIT timer at > > the end. > > Testing one pin is sufficient for the intended purpose (proving that > the delivery route platform -> IO-APIC -> LAPIC works), leaving aside > firmware possibly configuring multiple IO-APICs inconsistently. Yet > if there are multiple IO-APICs, I'm afraid we have no way of knowing > how to trigger any of the pins of secondary ones. Even if we went to > figure out what devices are connected to it, we'd then still have no > (rudimentary) device drivers knowing how to interact with the devices. That's why I think the test is not very useful. Also the delivery route not being properly configured will be quite obvious when dom0 attempts to to use any device, as it would get timeouts. I think it's fine to test that the timer interrupt in use by Xen is working, but forcing this kind of interrupt delivery test doesn't seem specially useful to me, the more that we keep accumulation workarounds to make it work on newer platforms. Thanks, Roger.
Andrew Cooper: [...] >> It reaches low power states only for a fraction of the suspend to idle >> time, so something still makes the CPU/chipset think it should leave the >> low power mode, but that's another topic. > > Do you have any further info here? There are a range of possibilities, > from excess timers in Xen (e.g. PV guests default to a 100Hz timer even > though no guests actually want it AFAICT), or the 1s TSC rendezvous > (which isn't actually needed on modern systems), all the way to the > platform devices not entering d3hot. So in the meantime I got some progress here. What helps a lot is setting cpufreq to powersave before going to s2idle. With that I get residency of about 88 % (everything is still tested with only dom0 running). Not yet the > 99 % that a native Linux manages, but much better than before (<< 50 %). While, based on your and Marek's feedback, I was already looking at active timers, I first ignored the cpufreq dbs timer since the idle driver suspend it and I assumed it was active because I wake things up when triggering the debug key. But turns out disabling the ondemand governor has a big effect. But not sure if it's the timer itself or some other part of it. I tried to disable the time calibration timer. While eyeballing on the power meter I first thought it brings some improvement there's no difference according to the residency counters (will need to improve my power measurement setup). Other timers I see active: common/sched/core.c#vcpu_singleshot_timer_fn: If I understand correctly those are configure by the domain (so dom0 here). So Linux should do this right. But I will have to have a closer look. arch/x86/cpu/mcheck/non-fatal.c#mce_work_fn: Triggers only seldom (something >> 1 s), so unlikely. But will try disabling. arch/x86/time.c#plt_overflow: Dito. Simon
--- a/xen/arch/x86/io_apic.c +++ b/xen/arch/x86/io_apic.c @@ -1966,6 +1969,8 @@ static void __init check_timer(void) if ( timer_irq_works() ) { + hpet_disable_legacy_replacement_mode(); local_irq_restore(flags); return; }