
x86/intel: Disable HPET on another Intel Coffee Lake platform

Message ID: 20210916131739.1260552-1-kuba@kernel.org (mailing list archive)
State: Superseded
Delegated to: Bjorn Helgaas
Series: x86/intel: Disable HPET on another Intel Coffee Lake platform

Commit Message

Jakub Kicinski Sept. 16, 2021, 1:17 p.m. UTC
My Lenovo T490s with i7-8665U had been marking TSC as unstable
since v5.13, resulting in very sluggish desktop experience...

I have a 8086:3e34 bridge, also known as "Host bridge: Intel
Corporation Coffee Lake HOST and DRAM Controller (rev 0c)".
Add it to the list.

We should perhaps consider applying this quirk more widely.
The Intel documentation does not list my device [1], but
linuxhw [2] does, and it seems to list a few more bridges
we do not currently cover (3e31, 3ecc, 3e35, 3e0f).

[1] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/8th-gen-core-family-datasheet-vol-2.pdf
[2] https://github.com/linuxhw/DevicePopulation/blob/master/README.md

Cc: stable@vger.kernel.org # v5.13+
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
---
 arch/x86/kernel/early-quirks.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Bjorn Helgaas Sept. 16, 2021, 3:07 p.m. UTC | #1
On Thu, Sep 16, 2021 at 06:17:39AM -0700, Jakub Kicinski wrote:
> My Lenovo T490s with i7-8665U had been marking TSC as unstable
> since v5.13, resulting in very sluggish desktop experience...

Including the actual dmesg log line here might help others locate this
fix.

> I have a 8086:3e34 bridge, also known as "Host bridge: Intel
> Corporation Coffee Lake HOST and DRAM Controller (rev 0c)".
> Add it to the list.
> 
> We should perhaps consider applying this quirk more widely.
> The Intel documentation does not list my device [1], but
> linuxhw [2] does, and it seems to list a few more bridges
> we do not currently cover (3e31, 3ecc, 3e35, 3e0f).

In the fine tradition of:

  e0748539e3d5 ("x86/intel: Disable HPET on Intel Ice Lake platforms")
  f8edbde885bb ("x86/intel: Disable HPET on Intel Coffee Lake H platforms")
  fc5db58539b4 ("x86/quirks: Disable HPET on Intel Coffe Lake platforms")
  62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail plat
form")

This seems to be an ongoing issue, not just a point defect in a single
product, and I really hate the onesy-twosy nature of this.  Is there
really no way to detect this issue automatically or fix whatever Linux
bug makes us trip over this?  I am no clock expert, so I have
absolutely no idea whether this is possible.

> [1] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/8th-gen-core-family-datasheet-vol-2.pdf
> [2] https://github.com/linuxhw/DevicePopulation/blob/master/README.md
> 
> Cc: stable@vger.kernel.org # v5.13+

How did you pick v5.13?  force_disable_hpet() was added by
62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail
platform"), which appeared in v3.15.

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> ---
>  arch/x86/kernel/early-quirks.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 38837dad46e6..7d2de04f8750 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -716,6 +716,8 @@ static struct chipset early_qrk[] __initdata = {
>  		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
>  	{ PCI_VENDOR_ID_INTEL, 0x3e20,
>  		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
> +	{ PCI_VENDOR_ID_INTEL, 0x3e34,
> +		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
>  	{ PCI_VENDOR_ID_INTEL, 0x3ec4,
>  		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
>  	{ PCI_VENDOR_ID_INTEL, 0x8a12,
> -- 
> 2.31.1
>
Jakub Kicinski Sept. 16, 2021, 3:30 p.m. UTC | #2
On Thu, 16 Sep 2021 10:07:07 -0500 Bjorn Helgaas wrote:
> On Thu, Sep 16, 2021 at 06:17:39AM -0700, Jakub Kicinski wrote:
> > My Lenovo T490s with i7-8665U had been marking TSC as unstable
> > since v5.13, resulting in very sluggish desktop experience...  
> 
> Including the actual dmesg log line here might help others locate this
> fix.

Good point, will add in v2.

clocksource: timekeeping watchdog on CPU3: hpet read-back delay of 316000ns, attempt 4, marking unstable
tsc: Marking TSC unstable due to clocksource watchdog
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
sched_clock: Marking unstable (14539801827657, -530891666)<-(14539319241737, -48307500)
clocksource: Checking clocksource tsc synchronization from CPU 3 to CPUs 0-2,6-7.
clocksource: Switched to clocksource hpet

> > I have a 8086:3e34 bridge, also known as "Host bridge: Intel
> > Corporation Coffee Lake HOST and DRAM Controller (rev 0c)".
> > Add it to the list.
> > 
> > We should perhaps consider applying this quirk more widely.
> > The Intel documentation does not list my device [1], but
> > linuxhw [2] does, and it seems to list a few more bridges
> > we do not currently cover (3e31, 3ecc, 3e35, 3e0f).  
> 
> In the fine tradition of:
> 
>   e0748539e3d5 ("x86/intel: Disable HPET on Intel Ice Lake platforms")
>   f8edbde885bb ("x86/intel: Disable HPET on Intel Coffee Lake H platforms")
>   fc5db58539b4 ("x86/quirks: Disable HPET on Intel Coffe Lake platforms")
>   62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail platform")
> 
> This seems to be an ongoing issue, not just a point defect in a single
> product, and I really hate the onesy-twosy nature of this.

Indeed. Or at least cover all Coffee Lakes in one fell swoop.

> Is there really no way to detect this issue automatically or fix
> whatever Linux bug makes us trip over this?  I am no clock expert, so
> I have absolutely no idea whether this is possible.

I'm deferring to clock experts. Paul mentioned he has some prototype
patches that may help.

> > [1] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/8th-gen-core-family-datasheet-vol-2.pdf
> > [2] https://github.com/linuxhw/DevicePopulation/blob/master/README.md
> > 
> > Cc: stable@vger.kernel.org # v5.13+  
> 
> How did you pick v5.13?  force_disable_hpet() was added by
> 62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail
> platform"), which appeared in v3.15.

Erm, good question, it started happening for me (and others with the
same laptop) with v5.13. I just sort of assumed it was 2e27e793e280
("clocksource: Reduce clocksource-skew threshold"). 

It usually takes a day to repro (4 hours was the quickest repro I've
seen) so bisection was kind of out of the question.
Paul E. McKenney Sept. 16, 2021, 4:35 p.m. UTC | #3
On Thu, Sep 16, 2021 at 08:30:42AM -0700, Jakub Kicinski wrote:
> On Thu, 16 Sep 2021 10:07:07 -0500 Bjorn Helgaas wrote:
> > On Thu, Sep 16, 2021 at 06:17:39AM -0700, Jakub Kicinski wrote:
> > > My Lenovo T490s with i7-8665U had been marking TSC as unstable
> > > since v5.13, resulting in very sluggish desktop experience...  
> > 
> > Including the actual dmesg log line here might help others locate this
> > fix.
> 
> Good point, will add in v2.
> 
> clocksource: timekeeping watchdog on CPU3: hpet read-back delay of 316000ns, attempt 4, marking unstable
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (14539801827657, -530891666)<-(14539319241737, -48307500)
> clocksource: Checking clocksource tsc synchronization from CPU 3 to CPUs 0-2,6-7.
> clocksource: Switched to clocksource hpet
> 
> > > I have a 8086:3e34 bridge, also known as "Host bridge: Intel
> > > Corporation Coffee Lake HOST and DRAM Controller (rev 0c)".
> > > Add it to the list.
> > > 
> > > We should perhaps consider applying this quirk more widely.
> > > The Intel documentation does not list my device [1], but
> > > linuxhw [2] does, and it seems to list a few more bridges
> > > we do not currently cover (3e31, 3ecc, 3e35, 3e0f).  
> > 
> > In the fine tradition of:
> > 
> >   e0748539e3d5 ("x86/intel: Disable HPET on Intel Ice Lake platforms")
> >   f8edbde885bb ("x86/intel: Disable HPET on Intel Coffee Lake H platforms")
> >   fc5db58539b4 ("x86/quirks: Disable HPET on Intel Coffe Lake platforms")
> >   62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail platform")
> > 
> > This seems to be an ongoing issue, not just a point defect in a single
> > product, and I really hate the onesy-twosy nature of this.
> 
> Indeed. Or at least cover all Coffee Lakes in one fell swoop.
> 
> > Is there really no way to detect this issue automatically or fix
> > whatever Linux bug makes us trip over this?  I am no clock expert, so
> > I have absolutely no idea whether this is possible.
> 
> I'm deferring to clock experts. Paul mentioned he has some prototype
> patches that may help.
> 
> > > [1] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/8th-gen-core-family-datasheet-vol-2.pdf
> > > [2] https://github.com/linuxhw/DevicePopulation/blob/master/README.md
> > > 
> > > Cc: stable@vger.kernel.org # v5.13+  
> > 
> > How did you pick v5.13?  force_disable_hpet() was added by
> > 62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail
> > platform"), which appeared in v3.15.
> 
> Erm, good question, it started happening for me (and others with the
> same laptop) with v5.13. I just sort of assumed it was 2e27e793e280
> ("clocksource: Reduce clocksource-skew threshold"). 
> 
> It usually takes  a day to repro (4 hours was the quickest repro I've
> seen) so bisection was kind of out of question.

OK, so this is an intermittent condition where HPET is sometimes slow to
access for a short period of time?  If that is the case, my thought is
to set the clocksource to be reinitialized (without a splat and without
marking the clocksource unstable), and to splat (and mark the clocksource
unstable) if it does not get a good read after 100 subsequent attempts.

So as long as the period of slowness lasts for less than 50 seconds,
things would work fine.

Seem reasonable?

							Thanx, Paul
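
A minimal sketch of the retry idea described above, as it might look inside
kernel/time/clocksource.c; the names WD_MAX_BAD_READS, wd_bad_reads and
wd_handle_bad_read() are hypothetical, a good read would reset the counter
(not shown), and this is not Paul's actual prototype. Since the watchdog
runs roughly every 0.5 s, 100 retries matches the ~50-second window
mentioned above.

#define WD_MAX_BAD_READS	100	/* hypothetical limit */

static unsigned int wd_bad_reads;

/* Called where the watchdog currently gives up after a bad comparison. */
static void wd_handle_bad_read(struct clocksource *cs)
{
	if (++wd_bad_reads < WD_MAX_BAD_READS) {
		/* Quietly restart the comparison on the next pass. */
		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
		return;
	}
	/* ~100 passes at ~0.5 s each, i.e. ~50 s of slowness: give up. */
	wd_bad_reads = 0;
	__clocksource_unstable(cs);
}
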
Jakub Kicinski Sept. 17, 2021, 2:57 a.m. UTC | #4
On Thu, 16 Sep 2021 09:35:47 -0700 Paul E. McKenney wrote:
> > > How did you pick v5.13?  force_disable_hpet() was added by
> > > 62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail
> > > platform"), which appeared in v3.15.  
> > 
> > Erm, good question, it started happening for me (and others with the
> > same laptop) with v5.13. I just sort of assumed it was 2e27e793e280
> > ("clocksource: Reduce clocksource-skew threshold"). 
> > 
> > It usually takes  a day to repro (4 hours was the quickest repro I've
> > seen) so bisection was kind of out of question.  
> 
> OK, so this is an intermittent condition where HPET is sometimes slow to
> access for a short period of time?  If that is the case, my thought is
> to set the clocksource to be reinitialized (without a splat and without
> marking the clocksource unstable), and to splat (and mark the clocksource
> unstable) if it is not get a good read after 100 subsequent attempts.
> 
> So as long as the period of slowness lasts for less than 50 seconds,
> things would work fine.
> 
> Seem reasonable?

Could well be. Initially I thought it was suspend/resume related, then
I looked closer and it did happen mostly after resume... but anywhere
from 20 minutes to a few hours after the resume.

I'm here to test less crude patches but since that may take some time
I'd hope we can get this merged and into stable ASAP. Hopefully it can
make it to 5.13 while that branch is alive and into Fedora. It really
makes Coffee Lake machines pretty much unusable.
Paul E. McKenney Sept. 17, 2021, 3:33 a.m. UTC | #5
On Thu, Sep 16, 2021 at 07:57:13PM -0700, Jakub Kicinski wrote:
> On Thu, 16 Sep 2021 09:35:47 -0700 Paul E. McKenney wrote:
> > > > How did you pick v5.13?  force_disable_hpet() was added by
> > > > 62187910b0fc ("x86/intel: Add quirk to disable HPET for the Baytrail
> > > > platform"), which appeared in v3.15.  
> > > 
> > > Erm, good question, it started happening for me (and others with the
> > > same laptop) with v5.13. I just sort of assumed it was 2e27e793e280
> > > ("clocksource: Reduce clocksource-skew threshold"). 
> > > 
> > > It usually takes  a day to repro (4 hours was the quickest repro I've
> > > seen) so bisection was kind of out of question.  
> > 
> > OK, so this is an intermittent condition where HPET is sometimes slow to
> > access for a short period of time?  If that is the case, my thought is
> > to set the clocksource to be reinitialized (without a splat and without
> > marking the clocksource unstable), and to splat (and mark the clocksource
> > unstable) if it is not get a good read after 100 subsequent attempts.
> > 
> > So as long as the period of slowness lasts for less than 50 seconds,
> > things would work fine.
> > 
> > Seem reasonable?
> 
> Could well be. Initially I thought it was suspend/resume related, then
> I looked closer and it did happen mostly after resume... but anywhere
> between 20 minutes to few hours after the resume.
> 
> I'm here to test less crude patches but since that may take some time
> I'd hope we can get this merged and into stable ASAP. Hopefully it can
> make it to 5.13 while that branch is alive and into Fedora. It really
> makes Coffee Lake machines pretty much unusable.

Indeed, I could see where your reproducer is at best rather annoying.
But a good data point for me either way, so thank you!

							Thanx, Paul
Peter Zijlstra Sept. 17, 2021, 9:11 a.m. UTC | #6
On Thu, Sep 16, 2021 at 10:07:07AM -0500, Bjorn Helgaas wrote:
> This seems to be an ongoing issue, not just a point defect in a single
> product, and I really hate the onesy-twosy nature of this.  Is there
> really no way to detect this issue automatically or fix whatever Linux
> bug makes us trip over this?  I am no clock expert, so I have
> absolutely no idea whether this is possible.

X86 is gifted with the grand total of _0_ reliable clocks. Given no
accurate time, it is impossible to tell which one of them is broken
the worst. Although I suppose we could attempt to synchronize against
the PMU or MPERF..

We could possibly disable the tsc watchdog for
X86_FEATURE_TSC_KNOWN_FREQ && X86_FEATURE_TSC_ADJUST I suppose.

And then have people with 'creative' BIOS get to keep the pieces.
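
A rough sketch of the gating Peter suggests, assuming the check would sit
next to the TSC clocksource registration in arch/x86/kernel/tsc.c;
illustrative only, not a patch from this thread:

static void __init tsc_maybe_skip_watchdog(void)
{
	/*
	 * Sketch: skip clocksource-watchdog verification when the TSC
	 * frequency is known exactly from CPUID/MSR and TSC_ADJUST is
	 * available to catch writes to the TSC behind our back.
	 */
	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ) &&
	    boot_cpu_has(X86_FEATURE_TSC_ADJUST))
		clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}
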
Peter Zijlstra Sept. 17, 2021, 9:34 a.m. UTC | #7
On Fri, Sep 17, 2021 at 11:11:49AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 16, 2021 at 10:07:07AM -0500, Bjorn Helgaas wrote:
> > This seems to be an ongoing issue, not just a point defect in a single
> > product, and I really hate the onesy-twosy nature of this.  Is there
> > really no way to detect this issue automatically or fix whatever Linux
> > bug makes us trip over this?  I am no clock expert, so I have
> > absolutely no idea whether this is possible.
> 
> X86 is gifted with the grant total of _0_ reliable clocks. Given no
> accurate time, it is impossible to tell which one of them is broken
> worst. Although I suppose we could attempt to synchronize against the
> PMU or MPERF..
> 
> We could possibly disable the tsc watchdog for
> X86_FEATURE_TSC_KNOWN_FREQ && X86_FEATURE_TSC_ADJUST I suppose.
> 
> And then have people with 'creative' BIOS get to keep the pieces.

Alternatively, we can change what the TSC watchdog does for
X86_FEATURE_TSC_ADJUST machines. Instead of checking time against HPET
it can check if TSC_ADJUST changes. That should make it more resilient
vs HPET time itself being off.
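
A sketch of that alternative, assuming it would live near the existing
TSC_ADJUST handling in arch/x86/kernel/tsc_sync.c; the helper below is
hypothetical and only illustrates the MSR comparison, not how it would be
wired into the watchdog:

static u64 saved_tsc_adjust;

/* Return true if nothing has written TSC_ADJUST since the last check. */
static bool tsc_adjust_unchanged(void)
{
	u64 cur;

	rdmsrl(MSR_IA32_TSC_ADJUST, cur);
	if (cur == saved_tsc_adjust)
		return true;

	saved_tsc_adjust = cur;	/* remember the new value for next time */
	return false;
}
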
Krzysztof Wilczyński Sept. 17, 2021, 2 p.m. UTC | #8
Hi Jakub,

[...]
> The Intel documentation does not list my device [1], but
> linuxhw [2] does, and it seems to list a few more bridges
> we do not currently cover (3e31, 3ecc, 3e35, 3e0f).

This might be out-of-scope for this particular patch, but it makes me
wonder - if we know that these other bridges were identified as having the
same issue, would it be prudent to add quirks for them too?  It might
save some people a headache and such.

	Krzysztof
Jakub Kicinski Sept. 17, 2021, 2:58 p.m. UTC | #9
On Fri, 17 Sep 2021 16:00:03 +0200 Krzysztof Wilczyński wrote:
> > The Intel documentation does not list my device [1], but
> > linuxhw [2] does, and it seems to list a few more bridges
> > we do not currently cover (3e31, 3ecc, 3e35, 3e0f).  
> 
> This might be out-of-scope for this particular patch, but it makes me
> wonder - if we know that these other bridges were identified as having the
> same issue, would it be prudent to add a quriks for them too?  It might
> save some people a headache and such.

Indeed, I was about to do it but the disagreement between the Intel
docs and the linuxhw device list discouraged me. It's confusing that
the most popular SKU in the field (according to linuxhw, which is also
the one I have and the one addressed by this patch) is not in the
official Intel PDF. And what about the numbers that fall in between
the known revisions? Perhaps we could use a mask instead of having
10+ entries?

Then there's also a precedent of adding them one by one in the patches
quoted by Bjorn.

All in all I thought I'd just mention this and let the people familiar
with x86 platforms guide me.

Or maybe we can wait for patches from Paul or Peter to fix the less
popular devices?

Not sure... I'm happy to produce a patch that adds more entries, or
for someone else to do it...
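
For illustration only: if the remaining linuxhw-reported bridges turned out
to need the same treatment, the extra early_qrk[] entries would mirror the
one in this patch. None of these device IDs has been verified, so the lines
below are a sketch, not a proposal:

	/* Unverified: Coffee Lake host bridges reported only by linuxhw. */
	{ PCI_VENDOR_ID_INTEL, 0x3e0f,
		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
	{ PCI_VENDOR_ID_INTEL, 0x3e31,
		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
	{ PCI_VENDOR_ID_INTEL, 0x3e35,
		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
	{ PCI_VENDOR_ID_INTEL, 0x3ecc,
		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
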
Thomas Gleixner Sept. 19, 2021, 12:14 a.m. UTC | #10
On Fri, Sep 17 2021 at 11:34, Peter Zijlstra wrote:
> On Fri, Sep 17, 2021 at 11:11:49AM +0200, Peter Zijlstra wrote:
>> On Thu, Sep 16, 2021 at 10:07:07AM -0500, Bjorn Helgaas wrote:
>> > This seems to be an ongoing issue, not just a point defect in a single
>> > product, and I really hate the onesy-twosy nature of this.  Is there
>> > really no way to detect this issue automatically or fix whatever Linux
>> > bug makes us trip over this?  I am no clock expert, so I have
>> > absolutely no idea whether this is possible.

Right, we need to have all these quirks because we can't define a
generation cutoff based on family/model; the X86 model is simply a
random number. There might be some scheme behind it, but it's neither
obvious nor documented.

But the HPET on the affected machines goes south when the system enters
PC10. So the right thing to do is to check whether PC10 is supported and
force disable HPET if that's the case. That disablement is required
independent of the clocksource watchdog problem because HPET is exposed
in other ways as well.

Questions for Rafael:

What's the proper way to figure out whether PC10 is supported? I got
lost in the maze of intel_idle and ACPI muck and several other places
which check that. Just grep for CPUID_MWAIT_LEAF and see how consistent
all of that is.

Why the heck can't we have _ONE_ authoritative implementation of that?
Just because, right?

Of course all of this is well documented as usual....

>> X86 is gifted with the grant total of _0_ reliable clocks. Given no
>> accurate time, it is impossible to tell which one of them is broken
>> worst. Although I suppose we could attempt to synchronize against the
>> PMU or MPERF..
>> 
>> We could possibly disable the tsc watchdog for
>> X86_FEATURE_TSC_KNOWN_FREQ && X86_FEATURE_TSC_ADJUST I suppose.
>> 
>> And then have people with 'creative' BIOS get to keep the pieces.
>
> Alternatively, we can change what the TSC watchdog does for
> X86_FEATURE_TSC_ADJUST machines. Instead of checking time against HPET
> it can check if TSC_ADJUST changes. That should make it more resillient
> vs HPET time itself being off.

I tried that and I hated the mess it created. Abusing the clocksource
watchdog machinery for that is a nightmare. Don't even think about it.

When TSC_ADJUST is available then we check the MSR when a CPU goes idle,
but not more often than once per second. My concern is that we can't
check TSC_ADJUST for modifications on fully loaded CPUs, but why do I
still care?

The requirements for ditching the watchdog should be:

    X86_FEATURE_TSC_KNOWN_FREQ &&
    X86_FEATURE_TSC_ADJUST &&
    X86_FEATURE_NONSTOP_TSC &&
    X86_FEATURE_ARAT

But expecting X86_FEATURE_TSC_KNOWN_FREQ to be set on these HPET
trainwreck equipped machines is wishful thinking:

# cpuid -1 -l 0x15
CPU:
   Time Stamp Counter/Core Crystal Clock Information (0x15):
      TSC/clock ratio = 176/2
      nominal core crystal clock = 0 Hz

We then calculate that missing frequency from leaf 0x16:

# cpuid -1 -l 0x16
CPU:
   Processor Frequency Information (0x16):
      Core Base Frequency (MHz) = 0x834 (2100)
      Core Maximum Frequency (MHz) = 0x1068 (4200)
      Bus (Reference) Frequency (MHz) = 0x64 (100)

But we don't set the TSC_KNOWN_FREQ feature bit when the crystal
clock is 0 and we need to use leaf 0x16. Which is entirely correct because
the Core Base Frequency CPUID info is a joke:

[    3.045828] tsc: Refined TSC clocksource calibration: 2111.993 MHz

The refined calibration is pretty accurate according to NTP, and if you
take the CPUID 15/16 numbers into account it's even obvious from pure math:

  TSC/clock ratio = 176/2
  Core Base Frequency (MHz) = 0x834 (2100)

  2100 / (176 / 2) = 23.8636 (MHz)

which would be a very unusual crystal frequency. The refined calibration
makes a lot more sense:

  2112 / (176 / 2) = 24 (MHz)

which is one of the very well known crystal frequencies of these
machines.

It's 2021 now and we are still not able to get reasonable information
from hardware/firmware about this?

Can the hardware and firmware people finally get their act together?

Here is the simple list of things we are asking for:

 - Reliable TSC which cannot be tinkered with even by "value add" BIOSes

 - Reliable information from hardware/firmware about the TSC frequency

 - Hardware enforced (or firmware assisted) guarantees of TSC being
   synchronized across sockets

We have been asking for that for more than _twenty_ years now. All we get
are useless new features and, as demonstrated in the case at hand, new
types of wreckage instead of a proper solution to the underlying
problems.

I've personally been dealing with x86 timer hardware trainwrecks for more
than 20 years now. TBH, I'm tired of this idiocy and very close to the
point where I stop caring.

Thanks,

        tglx
Wysocki, Rafael J Sept. 21, 2021, 6:05 p.m. UTC | #11
On 9/19/2021 2:14 AM, Thomas Gleixner wrote:
> On Fri, Sep 17 2021 at 11:34, Peter Zijlstra wrote:
>> On Fri, Sep 17, 2021 at 11:11:49AM +0200, Peter Zijlstra wrote:
>>> On Thu, Sep 16, 2021 at 10:07:07AM -0500, Bjorn Helgaas wrote:
>>>> This seems to be an ongoing issue, not just a point defect in a single
>>>> product, and I really hate the onesy-twosy nature of this.  Is there
>>>> really no way to detect this issue automatically or fix whatever Linux
>>>> bug makes us trip over this?  I am no clock expert, so I have
>>>> absolutely no idea whether this is possible.
> Right, we need to have all these quirks because we can't define a
> generation cutoff based on family/model because X86 model is simply a
> random number. There might be some scheme behind it, but it's neither
> obvious nor documented.
>
> But the HPET on the affected machines goes south when the system enters
> PC10. So the right thing to do is to check whether PC10 is supported and
> force disable HPET if that's the case. That disablement is required
> independent of the clocksource watchdog problem because HPET is exposed
> in other ways as well.
>
> Questions for Rafael:
>
> What's the proper way to figure out whether PC10 is supported?

I can't say without research.  I think it'd be sufficient to check if 
C10 is supported, because asking for it is the only way to get PC10.

However, even if it is supported, the problem is not there until the 
kernel asks for C10.  So instead, I'd disable the TSC watchdog on the 
first attempt to ask the processor for C10 from the cpuidle code and I'd 
do that from the relevant drivers (intel_idle and ACPI idle).

There would be no TSC watchdog for the C10 users, but wouldn't that be
fair game?


> I got lost in the maze of intel_idle and ACPI muck and several other places
> which check that. Just grep for CPUID_MWAIT_LEAF and see how consistent
> all of that is.
>
> Why the heck can't we have _ONE_ authoritive implementation of that?
> Just because, right?
>
> Of course all of this is well documented as usual....
>
>>> X86 is gifted with the grant total of _0_ reliable clocks. Given no
>>> accurate time, it is impossible to tell which one of them is broken
>>> worst. Although I suppose we could attempt to synchronize against the
>>> PMU or MPERF..
>>>
>>> We could possibly disable the tsc watchdog for
>>> X86_FEATURE_TSC_KNOWN_FREQ && X86_FEATURE_TSC_ADJUST I suppose.
>>>
>>> And then have people with 'creative' BIOS get to keep the pieces.
>> Alternatively, we can change what the TSC watchdog does for
>> X86_FEATURE_TSC_ADJUST machines. Instead of checking time against HPET
>> it can check if TSC_ADJUST changes. That should make it more resillient
>> vs HPET time itself being off.
> I tried that and I hated the mess it created. Abusing the clocksource
> watchdog machinery for that is a nightmare. Don't even think about it.
>
> When TSC_ADJUST is available then we check the MSR when a CPU goes idle,
> but not more often than once per second. My concern is that we can't
> check TSC_ADJUST for modifications on fully loaded CPUs, but why do I
> still care?
>
> The requirements for ditching the watchdog should be:
>
>      X86_FEATURE_TSC_KNOWN_FREQ &&
>      X86_FEATURE_TSC_ADJUST &&
>      X86_FEATURE_NONSTOP_TSC &&
>      X86_FEATURE_ARAT
>
> But expecting X86_FEATURE_TSC_KNOWN_FREQ to be set on these HPET
> trainwreck equipped machines is wishful thinking:
>
> # cpuid -1 -l 0x15
> CPU:
>     Time Stamp Counter/Core Crystal Clock Information (0x15):
>        TSC/clock ratio = 176/2
>        nominal core crystal clock = 0 Hz
>
> We calculate that missing frequency then from leaf 0x16:
>
> # cpuid -1 -l 0x16
> CPU:
>     Processor Frequency Information (0x16):
>        Core Base Frequency (MHz) = 0x834 (2100)
>        Core Maximum Frequency (MHz) = 0x1068 (4200)
>        Bus (Reference) Frequency (MHz) = 0x64 (100)
>
> But we don't set the TSC_KNOWN_FREQ feature bit in the case that crystal
> clock is 0 and we need to use leaf 16. Which is entirely correct because
> the Core Base Frequency CPUID info is a joke:
>
> [    3.045828] tsc: Refined TSC clocksource calibration: 2111.993 MHz
>
> The refined calibration is pretty accurate according to NTP and if you
> take the CPUID 15/16 numbers into account even obvious with pure math:
>
>    TSC/clock ratio = 176/2
>    Core Base Frequency (MHz) = 0x834 (2100)
>
>    2100 / (176 / 2) = 23.8636 (MHz)
>
> which would be a very unusual crystal frequency. The refined calibration
> makes a lot more sense:
>
>    2112 / (176 / 2) = 24 (MHz)
>
> which is one of the very well known crystal frequencies of these
> machines.
>
> It's 2021 now and we are still not able to get reasonable information
> from hardware/firmware about this?
>
> Can the hardware and firmware people finaly get their act together?
>
> Here is the simple list of things we are asking for:
>
>   - Reliable TSC which cannot be tinkered with even by "value add" BIOSes
>
>   - Reliable information from hardware/firmware about the TSC frequency
>
>   - Hardware enforced (or firmware assisted) guarantees of TSC being
>     synchronized accross sockets
>
> We are asking for that for more than _twenty_ years now. All what we get
> are useless new features and as demonstrated in the case at hand new
> types of wreckage instead of a proper solution to the underlying
> problems.
>
> I'm personally dealing with the x86 timer hardware trainwrecks for more
> than 20 years now. TBH, I'm tired of this idiocy and very close to the
> point where I stop caring.
>
> Thanks,
>
>          tglx
Thomas Gleixner Sept. 21, 2021, 8:18 p.m. UTC | #12
On Tue, Sep 21 2021 at 20:05, Rafael J. Wysocki wrote:
> On 9/19/2021 2:14 AM, Thomas Gleixner wrote:
>> What's the proper way to figure out whether PC10 is supported?
>
> I can't say without research.  I think it'd be sufficient to check if 
> C10 is supported, because asking for it is the only way to get PC10.

Do we have a common function for that or do I need to implement the
gazillionth CPUID query for that?

> However, even if it is supported, the problem is not there until the 
> kernel asks for C10.  So instead, I'd disable the TSC watchdog on the 
> first attempt to ask the processor for C10 from the cpuidle code and I'd 
> do that from the relevant drivers (intel_idle and ACPI idle).
>
> There would be no TSC watchdog for the C10 users, but wouldn't that be a 
> fair game?

Not really, because that makes any other HPET usage broken as well and we
can't pull the rug out from under that. So we are better off disabling it
upfront.

Thanks,

        tglx
Rafael J. Wysocki Sept. 22, 2021, 8:27 p.m. UTC | #13
On Tue, Sep 21, 2021 at 10:18 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Sep 21 2021 at 20:05, Rafael J. Wysocki wrote:
> > On 9/19/2021 2:14 AM, Thomas Gleixner wrote:
> >> What's the proper way to figure out whether PC10 is supported?
> >
> > I can't say without research.  I think it'd be sufficient to check if
> > C10 is supported, because asking for it is the only way to get PC10.
>
> Do we have a common function for that or do I need to implement the
> gazillionst CPUID query for that?

intel_idle has intel_idle_verify_cstate() that works on MWAIT
substates from CPUID.  It looks like this could be reused.

> > However, even if it is supported, the problem is not there until the
> > kernel asks for C10.  So instead, I'd disable the TSC watchdog on the
> > first attempt to ask the processor for C10 from the cpuidle code and I'd
> > do that from the relevant drivers (intel_idle and ACPI idle).
> >
> > There would be no TSC watchdog for the C10 users, but wouldn't that be a
> > fair game?
>
> Not really because that makes any other HPET usage broken as well and we
> can't pull the rug under that. So we are better off to disable it
> upfront.

All right.
Thomas Gleixner Sept. 22, 2021, 10:21 p.m. UTC | #14
On Wed, Sep 22 2021 at 22:27, Rafael J. Wysocki wrote:
> On Tue, Sep 21, 2021 at 10:18 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Tue, Sep 21 2021 at 20:05, Rafael J. Wysocki wrote:
>> > On 9/19/2021 2:14 AM, Thomas Gleixner wrote:
>> >> What's the proper way to figure out whether PC10 is supported?
>> >
>> > I can't say without research.  I think it'd be sufficient to check if
>> > C10 is supported, because asking for it is the only way to get PC10.
>>
>> Do we have a common function for that or do I need to implement the
>> gazillionst CPUID query for that?
>
> intel_idle has intel_idle_verify_cstate() that works on MWAIT
> substates from CPUID.  It looks like this could be reused.

Not to me. That's some cpuidle/intel_idle-specific check which depends
on cpuidle_state_table being set up, which is not available during early
boot.

The question I was asking was whether we have a central place where we can
retrieve such information w/o invoking CPUID over and over again and
applying voodoo checks on it.

Obviously we don't, which sucks.

Thanks,

        tglx
Rafael J. Wysocki Sept. 23, 2021, 10:46 a.m. UTC | #15
On Thu, Sep 23, 2021 at 12:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Sep 22 2021 at 22:27, Rafael J. Wysocki wrote:
> > On Tue, Sep 21, 2021 at 10:18 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> On Tue, Sep 21 2021 at 20:05, Rafael J. Wysocki wrote:
> >> > On 9/19/2021 2:14 AM, Thomas Gleixner wrote:
> >> >> What's the proper way to figure out whether PC10 is supported?
> >> >
> >> > I can't say without research.  I think it'd be sufficient to check if
> >> > C10 is supported, because asking for it is the only way to get PC10.
> >>
> >> Do we have a common function for that or do I need to implement the
> >> gazillionst CPUID query for that?
> >
> > intel_idle has intel_idle_verify_cstate() that works on MWAIT
> > substates from CPUID.  It looks like this could be reused.
>
> Not to me. That's some cpuidle/intel_idle specific check which depends
> on cpuidle_state_table

No, it doesn't.  The only thing this depends on is mwait_substates
which directly comes from the CPUID evaluation in intel_idle_init().
The argument is an MWAIT hint, but it doesn't have to be one, it could
be a state number.

Anyway, this is just part of what is needed.

So the way to check for PC10 support is to get the mask of MWAIT
substates from CPUID (like in intel_idle_init()), check whether there
are any substates for C10 in that mask, and check whether PC10 is enabled
in MSR_PKG_CST_CONFIG_CONTROL (like in sklh_idle_state_table_update()).

The MWAIT substates checking part could then be used by intel_idle
(and maybe by ACPI idle too).

> being set up which is not available during early boot.

> The question I was asking whether we have a central place where we can
> retrieve such information w/o invoking CPUID over and over again and
> applying voodoo checks on it.
>
> Obviously we don't, which sucks.

Well, nobody except for intel_idle wanted it, so there was not much
point in doing it centrally for just one user.
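
A sketch of the PC10 check Rafael outlines, loosely following what
intel_idle_init() and sklh_idle_state_table_update() do; the MWAIT hint
0x60 for C10 and the package C-state limit value 8 for PC10 are assumptions
carried over from intel_idle's Skylake tables, and this is not a tested
patch:

static bool __init pc10_possible(void)
{
	unsigned int eax, ebx, ecx, mwait_substates;
	unsigned long long msr;

	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
		return false;

	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);

	/* Any MWAIT sub-states advertised for C10 (hint 0x60, bits 28-31)? */
	if (!(mwait_substates & 0xF0000000))
		return false;

	/* Does the package C-state limit allow PC10 (assumed value 8)? */
	rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
	return (msr & 0xF) == 8;
}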

Patch

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 38837dad46e6..7d2de04f8750 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -716,6 +716,8 @@ static struct chipset early_qrk[] __initdata = {
 		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
 	{ PCI_VENDOR_ID_INTEL, 0x3e20,
 		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
+	{ PCI_VENDOR_ID_INTEL, 0x3e34,
+		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
 	{ PCI_VENDOR_ID_INTEL, 0x3ec4,
 		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
 	{ PCI_VENDOR_ID_INTEL, 0x8a12,