hwmon: (peci/dimmtemp) Do not provide fake thresholds data

Message ID	20250123122003.6010-1-fercerpav@gmail.com (mailing list archive)
State	Accepted
From: Paul Fertser <fercerpav@gmail.com>
To: Iwona Winiarska <iwona.winiarska@intel.com>, Jean Delvare <jdelvare@suse.com>, Guenter Roeck <linux@roeck-us.net>, Pierre-Louis Bossart <pierre-louis.bossart@linux.dev>, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>, Patrick Rudolph <patrick.rudolph@9elements.com>, Naresh Solanki <Naresh.Solanki@9elements.com>
Cc: Joel Stanley <joel@jms.id.au>, linux-hwmon@vger.kernel.org, linux-kernel@vger.kernel.org, openbmc@lists.ozlabs.org, Ivan Mikhaylov <fr0st61te@gmail.com>, Paul Fertser <fercerpav@gmail.com>, stable@vger.kernel.org
Subject: [PATCH] hwmon: (peci/dimmtemp) Do not provide fake thresholds data
Date: Thu, 23 Jan 2025 15:20:02 +0300

Paul Fertser Jan. 23, 2025, 12:20 p.m. UTC

When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
critical thresholds for particular DIMM the driver should return an
error to the userspace instead of giving it stale (best case) or wrong
(the structure contains all zeros after kzalloc() call) data.

The issue can be reproduced by binding the peci driver while the host is
fully booted and idle, this makes PECI interaction unreliable enough.

Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Fertser <fercerpav@gmail.com>
---
 drivers/hwmon/peci/dimmtemp.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)
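
To illustrate the behaviour change described in the commit message, here
is a minimal userspace model, not the driver code itself. The struct, the
function names ending in _model/_old/_new, and the 85000/95000 values are
all hypothetical; only the -ENODATA handling mirrors what the patch
changes in update_thresholds().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cached thresholds; zeroed like kzalloc() memory. */
struct dimm_thresholds {
	long temp_max;  /* millidegrees Celsius */
	long temp_crit; /* millidegrees Celsius */
};

/* Hypothetical reader: fails with -ENODATA while the CPU does not answer PECI. */
static int read_thresholds_model(bool cpu_responding, struct dimm_thresholds *out)
{
	assert(out != NULL);
	if (!cpu_responding)
		return -ENODATA;
	out->temp_max = 85000;  /* illustrative values only */
	out->temp_crit = 95000;
	return 0;
}

/* Before the patch: -ENODATA is swallowed, so the caller sees success. */
static int update_thresholds_old(bool cpu_responding, struct dimm_thresholds *cache)
{
	int ret = read_thresholds_model(cpu_responding, cache);

	if (ret == -ENODATA) /* "Use default or previous value" */
		return 0;
	return ret;
}

/* After the patch: every read error is propagated to the caller. */
static int update_thresholds_new(bool cpu_responding, struct dimm_thresholds *cache)
{
	return read_thresholds_model(cpu_responding, cache);
}
```

With a freshly zeroed cache and a non-responding CPU, the old variant
reports success while the thresholds are still 0, which is the "fake"
data the subject line refers to; the new variant returns -ENODATA.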

Winiarska, Iwona Jan. 27, 2025, 4:40 p.m. UTC | #1

On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
> When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
> critical thresholds for particular DIMM the driver should return an
> error to the userspace instead of giving it stale (best case) or wrong
> (the structure contains all zeros after kzalloc() call) data.
> 
> The issue can be reproduced by binding the peci driver while the host is
> fully booted and idle, this makes PECI interaction unreliable enough.
> 
> Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
> Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Paul Fertser <fercerpav@gmail.com>

Hi!

Thank you for the patch.
Did you have a chance to test it with OpenBMC dbus-sensors?
In general, the change looks okay to me, but since it modifies the behavior
(applications will need to handle this, and returning an error will happen more
often) we need to confirm that it does not cause any regressions for userspace.

Once we are able to confirm that:

Reviewed-by: Iwona Winiarska <iwona.winiarska@intel.com>

Thanks
-Iwona

> ---
>  drivers/hwmon/peci/dimmtemp.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/hwmon/peci/dimmtemp.c b/drivers/hwmon/peci/dimmtemp.c
> index d6762259dd69..fbe82d9852e0 100644
> --- a/drivers/hwmon/peci/dimmtemp.c
> +++ b/drivers/hwmon/peci/dimmtemp.c
> @@ -127,8 +127,6 @@ static int update_thresholds(struct peci_dimmtemp *priv, int dimm_no)
>  		return 0;
>  
>  	ret = priv->gen_info->read_thresholds(priv, dimm_order, chan_rank, &data);
> -	if (ret == -ENODATA) /* Use default or previous value */
> -		return 0;
>  	if (ret)
>  		return ret;
>  
> @@ -509,11 +507,11 @@ read_thresholds_icx(struct peci_dimmtemp *priv, int dimm_order, int chan_rank, u
>  
>  	ret = peci_ep_pci_local_read(priv->peci_dev, 0, 13, 0, 2, 0xd4, &reg_val);
>  	if (ret || !(reg_val & BIT(31)))
> -		return -ENODATA; /* Use default or previous value */
> +		return -ENODATA;
>  
>  	ret = peci_ep_pci_local_read(priv->peci_dev, 0, 13, 0, 2, 0xd0, &reg_val);
>  	if (ret)
> -		return -ENODATA; /* Use default or previous value */
> +		return -ENODATA;
>  
>  	/*
>  	 * Device 26, Offset 224e0: IMC 0 channel 0 -> rank 0
> @@ -546,11 +544,11 @@ read_thresholds_spr(struct peci_dimmtemp *priv, int dimm_order, int chan_rank, u
>  
>  	ret = peci_ep_pci_local_read(priv->peci_dev, 0, 30, 0, 2, 0xd4, &reg_val);
>  	if (ret || !(reg_val & BIT(31)))
> -		return -ENODATA; /* Use default or previous value */
> +		return -ENODATA;
>  
>  	ret = peci_ep_pci_local_read(priv->peci_dev, 0, 30, 0, 2, 0xd0, &reg_val);
>  	if (ret)
> -		return -ENODATA; /* Use default or previous value */
> +		return -ENODATA;
>  
>  	/*
>  	 * Device 26, Offset 219a8: IMC 0 channel 0 -> rank 0

Guenter Roeck Jan. 27, 2025, 5:29 p.m. UTC | #2

On 1/27/25 08:40, Winiarska, Iwona wrote:
> On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
>> When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
>> critical thresholds for particular DIMM the driver should return an
>> error to the userspace instead of giving it stale (best case) or wrong
>> (the structure contains all zeros after kzalloc() call) data.
>>
>> The issue can be reproduced by binding the peci driver while the host is
>> fully booted and idle, this makes PECI interaction unreliable enough.
>>
>> Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
>> Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Paul Fertser <fercerpav@gmail.com>
> 
> Hi!
> 
> Thank you for the patch.
> Did you have a chance to test it with OpenBMC dbus-sensors?
> In general, the change looks okay to me, but since it modifies the behavior
> (applications will need to handle this, and returning an error will happen more
> often) we need to confirm that it does not cause any regressions for userspace.
> 

I would also like to understand if the error is temporary or permanent.
If it is permanent, the attributes should not be created in the first
place. It does not make sense to have limit attributes which always report
-ENODATA.

Guenter

Paul Fertser Jan. 27, 2025, 6:30 p.m. UTC | #3

Hi Guenter,

On Mon, Jan 27, 2025 at 09:29:39AM -0800, Guenter Roeck wrote:
> On 1/27/25 08:40, Winiarska, Iwona wrote:
> > On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
> > > When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
> > > critical thresholds for particular DIMM the driver should return an
> > > error to the userspace instead of giving it stale (best case) or wrong
> > > (the structure contains all zeros after kzalloc() call) data.
> > > 
> > > The issue can be reproduced by binding the peci driver while the host is
> > > fully booted and idle, this makes PECI interaction unreliable enough.
> > > 
> > > Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
> > > Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Paul Fertser <fercerpav@gmail.com>
> > 
> > Hi!
> > 
> > Thank you for the patch.
> > Did you have a chance to test it with OpenBMC dbus-sensors?
> > In general, the change looks okay to me, but since it modifies the behavior
> > (applications will need to handle this, and returning an error will happen more
> > often) we need to confirm that it does not cause any regressions for userspace.
> > 
> 
> I would also like to understand if the error is temporary or permanent.
> If it is permanent, the attributes should not be created in the first
> place. It does not make sense to have limit attributes which always report
> -ENODATA.

The error is temporary. The underlying reason is that when host CPUs
go to deep enough idle sleep state (probably C6) they stop responding
to PECI requests from BMC. Once something starts running the CPU
leaves C6 and starts responding and all the temperature data
(including the thresholds) becomes available again.

Guenter Roeck Jan. 27, 2025, 6:39 p.m. UTC | #4

On 1/27/25 10:30, Paul Fertser wrote:
> Hi Guenter,
> 
> On Mon, Jan 27, 2025 at 09:29:39AM -0800, Guenter Roeck wrote:
>> On 1/27/25 08:40, Winiarska, Iwona wrote:
>>> On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
>>>> When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
>>>> critical thresholds for particular DIMM the driver should return an
>>>> error to the userspace instead of giving it stale (best case) or wrong
>>>> (the structure contains all zeros after kzalloc() call) data.
>>>>
>>>> The issue can be reproduced by binding the peci driver while the host is
>>>> fully booted and idle, this makes PECI interaction unreliable enough.
>>>>
>>>> Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
>>>> Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Paul Fertser <fercerpav@gmail.com>
>>>
>>> Hi!
>>>
>>> Thank you for the patch.
>>> Did you have a chance to test it with OpenBMC dbus-sensors?
>>> In general, the change looks okay to me, but since it modifies the behavior
>>> (applications will need to handle this, and returning an error will happen more
>>> often) we need to confirm that it does not cause any regressions for userspace.
>>>
>>
>> I would also like to understand if the error is temporary or permanent.
>> If it is permanent, the attributes should not be created in the first
>> place. It does not make sense to have limit attributes which always report
>> -ENODATA.
> 
> The error is temporary. The underlying reason is that when host CPUs
> go to deep enough idle sleep state (probably C6) they stop responding
> to PECI requests from BMC. Once something starts running the CPU
> leaves C6 and starts responding and all the temperature data
> (including the thresholds) becomes available again.
> 

Thanks.

Next question: Is there evidence that the thresholds change while the CPU
is in a deep sleep state (or, in other words, that they are indeed stale) ?
Because if not it would be (much) better to only report -ENODATA if the
thresholds are uninitialized, and it would be even better than that if the
limits are read during initialization (and not updated at all) if they do
not change dynamically.

Guenter
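
The alternative raised in this message (report -ENODATA only while the
thresholds are still uninitialized, otherwise keep the last good values)
could be modelled roughly as below. This is a hypothetical userspace
sketch for discussion, not code from the driver and not what was
eventually applied; the struct, its "valid" flag and the function name
are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache with an explicit "has ever been read" flag. */
struct cached_thresholds {
	bool valid;
	long temp_max;  /* millidegrees Celsius */
	long temp_crit; /* millidegrees Celsius */
};

/*
 * read_ret is the result of a (not shown) hardware read; max and crit are
 * only meaningful when read_ret == 0.
 */
static int update_thresholds_cached(struct cached_thresholds *c, int read_ret,
				    long max, long crit)
{
	assert(c != NULL);

	if (read_ret == 0) {
		c->temp_max = max;
		c->temp_crit = crit;
		c->valid = true;
		return 0;
	}

	/* A failed read is tolerated only if a previous read succeeded. */
	if (read_ret == -ENODATA && c->valid)
		return 0;

	return read_ret;
}
```

The trade-off is that previously read values keep being reported even if
the hardware behind them has changed in the meantime.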

Paul Fertser Jan. 27, 2025, 6:54 p.m. UTC | #5

Hi Iwona,

Thank you for the review. Please see inline.

On Mon, Jan 27, 2025 at 04:40:52PM +0000, Winiarska, Iwona wrote:
> On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
> > When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
> > critical thresholds for particular DIMM the driver should return an
> > error to the userspace instead of giving it stale (best case) or wrong
> > (the structure contains all zeros after kzalloc() call) data.
> > 
> > The issue can be reproduced by binding the peci driver while the host is
> > fully booted and idle, this makes PECI interaction unreliable enough.
> > 
> > Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
> > Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Paul Fertser <fercerpav@gmail.com>
> 
> Did you have a chance to test it with OpenBMC dbus-sensors?

Using OpenBMC dbus-sensors is exactly the reason why I'm sending this
patch, so yes, I tested it before and after the change.

> In general, the change looks okay to me, but since it modifies the behavior
> (applications will need to handle this, and returning an error will happen more
> often) we need to confirm that it does not cause any regressions for userspace.

The change is prompted by the current behaviour which is unacceptably
bad: every now and then while powering on the host for the first time
BMC happens to request one of the memory thresholds at a wrong time
(e.g. when UEFI is busy doing something which prevents normal PECI
operation); this leads to the unfixed kernel code returning zero and
dbus-sensors happily using that as a threshold value which later
results in bogus critical over temperature events for the affected
DIMM (as their normal temperatures are always above zero). It was
relatively easy to reproduce on an IceLake-based system.

I consider the current behaviour (in case of PECI timeouts when
requesting DIMM temperature thresholds) to be so broken that changing
it to do the right thing can only do good. The non-failure case is not
affected by this patch.

That said, for sensible operation a dbus-sensors change is indeed
needed and I now have a patch pending upstream review[0] to handle
those errors by retrying until success. Without the patch the daemon
would just load with those thresholds missing but it's better to have
thresholds missing than to have them at zero producing a critical error
right away I think.

[0] https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/77500/
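
For readers who want a concrete picture of "retrying until success" on
the userspace side, a minimal C sketch follows. It is not the
dbus-sensors change referenced above (which is a separate C++ patch);
the function names, the bounded attempt count and the delay parameter
are assumptions. It relies only on the fact that a failing sysfs read
surfaces the driver's error through errno.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Read one integer attribute; returns 0 or a negative errno value. */
static int read_attr_once(const char *path, long *out)
{
	char buf[32];
	ssize_t n;
	int fd, err;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	n = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* save before close() can overwrite it */
	close(fd);

	if (n < 0)
		return -err;
	if (n == 0)
		return -ENODATA; /* empty attribute: treat as "no data yet" */

	buf[n] = '\0';
	*out = strtol(buf, NULL, 10);
	return 0;
}

/* Retry only on -ENODATA, up to "attempts" times, sleeping delay_ms between tries. */
static int read_threshold_retry(const char *path, int attempts, long delay_ms,
				long *out)
{
	struct timespec delay = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
	int ret = -ENODATA;

	assert(path != NULL && out != NULL);

	for (int i = 0; i < attempts; i++) {
		ret = read_attr_once(path, out);
		if (ret != -ENODATA)
			break;
		if (i + 1 < attempts)
			nanosleep(&delay, NULL);
	}
	return ret;
}
```

A caller would pass the path of a threshold attribute such as a
tempN_max file under the hwmon device directory, and treat a final
-ENODATA as "threshold not available yet" rather than as zero.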

Paul Fertser Jan. 27, 2025, 7:10 p.m. UTC | #6

On Mon, Jan 27, 2025 at 10:39:44AM -0800, Guenter Roeck wrote:
> On 1/27/25 10:30, Paul Fertser wrote:
> > Hi Guenter,
> > 
> > On Mon, Jan 27, 2025 at 09:29:39AM -0800, Guenter Roeck wrote:
> > > On 1/27/25 08:40, Winiarska, Iwona wrote:
> > > > On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
> > > > > When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
> > > > > critical thresholds for particular DIMM the driver should return an
> > > > > error to the userspace instead of giving it stale (best case) or wrong
> > > > > (the structure contains all zeros after kzalloc() call) data.
> > > > > 
> > > > > The issue can be reproduced by binding the peci driver while the host is
> > > > > fully booted and idle, this makes PECI interaction unreliable enough.
> > > > > 
> > > > > Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
> > > > > Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
> > > > > Cc: stable@vger.kernel.org
> > > > > Signed-off-by: Paul Fertser <fercerpav@gmail.com>
> > > > 
> > > > Hi!
> > > > 
> > > > Thank you for the patch.
> > > > Did you have a chance to test it with OpenBMC dbus-sensors?
> > > > In general, the change looks okay to me, but since it modifies the behavior
> > > > (applications will need to handle this, and returning an error will happen more
> > > > often) we need to confirm that it does not cause any regressions for userspace.
> > > > 
> > > 
> > > I would also like to understand if the error is temporary or permanent.
> > > If it is permanent, the attributes should not be created in the first
> > > place. It does not make sense to have limit attributes which always report
> > > -ENODATA.
> > 
> > The error is temporary. The underlying reason is that when host CPUs
> > go to deep enough idle sleep state (probably C6) they stop responding
> > to PECI requests from BMC. Once something starts running the CPU
> > leaves C6 and starts responding and all the temperature data
> > (including the thresholds) becomes available again.
> > 
> 
> Thanks.
> 
> Next question: Is there evidence that the thresholds change while the CPU
> is in a deep sleep state (or, in other words, that they are indeed stale) ?
> Because if not it would be (much) better to only report -ENODATA if the
> thresholds are uninitialized, and it would be even better than that if the
> limits are read during initialization (and not updated at all) if they do
> not change dynamically.

From the BMC's point of view, when getting a timeout there is little
difference between the host not answering because it is in an idle deep
sleep state and the host being completely powered off. Now I can imagine
a server system where the BMC keeps running and the server has its DIMMs
physically changed to a different model with different thresholds.

Whether it's realistic scenario and whether it's worth caching the
thresholds in the kernel I hope Iwona can clarify. In my current
opinion the added complexity isn't worth it, the PECI operation needs
to be reliable enough anyway for BMC to monitor at least the CPU
temperatures once a second to feed this essential data to the cooling
fans control loop. And if we can read CPU temperatures we can also
read DIMM thresholds when we need them and, worst case, retry a few
times while starting up the daemon.

Guenter Roeck Jan. 28, 2025, 3:34 a.m. UTC | #7

On 1/27/25 11:10, Paul Fertser wrote:
> On Mon, Jan 27, 2025 at 10:39:44AM -0800, Guenter Roeck wrote:
>> On 1/27/25 10:30, Paul Fertser wrote:
>>> Hi Guenter,
>>>
>>> On Mon, Jan 27, 2025 at 09:29:39AM -0800, Guenter Roeck wrote:
>>>> On 1/27/25 08:40, Winiarska, Iwona wrote:
>>>>> On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote:
>>>>>> When an Icelake or Sapphire Rapids CPU isn't providing the maximum and
>>>>>> critical thresholds for particular DIMM the driver should return an
>>>>>> error to the userspace instead of giving it stale (best case) or wrong
>>>>>> (the structure contains all zeros after kzalloc() call) data.
>>>>>>
>>>>>> The issue can be reproduced by binding the peci driver while the host is
>>>>>> fully booted and idle, this makes PECI interaction unreliable enough.
>>>>>>
>>>>>> Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver")
>>>>>> Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support")
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Signed-off-by: Paul Fertser <fercerpav@gmail.com>
>>>>>
>>>>> Hi!
>>>>>
>>>>> Thank you for the patch.
>>>>> Did you have a chance to test it with OpenBMC dbus-sensors?
>>>>> In general, the change looks okay to me, but since it modifies the behavior
>>>>> (applications will need to handle this, and returning an error will happen more
>>>>> often) we need to confirm that it does not cause any regressions for userspace.
>>>>>
>>>>
>>>> I would also like to understand if the error is temporary or permanent.
>>>> If it is permanent, the attributes should not be created in the first
>>>> place. It does not make sense to have limit attributes which always report
>>>> -ENODATA.
>>>
>>> The error is temporary. The underlying reason is that when host CPUs
>>> go to deep enough idle sleep state (probably C6) they stop responding
>>> to PECI requests from BMC. Once something starts running the CPU
>>> leaves C6 and starts responding and all the temperature data
>>> (including the thresholds) becomes available again.
>>>
>>
>> Thanks.
>>
>> Next question: Is there evidence that the thresholds change while the CPU
>> is in a deep sleep state (or, in other words, that they are indeed stale) ?
>> Because if not it would be (much) better to only report -ENODATA if the
>> thresholds are uninitialized, and it would be even better than that if the
>> limits are read during initialization (and not updated at all) if they do
>> not change dynamically.
> 
> From BMC point of view when getting a timeout there is little
> difference between the host not answering being in idle deep sleep
> state and between host being completely powered off. Now I can imagine
> a server system where BMC keeps running and the server has its DIMMs
> physically changed to a different model with different threshold.
> 
> Whether it's realistic scenario and whether it's worth caching the
> thresholds in the kernel I hope Iwona can clarify. In my current
> opinion the added complexity isn't worth it, the PECI operation needs
> to be reliable enough anyway for BMC to monitor at least the CPU
> temperatures once a second to feed this essential data to the cooling
> fans control loop. And if we can read CPU temperatures we can also
> read DIMM thresholds when we need them and worse case retry a few
> times while starting up the daemon.
> 

Makes sense.

Applied.

Thanks,
Guenter
