diff mbox series

PCI: Add sysfs attribute for PCI device power state

Message ID 20201102141520.831630-1-luzmaximilian@gmail.com (mailing list archive)
State Accepted, archived
Delegated to: Bjorn Helgaas
Headers show
Series PCI: Add sysfs attribute for PCI device power state | expand

Commit Message

Maximilian Luz Nov. 2, 2020, 2:15 p.m. UTC
While most PCI power-states can be queried from user-space via lspci,
this has some limits. Specifically, lspci fails to provide an accurate
value when the device is in D3cold as it has to resume the device before
it can access its power state via the configuration space, leading to it
reporting D0 or another on-state. Thus lspci can, for example, not be
used to diagnose power-consumption issues for devices that can enter
D3cold or to ensure that devices properly enter D3cold at all.

To alleviate this issue, introduce a new sysfs device attribute for the
PCI power state, showing the current power state as seen by the kernel.

Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
---
 Documentation/ABI/testing/sysfs-bus-pci |  9 +++++++++
 drivers/pci/pci-sysfs.c                 | 12 ++++++++++++
 2 files changed, 21 insertions(+)

Comments

Krzysztof Wilczyński Nov. 15, 2020, 6:08 a.m. UTC | #1
Hi Maximilian,

On 20-11-02 15:15:20, Maximilian Luz wrote:
> While most PCI power-states can be queried from user-space via lspci,
> this has some limits. Specifically, lspci fails to provide an accurate
> value when the device is in D3cold as it has to resume the device before
> it can access its power state via the configuration space, leading to it
> reporting D0 or another on-state. Thus lspci can, for example, not be
> used to diagnose power-consumption issues for devices that can enter
> D3cold or to ensure that devices properly enter D3cold at all.
> 
> To alleviate this issue, introduce a new sysfs device attribute for the
> PCI power state, showing the current power state as seen by the kernel.

Very nice!  Thank you for adding this.

[...]
> +/* PCI power state */
> +static ssize_t power_state_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	pci_power_t state = READ_ONCE(pci_dev->current_state);
> +
> +	return sprintf(buf, "%s\n", pci_power_name(state));
> +}
> +static DEVICE_ATTR_RO(power_state);
[...]

Curious, why did you decide to use the READ_ONCE() macro here?  Some
other drivers exposing data through sysfs use, but certainly not all.

Krzysztof
Maximilian Luz Nov. 15, 2020, 12:45 p.m. UTC | #2
On 11/15/20 7:08 AM, Krzysztof Wilczyński wrote:
> Hi Maximilian,
> 
> On 20-11-02 15:15:20, Maximilian Luz wrote:
>> While most PCI power-states can be queried from user-space via lspci,
>> this has some limits. Specifically, lspci fails to provide an accurate
>> value when the device is in D3cold as it has to resume the device before
>> it can access its power state via the configuration space, leading to it
>> reporting D0 or another on-state. Thus lspci can, for example, not be
>> used to diagnose power-consumption issues for devices that can enter
>> D3cold or to ensure that devices properly enter D3cold at all.
>>
>> To alleviate this issue, introduce a new sysfs device attribute for the
>> PCI power state, showing the current power state as seen by the kernel.
> 
> Very nice!  Thank you for adding this.
> 
> [...]
>> +/* PCI power state */
>> +static ssize_t power_state_show(struct device *dev,
>> +				struct device_attribute *attr, char *buf)
>> +{
>> +	struct pci_dev *pci_dev = to_pci_dev(dev);
>> +	pci_power_t state = READ_ONCE(pci_dev->current_state);
>> +
>> +	return sprintf(buf, "%s\n", pci_power_name(state));
>> +}
>> +static DEVICE_ATTR_RO(power_state);
> [...]
> 
> Curious, why did you decide to use the READ_ONCE() macro here?  Some
> other drivers exposing data through sysfs use, but certainly not all.

As far as I can tell current_state is normally guarded by the device
lock, but here we don't hold that lock. I generally prefer to be
explicit about those kinds of access, if only to document that the value
can change here.

In this case it should work fine without it, but this also has the
benefit that if someone were to add a change like

   if (state > x)
           state = y;

later on (here or even in pci_power_name() due to inlining), things
wouldn't break as we explicitly forbid the compiler to load
current_state more than once. Without the READ_ONCE, the compiler would
be theoretically allowed to do two separate reads then (although
arguably unlikely it would end up doing that).

Also there's no downside of marking it as READ_ONCE here as far as I can
tell, as in that context the value will always have to be loaded from
memory.

So in short, mostly personal preference rooted in documentation and
avoiding potential (unlikely) future mishaps.

Regards,
Max
Bjorn Helgaas Nov. 15, 2020, 8:27 p.m. UTC | #3
On Sun, Nov 15, 2020 at 01:45:18PM +0100, Maximilian Luz wrote:
> On 11/15/20 7:08 AM, Krzysztof Wilczyński wrote:
> > On 20-11-02 15:15:20, Maximilian Luz wrote:
> > > While most PCI power-states can be queried from user-space via lspci,
> > > this has some limits. Specifically, lspci fails to provide an accurate
> > > value when the device is in D3cold as it has to resume the device before
> > > it can access its power state via the configuration space, leading to it
> > > reporting D0 or another on-state. Thus lspci can, for example, not be
> > > used to diagnose power-consumption issues for devices that can enter
> > > D3cold or to ensure that devices properly enter D3cold at all.
> > > 
> > > To alleviate this issue, introduce a new sysfs device attribute for the
> > > PCI power state, showing the current power state as seen by the kernel.
> > 
> > Very nice!  Thank you for adding this.
> > 
> > [...]
> > > +/* PCI power state */
> > > +static ssize_t power_state_show(struct device *dev,
> > > +				struct device_attribute *attr, char *buf)
> > > +{
> > > +	struct pci_dev *pci_dev = to_pci_dev(dev);
> > > +	pci_power_t state = READ_ONCE(pci_dev->current_state);
> > > +
> > > +	return sprintf(buf, "%s\n", pci_power_name(state));
> > > +}
> > > +static DEVICE_ATTR_RO(power_state);
> > [...]
> > 
> > Curious, why did you decide to use the READ_ONCE() macro here?  Some
> > other drivers exposing data through sysfs use, but certainly not all.
> 
> As far as I can tell current_state is normally guarded by the device
> lock, but here we don't hold that lock. I generally prefer to be
> explicit about those kinds of access, if only to document that the value
> can change here.
> 
> In this case it should work fine without it, but this also has the
> benefit that if someone were to add a change like
> 
>   if (state > x)
>           state = y;
> 
> later on (here or even in pci_power_name() due to inlining), things
> wouldn't break as we explicitly forbid the compiler to load
> current_state more than once. Without the READ_ONCE, the compiler would
> be theoretically allowed to do two separate reads then (although
> arguably unlikely it would end up doing that).
> 
> Also there's no downside of marking it as READ_ONCE here as far as I can
> tell, as in that context the value will always have to be loaded from
> memory.

I think something read from sysfs is a snapshot with no guarantee
about how long it will remain valid, so I don't see a problem with the
value being stale by the time userspace consumes it.

If there's a downside to doing two separate reads, we could mention
that in the commit log or a comment.

If there's not a specific reason for using READ_ONCE(), I think we
should omit it because using it in one place but not others suggests
that there's something special about this place.

Bjorn
Maximilian Luz Nov. 15, 2020, 9:19 p.m. UTC | #4
On 11/15/20 9:27 PM, Bjorn Helgaas wrote:

[...]

> I think something read from sysfs is a snapshot with no guarantee
> about how long it will remain valid, so I don't see a problem with the
> value being stale by the time userspace consumes it.

I agree on this, and the READ_ONCE won't protect against it. The
READ_ONCE would only protect against future changes, e.g. something like

     const char *state_names[] = { ... };

     // check if state is invalid
     if (READ(pci_dev->current_state) >= ARRAY_SIZE(state_names))
             return sprintf(..., "invalid");
     else    // look state up in table
             return sprintf(..., state_names[READ(pci_dev->current_state)])

Note that I've explicitly marked the problematic reads here: If those
are done separately, the invalidity check may pass, but by the time the
state name is looked up, the value may have changed and may be invalid.

Note further that if we have something like

     pci_power_t state = pci_dev->current_state;

the compiler is, in theory, free to replace each access to "state" with
a read to pci_dev->current_state. As far as I can tell, the whole point
of READ_ONCE is to prevent that and ensure that there is only one read.

Note also that something like this could be easily introduced by
changing the code in pci_power_name(), as that is likely inlined by the
compiler. I'm not entirely sure, but I think that the compiler is allowed
to, at least theoretically, split that into two reads here and inlining
might be done before further optimization.

On the other hand, the changes that could lead to issues above are
fairly unlikely to cause them as the compiler will _probably_ read the
value only once anyways.

> If there's a downside to doing two separate reads, we could mention
> that in the commit log or a comment.
> 
> If there's not a specific reason for using READ_ONCE(), I think we
> should omit it because using it in one place but not others suggests
> that there's something special about this place.

I'd argue that there is indeed something special about this place:
current_state is accessed without holding the device lock (unless I'm
mistaken and sysfs attributes do acquire the device lock automatically)
and the state is normally only accessed/changed under it.

Apart from (hopefully) preventing somewhat unlikely future issues and
highlighting that it is (somewhat of a) special case, the READ_ONCE does
not serve any purpose here. As the code is now, omitting it will not
cause any issues (or really should not make any difference in produced
code).

All in all, I'm not entirely sure that it's a good idea to drop the
READ_ONCE, but I'll defer to you for that judgement.

Regards,
Max
Bjorn Helgaas Nov. 18, 2020, 5:49 p.m. UTC | #5
On Sun, Nov 15, 2020 at 10:19:22PM +0100, Maximilian Luz wrote:
> On 11/15/20 9:27 PM, Bjorn Helgaas wrote:
> 
> [...]
> 
> > I think something read from sysfs is a snapshot with no guarantee
> > about how long it will remain valid, so I don't see a problem with the
> > value being stale by the time userspace consumes it.
> 
> I agree on this, and the READ_ONCE won't protect against it. The
> READ_ONCE would only protect against future changes, e.g. something like
> 
>     const char *state_names[] = { ... };
> 
>     // check if state is invalid
>     if (READ(pci_dev->current_state) >= ARRAY_SIZE(state_names))
>             return sprintf(..., "invalid");
>     else    // look state up in table
>             return sprintf(..., state_names[READ(pci_dev->current_state)])
> 
> Note that I've explicitly marked the problematic reads here: If those
> are done separately, the invalidity check may pass, but by the time the
> state name is looked up, the value may have changed and may be invalid.
> 
> Note further that if we have something like
> 
>     pci_power_t state = pci_dev->current_state;
> 
> the compiler is, in theory, free to replace each access to "state" with
> a read to pci_dev->current_state. As far as I can tell, the whole point
> of READ_ONCE is to prevent that and ensure that there is only one read.
> 
> Note also that something like this could be easily introduced by
> changing the code in pci_power_name(), as that is likely inlined by the
> compiler. I'm not entirely sure, but I think that the compiler is allowed
> to, at least theoretically, split that into two reads here and inlining
> might be done before further optimization.
> 
> On the other hand, the changes that could lead to issues above are
> fairly unlikely to cause them as the compiler will _probably_ read the
> value only once anyways.

Well, OK, I see your point.  But I'm not convinced it's worth
cluttering the code for this.  There must be dozens of similar cases,
and if we do need to worry about this, I'd like to do it
systematically for all of drivers/pci/ instead of doing it piecemeal.

I do think it's probably worth making sure we can't set
dev->current_state to something that's invalid, and also taking a look
at the PCI core interfaces that take a pci_power_t, i.e., those in
include/linux/pci.h, to make sure they do the right thing if a driver
supplies garbage.

> > If there's a downside to doing two separate reads, we could mention
> > that in the commit log or a comment.
> > 
> > If there's not a specific reason for using READ_ONCE(), I think we
> > should omit it because using it in one place but not others suggests
> > that there's something special about this place.
> 
> I'd argue that there is indeed something special about this place:
> current_state is accessed without holding the device lock (unless I'm
> mistaken and sysfs attributes do acquire the device lock automatically)
> and the state is normally only accessed/changed under it.
> 
> Apart from (hopefully) preventing somewhat unlikely future issues and
> highlighting that it is (somewhat of a) special case, the READ_ONCE does
> not serve any purpose here. As the code is now, omitting it will not
> cause any issues (or really should not make any difference in produced
> code).
> 
> All in all, I'm not entirely sure that it's a good idea to drop the
> READ_ONCE, but I'll defer to you for that judgement.
> 
> Regards,
> Max
Bjorn Helgaas Nov. 18, 2020, 7:19 p.m. UTC | #6
[+cc Krzysztof, Rafael in case you have a suggestion about the
filename (or anythnig else :))]

On Mon, Nov 02, 2020 at 03:15:20PM +0100, Maximilian Luz wrote:
> While most PCI power-states can be queried from user-space via lspci,
> this has some limits. Specifically, lspci fails to provide an accurate
> value when the device is in D3cold as it has to resume the device before
> it can access its power state via the configuration space, leading to it
> reporting D0 or another on-state. Thus lspci can, for example, not be
> used to diagnose power-consumption issues for devices that can enter
> D3cold or to ensure that devices properly enter D3cold at all.
> 
> To alleviate this issue, introduce a new sysfs device attribute for the
> PCI power state, showing the current power state as seen by the kernel.
> 
> Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>

Applied as below to pci/pm for v5.11.

> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  9 +++++++++
>  drivers/pci/pci-sysfs.c                 | 12 ++++++++++++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 450296cc7948..881040af2611 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -360,3 +360,12 @@ Contact:	Heiner Kallweit <hkallweit1@gmail.com>
>  Description:	If ASPM is supported for an endpoint, these files can be
>  		used to disable or enable the individual power management
>  		states. Write y/1/on to enable, n/0/off to disable.
> +
> +What:		/sys/bus/pci/devices/.../power_state

I guess this will be alongside the existing "power/" directory.
Rafael, is there any precedent we should be following here?

> +Date:		November 2020
> +Contact:	Linux PCI developers <linux-pci@vger.kernel.org>
> +Description:
> +		This file contains the current PCI power state of the device.
> +		The value comes from the PCI kernel device state and can be one
> +		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> +		The file is read only.
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index d15c881e2e7e..b15f754e6346 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -124,6 +124,17 @@ static ssize_t cpulistaffinity_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(cpulistaffinity);
>  
> +/* PCI power state */
> +static ssize_t power_state_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	pci_power_t state = READ_ONCE(pci_dev->current_state);
> +
> +	return sprintf(buf, "%s\n", pci_power_name(state));
> +}
> +static DEVICE_ATTR_RO(power_state);
> +
>  /* show resources */
>  static ssize_t resource_show(struct device *dev, struct device_attribute *attr,
>  			     char *buf)
> @@ -581,6 +592,7 @@ static ssize_t driver_override_show(struct device *dev,
>  static DEVICE_ATTR_RW(driver_override);
>  
>  static struct attribute *pci_dev_attrs[] = {
> +	&dev_attr_power_state.attr,
>  	&dev_attr_resource.attr,
>  	&dev_attr_vendor.attr,
>  	&dev_attr_device.attr,

commit 9f1c0ebea21a ("PCI: Add sysfs attribute for device power state")
Author: Maximilian Luz <luzmaximilian@gmail.com>
Date:   Mon Nov 2 15:15:20 2020 +0100

    PCI: Add sysfs attribute for device power state
    
    While PCI power states D0-D3hot can be queried from user-space via lspci,
    D3cold cannot.  lspci cannot provide an accurate value when the device is
    in D3cold as it has to restore the device to D0 before it can access its
    power state via the configuration space, leading to it reporting D0 or
    another on-state. Thus lspci cannot be used to diagnose power consumption
    issues for devices that can enter D3cold or to ensure that devices properly
    enter D3cold at all.
    
    Add a new sysfs device attribute for the PCI power state, showing the
    current power state as seen by the kernel.
    
    [bhelgaas: drop READ_ONCE(), see discussion at the link]
    Link: https://lore.kernel.org/r/20201102141520.831630-1-luzmaximilian@gmail.com
    Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 450296cc7948..881040af2611 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -360,3 +360,12 @@ Contact:	Heiner Kallweit <hkallweit1@gmail.com>
 Description:	If ASPM is supported for an endpoint, these files can be
 		used to disable or enable the individual power management
 		states. Write y/1/on to enable, n/0/off to disable.
+
+What:		/sys/bus/pci/devices/.../power_state
+Date:		November 2020
+Contact:	Linux PCI developers <linux-pci@vger.kernel.org>
+Description:
+		This file contains the current PCI power state of the device.
+		The value comes from the PCI kernel device state and can be one
+		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
+		The file is read only.
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d15c881e2e7e..fb072f4b3176 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -124,6 +124,15 @@ static ssize_t cpulistaffinity_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(cpulistaffinity);
 
+static ssize_t power_state_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%s\n", pci_power_name(pdev->current_state));
+}
+static DEVICE_ATTR_RO(power_state);
+
 /* show resources */
 static ssize_t resource_show(struct device *dev, struct device_attribute *attr,
 			     char *buf)
@@ -581,6 +590,7 @@ static ssize_t driver_override_show(struct device *dev,
 static DEVICE_ATTR_RW(driver_override);
 
 static struct attribute *pci_dev_attrs[] = {
+	&dev_attr_power_state.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_vendor.attr,
 	&dev_attr_device.attr,
Maximilian Luz Nov. 18, 2020, 11:28 p.m. UTC | #7
On 11/18/20 6:49 PM, Bjorn Helgaas wrote:
> On Sun, Nov 15, 2020 at 10:19:22PM +0100, Maximilian Luz wrote:
>> On 11/15/20 9:27 PM, Bjorn Helgaas wrote:
>>
>> [...]
>>
>>> I think something read from sysfs is a snapshot with no guarantee
>>> about how long it will remain valid, so I don't see a problem with the
>>> value being stale by the time userspace consumes it.
>>
>> I agree on this, and the READ_ONCE won't protect against it. The
>> READ_ONCE would only protect against future changes, e.g. something like
>>
>>      const char *state_names[] = { ... };
>>
>>      // check if state is invalid
>>      if (READ(pci_dev->current_state) >= ARRAY_SIZE(state_names))
>>              return sprintf(..., "invalid");
>>      else    // look state up in table
>>              return sprintf(..., state_names[READ(pci_dev->current_state)])
>>
>> Note that I've explicitly marked the problematic reads here: If those
>> are done separately, the invalidity check may pass, but by the time the
>> state name is looked up, the value may have changed and may be invalid.
>>
>> Note further that if we have something like
>>
>>      pci_power_t state = pci_dev->current_state;
>>
>> the compiler is, in theory, free to replace each access to "state" with
>> a read to pci_dev->current_state. As far as I can tell, the whole point
>> of READ_ONCE is to prevent that and ensure that there is only one read.
>>
>> Note also that something like this could be easily introduced by
>> changing the code in pci_power_name(), as that is likely inlined by the
>> compiler. I'm not entirely sure, but I think that the compiler is allowed
>> to, at least theoretically, split that into two reads here and inlining
>> might be done before further optimization.
>>
>> On the other hand, the changes that could lead to issues above are
>> fairly unlikely to cause them as the compiler will _probably_ read the
>> value only once anyways.
> 
> Well, OK, I see your point.  But I'm not convinced it's worth
> cluttering the code for this.  There must be dozens of similar cases,
> and if we do need to worry about this, I'd like to do it
> systematically for all of drivers/pci/ instead of doing it piecemeal.

Fair enough, that's a valid point.

For full formal correctness, writes to current_state should probably
have also been guarded with WRITE_ONCE to prevent the compiler from
splitting the write instruction in addition to the read with READ_ONCE
in this patch.

Again, that's mostly a point about formal correctness and issues like
that shouldn't happen in practice (or if they would, would probably have
broken other parts of the kernel already).

Looking at other sysfs functions, it seems like most of them would have
the same issues, so it makes sense to drop the READ_ONCE.

> I do think it's probably worth making sure we can't set
> dev->current_state to something that's invalid, and also taking a look
> at the PCI core interfaces that take a pci_power_t, i.e., those in
> include/linux/pci.h, to make sure they do the right thing if a driver
> supplies garbage.

As far as I can tell, those checks should already be there.

Thanks,
Max
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 450296cc7948..881040af2611 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -360,3 +360,12 @@  Contact:	Heiner Kallweit <hkallweit1@gmail.com>
 Description:	If ASPM is supported for an endpoint, these files can be
 		used to disable or enable the individual power management
 		states. Write y/1/on to enable, n/0/off to disable.
+
+What:		/sys/bus/pci/devices/.../power_state
+Date:		November 2020
+Contact:	Linux PCI developers <linux-pci@vger.kernel.org>
+Description:
+		This file contains the current PCI power state of the device.
+		The value comes from the PCI kernel device state and can be one
+		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
+		The file is read only.
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d15c881e2e7e..b15f754e6346 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -124,6 +124,17 @@  static ssize_t cpulistaffinity_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(cpulistaffinity);
 
+/* PCI power state */
+static ssize_t power_state_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+	pci_power_t state = READ_ONCE(pci_dev->current_state);
+
+	return sprintf(buf, "%s\n", pci_power_name(state));
+}
+static DEVICE_ATTR_RO(power_state);
+
 /* show resources */
 static ssize_t resource_show(struct device *dev, struct device_attribute *attr,
 			     char *buf)
@@ -581,6 +592,7 @@  static ssize_t driver_override_show(struct device *dev,
 static DEVICE_ATTR_RW(driver_override);
 
 static struct attribute *pci_dev_attrs[] = {
+	&dev_attr_power_state.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_vendor.attr,
 	&dev_attr_device.attr,