diff mbox series

[RFC,net-next,1/8] ethtool: Add ability to control transceiver modules' low power mode

Message ID 20210809102152.719961-2-idosch@idosch.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series ethtool: Add ability to control transceiver modules | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 12 maintainers not CCed: corbet@lwn.net arnd@arndb.de ffmancera@riseup.net yangbo.lu@nxp.com danieller@nvidia.com zhengyongjun3@huawei.com saeedm@nvidia.com yajun.deng@linux.dev hkallweit1@gmail.com linux-doc@vger.kernel.org vladyslavt@nvidia.com johannes.berg@intel.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 2069 this patch: 2069
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 2061 this patch: 2061
netdev/header_inline success Link

Commit Message

Ido Schimmel Aug. 9, 2021, 10:21 a.m. UTC
From: Ido Schimmel <idosch@nvidia.com>

Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
modules parameters and retrieve their status.

The first parameter to control is the low power mode of the module. It
is only relevant for paged memory modules, as flat memory modules always
operate in low power mode.

When a paged memory module is in low power mode, its power consumption
is reduced to the minimum, the management interface towards the host is
available and the data path is deactivated.

User space can choose to put modules that are not currently in use in
low power mode and transition them to high power mode before putting the
associated ports administratively up.

Transitioning into low power mode means loss of carrier, so error is
returned when the netdev is administratively up.

The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).

The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the CPU.

CMIS testing
============

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

The module is not in low power mode, as it is not forced by hardware
(LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off).

The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space.

 $ ethtool --show-module swp11
 Module parameters for swp11:
 low-power false

Turn on low power mode:

 # ethtool --set-module swp11 low-power on

Query low power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 low-power true

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x01 (ModuleLowPwr)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : On

Allow the module to transition out of low power mode:

 # ethtool --set-module swp11 low-power off

Query low power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 low-power false

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

SFF-8636 testing
================

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7733 mW / -1.12 dBm
 Transmit avg optical power (Channel 2)    : 0.7649 mW / -1.16 dBm
 Transmit avg optical power (Channel 3)    : 0.7743 mW / -1.11 dBm
 Transmit avg optical power (Channel 4)    : 0.7837 mW / -1.06 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9186 mW / -0.37 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9136 mW / -0.39 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.8986 mW / -0.46 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8701 mW / -0.60 dBm

The module is not in low power mode, as it is not forced by hardware
(Power override is on) or by software (Power set is off).

The low power mode can be queried from the kernel. In case Power
override was off, the kernel would need to take into account the state
of the LPMode signal, which is not visible to user space.

 $ ethtool --show-module swp13
 Module parameters for swp13:
 low-power false

Turn on low power mode:

 # ethtool --set-module swp13 low-power on

Query low power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 low-power true

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
 Power set                                 : On
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm

Allow the module to transition out of low power mode:

 # ethtool --set-module swp13 low-power off

Query low power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 low-power false

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7783 mW / -1.09 dBm
 Transmit avg optical power (Channel 2)    : 0.7806 mW / -1.08 dBm
 Transmit avg optical power (Channel 3)    : 0.7885 mW / -1.03 dBm
 Transmit avg optical power (Channel 4)    : 0.7985 mW / -0.98 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9124 mW / -0.40 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9071 mW / -0.42 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.8993 mW / -0.46 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8644 mW / -0.63 dBm

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ethtool-netlink.rst |  55 +++++-
 include/linux/ethtool.h                      |   9 +
 include/uapi/linux/ethtool_netlink.h         |  16 ++
 net/ethtool/Makefile                         |   2 +-
 net/ethtool/module.c                         | 184 +++++++++++++++++++
 net/ethtool/netlink.c                        |  19 ++
 net/ethtool/netlink.h                        |   4 +
 7 files changed, 286 insertions(+), 3 deletions(-)
 create mode 100644 net/ethtool/module.c

Comments

Andrew Lunn Aug. 9, 2021, 2:28 p.m. UTC | #1
On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
> 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
> modules parameters and retrieve their status.

Hi Ido

I've not read all the patchset yet, but i like the general direction.

> The first parameter to control is the low power mode of the module. It
> is only relevant for paged memory modules, as flat memory modules always
> operate in low power mode.
> 
> When a paged memory module is in low power mode, its power consumption
> is reduced to the minimum, the management interface towards the host is
> available and the data path is deactivated.
> 
> User space can choose to put modules that are not currently in use in
> low power mode and transition them to high power mode before putting the
> associated ports administratively up.
> 
> Transitioning into low power mode means loss of carrier, so error is
> returned when the netdev is administratively up.

However, i don't get this use case. With copper PHYs, putting the link
administratively down results in a call into phylib and into the
driver to down the link. This effectively puts the PHY into a low
power mode. The management interface, as defined by C22 and C45 remain
available, but the data path is disabled. For a 1G PHY, this can save
a few watts.

For SFPs managed by phylink and the kernal SFP driver, the exact same
happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus
still works, but the data path is disabled.

So i would expect a driver using firmware, not Linux code to manage
SFPs, to just do this on link down. Why do we need user space
involved?

    Andrew
Ido Schimmel Aug. 10, 2021, 7:26 a.m. UTC | #2
On Mon, Aug 09, 2021 at 04:28:32PM +0200, Andrew Lunn wrote:
> On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote:
> > From: Ido Schimmel <idosch@nvidia.com>
> > 
> > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
> > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
> > modules parameters and retrieve their status.
> 
> Hi Ido
> 
> I've not read all the patchset yet, but i like the general direction.
> 
> > The first parameter to control is the low power mode of the module. It
> > is only relevant for paged memory modules, as flat memory modules always
> > operate in low power mode.
> > 
> > When a paged memory module is in low power mode, its power consumption
> > is reduced to the minimum, the management interface towards the host is
> > available and the data path is deactivated.
> > 
> > User space can choose to put modules that are not currently in use in
> > low power mode and transition them to high power mode before putting the
> > associated ports administratively up.
> > 
> > Transitioning into low power mode means loss of carrier, so error is
> > returned when the netdev is administratively up.
> 
> However, i don't get this use case. With copper PHYs, putting the link
> administratively down results in a call into phylib and into the
> driver to down the link. This effectively puts the PHY into a low
> power mode. The management interface, as defined by C22 and C45 remain
> available, but the data path is disabled. For a 1G PHY, this can save
> a few watts.
> 
> For SFPs managed by phylink and the kernal SFP driver, the exact same
> happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus
> still works, but the data path is disabled.
> 
> So i would expect a driver using firmware, not Linux code to manage
> SFPs, to just do this on link down. Why do we need user space
> involved?

The transition from low power to high power can take a few seconds with
QSFP/QSFP-DD and it's likely to only get longer with future / more
complex modules. Therefore, to reduce link-up time, the firmware
automatically transitions modules to high power mode.

There is obviously a trade-off here between power consumption and
link-up time. My understanding is that Mellanox is not the only vendor
favoring shorter link-up times as users have the ability to control the
low power mode of the modules in other implementations.

Regarding "why do we need user space involved?", by default, it does not
need to be involved (the system works without this API), but if it wants
to reduce the power consumption by setting unused modules to low power
mode, then it will need to use this API.
Andrew Lunn Aug. 10, 2021, 1:52 p.m. UTC | #3
> The transition from low power to high power can take a few seconds with
> QSFP/QSFP-DD and it's likely to only get longer with future / more
> complex modules. Therefore, to reduce link-up time, the firmware
> automatically transitions modules to high power mode.
> 
> There is obviously a trade-off here between power consumption and
> link-up time. My understanding is that Mellanox is not the only vendor
> favoring shorter link-up times as users have the ability to control the
> low power mode of the modules in other implementations.
> 
> Regarding "why do we need user space involved?", by default, it does not
> need to be involved (the system works without this API), but if it wants
> to reduce the power consumption by setting unused modules to low power
> mode, then it will need to use this API.

O.K. Thanks for the better explanation. Some of this should go into
the commit message.

I suggest it gets a different name and semantics, to avoid
confusion. I think we should consider this the default power mode for
when the link is administratively down, rather than direct control
over the modules power mode. The driver should transition the module
to this setting on link down, be it high power or low power. That
saves a lot of complexity, since i assume you currently need a udev
script or something which sets it to low power mode on link down,
where as you can avoid this be configuring the default and let the
driver do it.

I also wonder if a hierarchy is needed? You can set the default for
the switch, and then override is per module? I _guess_ most users will
decide at a switch level they want to save power and pay the penalty
over longer link up times. But then we have the question, is it an
ethtool option, or a devlink parameter?

	Andrew
Jakub Kicinski Aug. 10, 2021, 1:59 p.m. UTC | #4
On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > The transition from low power to high power can take a few seconds with
> > QSFP/QSFP-DD and it's likely to only get longer with future / more
> > complex modules. Therefore, to reduce link-up time, the firmware
> > automatically transitions modules to high power mode.
> > 
> > There is obviously a trade-off here between power consumption and
> > link-up time. My understanding is that Mellanox is not the only vendor
> > favoring shorter link-up times as users have the ability to control the
> > low power mode of the modules in other implementations.
> > 
> > Regarding "why do we need user space involved?", by default, it does not
> > need to be involved (the system works without this API), but if it wants
> > to reduce the power consumption by setting unused modules to low power
> > mode, then it will need to use this API.  
> 
> O.K. Thanks for the better explanation. Some of this should go into
> the commit message.
> 
> I suggest it gets a different name and semantics, to avoid
> confusion. I think we should consider this the default power mode for
> when the link is administratively down, rather than direct control
> over the modules power mode. The driver should transition the module
> to this setting on link down, be it high power or low power. That
> saves a lot of complexity, since i assume you currently need a udev
> script or something which sets it to low power mode on link down,
> where as you can avoid this be configuring the default and let the
> driver do it.

Good point. And actually NICs have similar knobs, exposed via ethtool
priv flags today. Intel NICs for example. Maybe we should create a
"really power the port down policy" API?

Jake do you know what the use cases for Intel are? Are they SFP, MAC,
or NC-SI related?

> I also wonder if a hierarchy is needed? You can set the default for
> the switch, and then override is per module? I _guess_ most users will
> decide at a switch level they want to save power and pay the penalty
> over longer link up times. But then we have the question, is it an
> ethtool option, or a devlink parameter?
Ido Schimmel Aug. 10, 2021, 8:46 p.m. UTC | #5
On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > > The transition from low power to high power can take a few seconds with
> > > QSFP/QSFP-DD and it's likely to only get longer with future / more
> > > complex modules. Therefore, to reduce link-up time, the firmware
> > > automatically transitions modules to high power mode.
> > > 
> > > There is obviously a trade-off here between power consumption and
> > > link-up time. My understanding is that Mellanox is not the only vendor
> > > favoring shorter link-up times as users have the ability to control the
> > > low power mode of the modules in other implementations.
> > > 
> > > Regarding "why do we need user space involved?", by default, it does not
> > > need to be involved (the system works without this API), but if it wants
> > > to reduce the power consumption by setting unused modules to low power
> > > mode, then it will need to use this API.  
> > 
> > O.K. Thanks for the better explanation. Some of this should go into
> > the commit message.
> > 
> > I suggest it gets a different name and semantics, to avoid
> > confusion. I think we should consider this the default power mode for
> > when the link is administratively down, rather than direct control
> > over the modules power mode. The driver should transition the module
> > to this setting on link down, be it high power or low power. That
> > saves a lot of complexity, since i assume you currently need a udev
> > script or something which sets it to low power mode on link down,
> > where as you can avoid this be configuring the default and let the
> > driver do it.
> 
> Good point. And actually NICs have similar knobs, exposed via ethtool
> priv flags today. Intel NICs for example. Maybe we should create a
> "really power the port down policy" API?

See below about Intel. I'm not sure it's the same thing...

I'm against adding a vague "really power the port down policy" API. The
API proposed in the patch is well-defined, its implementation is
documented in standards, its implications are clear and we offer APIs
that give user space full observability into its operation.

A vague API means that it is going to be abused and user space will get
different results over different implementations. After reading the
*commit messages* about the private flags, I'm not sure what the flags
really do, what is their true motivation, implications or how do I get
observability into their operation. I'm not too hopeful about the user
documentation.

Also, like I mentioned in the cover letter, given the complexity of
these modules and as they become more common, it is likely that we will
need to extend the API to control more parameters and expose more
diagnostic information. I would really like to keep it clean and
contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
different APIs.

> 
> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> or NC-SI related?

I went through all the Intel drivers that implement these operations and
I believe you are talking about these commits:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

There isn't too much information about the motivation, but maybe it has
something to do with multi-host controllers where you want to prevent
one host from taking the physical link down for all the other hosts
sharing it? I remember such issues with mlx5.

> 
> > I also wonder if a hierarchy is needed? You can set the default for
> > the switch, and then override is per module? I _guess_ most users will
> > decide at a switch level they want to save power and pay the penalty
> > over longer link up times. But then we have the question, is it an
> > ethtool option, or a devlink parameter?
Jacob Keller Aug. 10, 2021, 9:38 p.m. UTC | #6
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, August 10, 2021 7:00 AM
> To: Andrew Lunn <andrew@lunn.ch>; Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org;
> davem@davemloft.net; mkubecek@suse.cz; pali@kernel.org;
> vadimp@nvidia.com; mlxsw@nvidia.com; Ido Schimmel <idosch@nvidia.com>
> Subject: Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver
> modules' low power mode
> 
> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > > The transition from low power to high power can take a few seconds with
> > > QSFP/QSFP-DD and it's likely to only get longer with future / more
> > > complex modules. Therefore, to reduce link-up time, the firmware
> > > automatically transitions modules to high power mode.
> > >
> > > There is obviously a trade-off here between power consumption and
> > > link-up time. My understanding is that Mellanox is not the only vendor
> > > favoring shorter link-up times as users have the ability to control the
> > > low power mode of the modules in other implementations.
> > >
> > > Regarding "why do we need user space involved?", by default, it does not
> > > need to be involved (the system works without this API), but if it wants
> > > to reduce the power consumption by setting unused modules to low power
> > > mode, then it will need to use this API.
> >
> > O.K. Thanks for the better explanation. Some of this should go into
> > the commit message.
> >
> > I suggest it gets a different name and semantics, to avoid
> > confusion. I think we should consider this the default power mode for
> > when the link is administratively down, rather than direct control
> > over the modules power mode. The driver should transition the module
> > to this setting on link down, be it high power or low power. That
> > saves a lot of complexity, since i assume you currently need a udev
> > script or something which sets it to low power mode on link down,
> > where as you can avoid this be configuring the default and let the
> > driver do it.
> 
> Good point. And actually NICs have similar knobs, exposed via ethtool
> priv flags today. Intel NICs for example. Maybe we should create a
> "really power the port down policy" API?
> 
> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> or NC-SI related?


Offhand I don't know. I think we have some requirements documents I can look up. I'll try to get back to you soon if I can find any further information. (Yes, I wish the commit messages gave stronger motivation too...)

Thanks,
Jake

> 
> > I also wonder if a hierarchy is needed? You can set the default for
> > the switch, and then override is per module? I _guess_ most users will
> > decide at a switch level they want to save power and pay the penalty
> > over longer link up times. But then we have the question, is it an
> > ethtool option, or a devlink parameter?
Jacob Keller Aug. 10, 2021, 10 p.m. UTC | #7
On 8/10/2021 1:46 PM, Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
>> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
>>>> The transition from low power to high power can take a few seconds with
>>>> QSFP/QSFP-DD and it's likely to only get longer with future / more
>>>> complex modules. Therefore, to reduce link-up time, the firmware
>>>> automatically transitions modules to high power mode.
>>>>
>>>> There is obviously a trade-off here between power consumption and
>>>> link-up time. My understanding is that Mellanox is not the only vendor
>>>> favoring shorter link-up times as users have the ability to control the
>>>> low power mode of the modules in other implementations.
>>>>
>>>> Regarding "why do we need user space involved?", by default, it does not
>>>> need to be involved (the system works without this API), but if it wants
>>>> to reduce the power consumption by setting unused modules to low power
>>>> mode, then it will need to use this API.  
>>>
>>> O.K. Thanks for the better explanation. Some of this should go into
>>> the commit message.
>>>
>>> I suggest it gets a different name and semantics, to avoid
>>> confusion. I think we should consider this the default power mode for
>>> when the link is administratively down, rather than direct control
>>> over the modules power mode. The driver should transition the module
>>> to this setting on link down, be it high power or low power. That
>>> saves a lot of complexity, since i assume you currently need a udev
>>> script or something which sets it to low power mode on link down,
>>> where as you can avoid this be configuring the default and let the
>>> driver do it.
>>
>> Good point. And actually NICs have similar knobs, exposed via ethtool
>> priv flags today. Intel NICs for example. Maybe we should create a
>> "really power the port down policy" API?
> 
> See below about Intel. I'm not sure it's the same thing...
> 
> I'm against adding a vague "really power the port down policy" API. The
> API proposed in the patch is well-defined, its implementation is
> documented in standards, its implications are clear and we offer APIs
> that give user space full observability into its operation.
> 
> A vague API means that it is going to be abused and user space will get
> different results over different implementations. After reading the
> *commit messages* about the private flags, I'm not sure what the flags
> really do, what is their true motivation, implications or how do I get
> observability into their operation. I'm not too hopeful about the user
> documentation.
> 
> Also, like I mentioned in the cover letter, given the complexity of
> these modules and as they become more common, it is likely that we will
> need to extend the API to control more parameters and expose more
> diagnostic information. I would really like to keep it clean and
> contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
> different APIs.
> 
>>
>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>> or NC-SI related?
> 
> I went through all the Intel drivers that implement these operations and
> I believe you are talking about these commits:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
> 
> There isn't too much information about the motivation, but maybe it has
> something to do with multi-host controllers where you want to prevent
> one host from taking the physical link down for all the other hosts
> sharing it? I remember such issues with mlx5.
> 

Ok, I found some more information here. The primary motivation of the
changes in the i40e and ice drivers is from customer requests asking to
have the link go down when the port is administratively disabled. This
is because if the link is down then the switch on the other side will
see the port not having link and will stop trying to send traffic to it.

As far as I can tell, the reason its a flag is because some users wanted
the behavior the other way.

I'm not sure it's really related to the behavior here.

For what it's worth, I'm in favor of containing things like this into
ethtool as well.

Thanks,
Jake
Jakub Kicinski Aug. 10, 2021, 10:05 p.m. UTC | #8
On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:  
> > > O.K. Thanks for the better explanation. Some of this should go into
> > > the commit message.
> > > 
> > > I suggest it gets a different name and semantics, to avoid
> > > confusion. I think we should consider this the default power mode for
> > > when the link is administratively down, rather than direct control
> > > over the modules power mode. The driver should transition the module
> > > to this setting on link down, be it high power or low power. That
> > > saves a lot of complexity, since i assume you currently need a udev
> > > script or something which sets it to low power mode on link down,
> > > where as you can avoid this be configuring the default and let the
> > > driver do it.  
> > 
> > Good point. And actually NICs have similar knobs, exposed via ethtool
> > priv flags today. Intel NICs for example. Maybe we should create a
> > "really power the port down policy" API?  
> 
> See below about Intel. I'm not sure it's the same thing...
> 
> I'm against adding a vague "really power the port down policy" API. The
> API proposed in the patch is well-defined, its implementation is
> documented in standards, its implications are clear and we offer APIs
> that give user space full observability into its operation.
> 
> A vague API means that it is going to be abused and user space will get
> different results over different implementations. After reading the
> *commit messages* about the private flags, I'm not sure what the flags
> really do, what is their true motivation, implications or how do I get
> observability into their operation. I'm not too hopeful about the user
> documentation.
> 
> Also, like I mentioned in the cover letter, given the complexity of
> these modules and as they become more common, it is likely that we will
> need to extend the API to control more parameters and expose more
> diagnostic information. I would really like to keep it clean and
> contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
> different APIs.

The patch is well defined but it doesn't provide user with the answer
to the question "why is the SFP still up if I asked it to be down?"
It's good to match specs closely but Linux may need to reconcile
multiple policies.

IIUC if Intel decides to keep the SFP up for "other" reasons the
situation will look like this:

 $ ethtool --show-module eth0
 Module parameters for eth0:
 low-power true

 # ethtool -m eth0
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off


IOW the low-power mode is a way for user to express preference to
shut down/keep up the SFP, but it's not necessarily going to be
the only "policy" that matters. If other policies (e.g. NC-SI) express
preference to keep the interface up it will stay up, right?

The LowPwrRequestSW is not directly controlled by low-power && IF_UP.

What I had in mind was something along the lines of a bitmap of reasons
which are allowed to keep the interface up, and for your use case
the reason would be something like SFP_ALWAYS_ON, but other reasons
(well defined) may also keep the ifc up.

That's just to explain what I meant, if it's going to be clear to
everyone that low-power != LowPwrRequestSW I'm fine either way.
Jakub Kicinski Aug. 10, 2021, 10:06 p.m. UTC | #9
On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
> >> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> >> or NC-SI related?  
> > 
> > I went through all the Intel drivers that implement these operations and
> > I believe you are talking about these commits:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
> > 
> > There isn't too much information about the motivation, but maybe it has
> > something to do with multi-host controllers where you want to prevent
> > one host from taking the physical link down for all the other hosts
> > sharing it? I remember such issues with mlx5.
> >   
> 
> Ok, I found some more information here. The primary motivation of the
> changes in the i40e and ice drivers is from customer requests asking to
> have the link go down when the port is administratively disabled. This
> is because if the link is down then the switch on the other side will
> see the port not having link and will stop trying to send traffic to it.
> 
> As far as I can tell, the reason its a flag is because some users wanted
> the behavior the other way.
> 
> I'm not sure it's really related to the behavior here.
> 
> For what it's worth, I'm in favor of containing things like this into
> ethtool as well.

I think the question was the inverse - why not always shut down the
port if the interface is brought down?
Jacob Keller Aug. 10, 2021, 10:18 p.m. UTC | #10
On 8/10/2021 3:06 PM, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>>>> or NC-SI related?  
>>>
>>> I went through all the Intel drivers that implement these operations and
>>> I believe you are talking about these commits:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
>>>
>>> There isn't too much information about the motivation, but maybe it has
>>> something to do with multi-host controllers where you want to prevent
>>> one host from taking the physical link down for all the other hosts
>>> sharing it? I remember such issues with mlx5.
>>>   
>>
>> Ok, I found some more information here. The primary motivation of the
>> changes in the i40e and ice drivers is from customer requests asking to
>> have the link go down when the port is administratively disabled. This
>> is because if the link is down then the switch on the other side will
>> see the port not having link and will stop trying to send traffic to it.
>>
>> As far as I can tell, the reason its a flag is because some users wanted
>> the behavior the other way.
>>
>> I'm not sure it's really related to the behavior here.
>>
>> For what it's worth, I'm in favor of containing things like this into
>> ethtool as well.
> 
> I think the question was the inverse - why not always shut down the
> port if the interface is brought down?
> 

That... is a better question yes. Unfortunately so far I haven't found
any argument for not doing this. Only a bit about many requests to have
this behavior. It might just be inertia to maintain current behavior by
default...
Jacob Keller Aug. 10, 2021, 10:24 p.m. UTC | #11
On 8/10/2021 3:06 PM, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>>>> or NC-SI related?  
>>>
>>> I went through all the Intel drivers that implement these operations and
>>> I believe you are talking about these commits:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
>>>
>>> There isn't too much information about the motivation, but maybe it has
>>> something to do with multi-host controllers where you want to prevent
>>> one host from taking the physical link down for all the other hosts
>>> sharing it? I remember such issues with mlx5.
>>>   
>>
>> Ok, I found some more information here. The primary motivation of the
>> changes in the i40e and ice drivers is from customer requests asking to
>> have the link go down when the port is administratively disabled. This
>> is because if the link is down then the switch on the other side will
>> see the port not having link and will stop trying to send traffic to it.
>>
>> As far as I can tell, the reason its a flag is because some users wanted
>> the behavior the other way.
>>
>> I'm not sure it's really related to the behavior here.
>>
>> For what it's worth, I'm in favor of containing things like this into
>> ethtool as well.
> 
> I think the question was the inverse - why not always shut down the
> port if the interface is brought down?
> 

So far the best I've found after digging is that forcing total shutdown
causes manageability and VF traffic to stop. Other customers want this
traffic to continue even when the PF port is brought down.

Thanks,
Jake
Andrew Lunn Aug. 10, 2021, 10:31 p.m. UTC | #12
On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
> > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> > >> or NC-SI related?  
> > > 
> > > I went through all the Intel drivers that implement these operations and
> > > I believe you are talking about these commits:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
> > > 
> > > There isn't too much information about the motivation, but maybe it has
> > > something to do with multi-host controllers where you want to prevent
> > > one host from taking the physical link down for all the other hosts
> > > sharing it? I remember such issues with mlx5.
> > >   
> > 
> > Ok, I found some more information here. The primary motivation of the
> > changes in the i40e and ice drivers is from customer requests asking to
> > have the link go down when the port is administratively disabled. This
> > is because if the link is down then the switch on the other side will
> > see the port not having link and will stop trying to send traffic to it.
> > 
> > As far as I can tell, the reason its a flag is because some users wanted
> > the behavior the other way.
> > 
> > I'm not sure it's really related to the behavior here.
> 
> I think the question was the inverse - why not always shut down the
> port if the interface is brought down?

Humm. Something does not seem right here. I would assume that when you
administratively configure the link down, the SERDES in the MAC would
stop sending anything. So the module has nothing to send. The link
peer SERDES then looses sync, and reports that upwards as carrier
lost.

Does the i40e and ice leave its SERDES running when the link is
configured down? Or is the switch FUBAR and does not consider SERDES
loss of sync as carrier down?

Ido's use case does seem to be different. The link is down. Do we want
to leave the module active, probably sending a bit stream of all 0,
maybe noise, or do we want to power the module down, so it does not
send anything at all.

     Andrew
Andrew Lunn Aug. 10, 2021, 10:51 p.m. UTC | #13
> The patch is well defined but it doesn't provide user with the answer
> to the question "why is the SFP still up if I asked it to be down?"

Can the user even see that the SFP is still up? Does ip link show
give:

3: eth42: <BROADCAST,MULTICAST,LOWER_UP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

i.e. LOWER_UP despite DOWN?

Roopa added:

    rtnetlink: add support for protodown reason

Maybe we need the opposite as well?

      Andrew
Jacob Keller Aug. 11, 2021, 12:38 a.m. UTC | #14
On 8/10/2021 3:31 PM, Andrew Lunn wrote:
> On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote:
>> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
>>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>>>>> or NC-SI related?  
>>>>
>>>> I went through all the Intel drivers that implement these operations and
>>>> I believe you are talking about these commits:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
>>>>
>>>> There isn't too much information about the motivation, but maybe it has
>>>> something to do with multi-host controllers where you want to prevent
>>>> one host from taking the physical link down for all the other hosts
>>>> sharing it? I remember such issues with mlx5.
>>>>   
>>>
>>> Ok, I found some more information here. The primary motivation of the
>>> changes in the i40e and ice drivers is from customer requests asking to
>>> have the link go down when the port is administratively disabled. This
>>> is because if the link is down then the switch on the other side will
>>> see the port not having link and will stop trying to send traffic to it.
>>>
>>> As far as I can tell, the reason its a flag is because some users wanted
>>> the behavior the other way.
>>>
>>> I'm not sure it's really related to the behavior here.
>>
>> I think the question was the inverse - why not always shut down the
>> port if the interface is brought down?
> 
> Humm. Something does not seem right here. I would assume that when you
> administratively configure the link down, the SERDES in the MAC would
> stop sending anything. So the module has nothing to send. The link
> peer SERDES then looses sync, and reports that upwards as carrier
> lost.
> 

Right....

> Does the i40e and ice leave its SERDES running when the link is
> configured down? Or is the switch FUBAR and does not consider SERDES
> loss of sync as carrier down?
>


It's not clear to me. I tried to test with the driver, and it looks like
upstream doesn't yet have the link-down-on-close merged into net-next,
so I grabbed our out-of-tree driver.

Interestingly, both ip link show and ethtool do not report link as up
when the device is closed (ip link set enp175f0s0 down).....

So... whatever difference link-down-on-close makes we're definitely not
reporting things up.


I don't have a setup to confirm anything else right now unfortunately,
but I suspect something is wrong with the implementation of
link-down-on-close (at the very least it seems like we should still be
reporting LOWER_UP.... no?)

Unless there's some other weirdness like with QSFP or other
multi-port-single-cable setups?

I even tried adding some VFs and I see that regardless of whether
link-down-on-close is set, I can see link up on the VF....

Hmmmmm.

> Ido's use case does seem to be different. The link is down. Do we want
> to leave the module active, probably sending a bit stream of all 0,
> maybe noise, or do we want to power the module down, so it does not
> send anything at all.
> 
>      Andrew
> 

Right, I think these two cases are very different.
Ido Schimmel Aug. 11, 2021, 11:33 a.m. UTC | #15
On Tue, Aug 10, 2021 at 03:05:44PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote:
> > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:  
> > > > O.K. Thanks for the better explanation. Some of this should go into
> > > > the commit message.
> > > > 
> > > > I suggest it gets a different name and semantics, to avoid
> > > > confusion. I think we should consider this the default power mode for
> > > > when the link is administratively down, rather than direct control
> > > > over the modules power mode. The driver should transition the module
> > > > to this setting on link down, be it high power or low power. That
> > > > saves a lot of complexity, since i assume you currently need a udev
> > > > script or something which sets it to low power mode on link down,
> > > > where as you can avoid this be configuring the default and let the
> > > > driver do it.  
> > > 
> > > Good point. And actually NICs have similar knobs, exposed via ethtool
> > > priv flags today. Intel NICs for example. Maybe we should create a
> > > "really power the port down policy" API?  
> > 
> > See below about Intel. I'm not sure it's the same thing...
> > 
> > I'm against adding a vague "really power the port down policy" API. The
> > API proposed in the patch is well-defined, its implementation is
> > documented in standards, its implications are clear and we offer APIs
> > that give user space full observability into its operation.
> > 
> > A vague API means that it is going to be abused and user space will get
> > different results over different implementations. After reading the
> > *commit messages* about the private flags, I'm not sure what the flags
> > really do, what is their true motivation, implications or how do I get
> > observability into their operation. I'm not too hopeful about the user
> > documentation.
> > 
> > Also, like I mentioned in the cover letter, given the complexity of
> > these modules and as they become more common, it is likely that we will
> > need to extend the API to control more parameters and expose more
> > diagnostic information. I would really like to keep it clean and
> > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
> > different APIs.
> 
> The patch is well defined but it doesn't provide user with the answer
> to the question "why is the SFP still up if I asked it to be down?"

But you didn't ask the module to be "down", you asked the MAC. See more
below.

> It's good to match specs closely but Linux may need to reconcile
> multiple policies.
> 
> IIUC if Intel decides to keep the SFP up for "other" reasons the
> situation will look like this:

Intel did not decide to keep the module "up", it decided to keep both
the MAC and the module up. If one of them was down, the peer would
notice it, but it didn't (according to Jake's reply). This is a very
problematic behavior as you are telling your peer that everything is
fine, but in practice you are dropping all of its packets. I hit the
exact same issue with mlx5 a few years ago and when I investigated the
reason for this behavior I was referred to multi-host systems where you
don't want one host to take down the shared link for all the rest. See:

https://www.mellanox.com/sites/default/files/doc-2020/sb-externally-connected-multi-host.pdf

> 
>  $ ethtool --show-module eth0
>  Module parameters for eth0:
>  low-power true
> 
>  # ethtool -m eth0
>  Module State                              : 0x03 (ModuleReady)
>  LowPwrAllowRequestHW                      : Off
>  LowPwrRequestSW                           : Off

This output is wrong. In Intel's case "ip link show" will report the
link as down (according to Jake's reply) despite the MAC being up. On
the other hand, the output of "ethtool --show-module eth0" will show
that the module is not in low power mode and it will be correct.

> 
> 
> IOW the low-power mode is a way for user to express preference to
> shut down/keep up the SFP,

Yes, it controls the module, not the MAC. If you want to get a carrier,
both the module and the MAC need to be operational. See following
example where swp13 and swp14 are connected to each other:

$ ip link show dev swp13
127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

# ip link set dev swp13 down

# ethtool --set-module swp13 low-power on

$ ethtool --show-module swp13
Module parameters for swp13:
low-power true

# ip link set dev swp13 up

$ ip link show dev swp13
127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

# ip link set dev swp13 down

# ethtool --set-module swp13 low-power off

$ ethtool --show-module swp13
Module parameters for swp13:
low-power false

# ip link set dev swp13 up

$ ip link show dev swp13
127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

> but it's not necessarily going to be the only "policy" that matters.
> If other policies (e.g. NC-SI) express preference to keep the
> interface up it will stay up, right?
> 
> The LowPwrRequestSW is not directly controlled by low-power && IF_UP.
> 
> What I had in mind was something along the lines of a bitmap of reasons
> which are allowed to keep the interface up, and for your use case
> the reason would be something like SFP_ALWAYS_ON, but other reasons
> (well defined) may also keep the ifc up.
> 
> That's just to explain what I meant, if it's going to be clear to
> everyone that low-power != LowPwrRequestSW I'm fine either way.

I think we mixed two different use cases here. The first is a way to
make sure the link is fully operational (both MAC and module). In this
case, contrary to the expected behavior, "ip link set dev eth0 down"
will not take the MAC down and the peer will not lose its carrier. This
is most likely motivated by special hardware designs or exotic use cases
like I mentioned above.

The use case for this patch is completely different. Today, when you do
"ip link set dev eth0 up" you expect to gain a carrier which means that
both the MAC and the module are operational. The latter can be made
operational following the user request (e.g., SFP driver) or as soon as
it was plugged-in, by the device's firmware.

When you do "ip link set dev eth0 down" you expect the reverse to happen
and this is what happens today. In implementations where the module is
always operational, it stays in high power mode and continues to get
warm and consume unnecessary amount of power.

Some users might not be OK with that and would like more control, which
is exactly what this patch is doing. This patch does not change existing
behavior, the API has clear semantics and implications and user space
has full observability into its operation.

If in the future someone sees the need to add 'protoup', it can be done:

# ip link set dev eth0 up  # MAC and module are operational
# ip link set dev eth0 protoup on
# ip link set dev eth0 down # returns an error / ignore
# ethtool --set-module eth0 low-power on # returns an error because of admin up

You can obviously engineer situations that do not make any sense. Like
this:

# ethtool --set-module eth0 low-power on
# ip link set dev eth0 up
# ip link set dev eth0 protoup on

The MAC is operational and you can't take it down, but you will never
get a carrier because you explicitly asked the module to stay in low
power mode.
Jakub Kicinski Aug. 11, 2021, 1:03 p.m. UTC | #16
On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> # ethtool --set-module swp13 low-power on
> 
> $ ethtool --show-module swp13
> Module parameters for swp13:
> low-power true
> 
> # ip link set dev swp13 up
> 
> $ ip link show dev swp13
> 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
>     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff
> 
> $ ip link show dev swp14
> 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
>     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

Oh, so if we set low-power true the carrier will never show up?
I thought Andrew suggested the setting is only taken into account 
when netdev is down. That made so much sense to me I assumed we'll
just go with that. I must have misunderstood.
Andrew Lunn Aug. 11, 2021, 2:36 p.m. UTC | #17
On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> > # ethtool --set-module swp13 low-power on
> > 
> > $ ethtool --show-module swp13
> > Module parameters for swp13:
> > low-power true
> > 
> > # ip link set dev swp13 up
> > 
> > $ ip link show dev swp13
> > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff
> > 
> > $ ip link show dev swp14
> > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff
> 
> Oh, so if we set low-power true the carrier will never show up?
> I thought Andrew suggested the setting is only taken into account 
> when netdev is down.

Yes, that was my intention. If this low power mode also applies when
the interface is admin up, it sounds like a foot gun. ip link show
gives you no idea why the carrier is down, and people will assume the
cable or peer is broken. We at least need a new flag, LOWER_DISABLED
or similar to give the poor user some chance to figure out what is
going on.

To me, this setting should only apply when the link is admin down.

	Andrew
Ido Schimmel Aug. 11, 2021, 7:37 p.m. UTC | #18
On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote:
> On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:
> > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> > > # ethtool --set-module swp13 low-power on
> > > 
> > > $ ethtool --show-module swp13
> > > Module parameters for swp13:
> > > low-power true
> > > 
> > > # ip link set dev swp13 up
> > > 
> > > $ ip link show dev swp13
> > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> > >     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff
> > > 
> > > $ ip link show dev swp14
> > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> > >     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff
> > 
> > Oh, so if we set low-power true the carrier will never show up?
> > I thought Andrew suggested the setting is only taken into account 
> > when netdev is down.
> 
> Yes, that was my intention. If this low power mode also applies when
> the interface is admin up, it sounds like a foot gun. ip link show
> gives you no idea why the carrier is down, and people will assume the
> cable or peer is broken. We at least need a new flag, LOWER_DISABLED
> or similar to give the poor user some chance to figure out what is
> going on.

The canonical way to report such errors is via LINKSTATE_GET and I will
add an extended sub-state describing the problem.

> 
> To me, this setting should only apply when the link is admin down.

I don't want to bake such an assumption into the kernel, but I have a
suggestion that resolves the issue.

We currently have a single attribute that encodes the desired state on
SET messages and the operational state on GET_REPLY messages
(ETHTOOL_A_MODULE_LOW_POWER_ENABLED):

$ ethtool --show-module swp11
Module parameters for swp11:
low-power true

It is not defined very well when a module is not connected despite being
a very interesting use case. We really need to have two attributes. The
first one describing the power mode policy and the second one describing
the operational power mode which is only reported when a module is
plugged in.

For the policy we can have these values:

1. low: Always transition the module to low power mode
2. high: Always transition the module to high power mode
3. high-on-up: Transition the module to high power mode when a port
using it is administratively up. Otherwise, low

A different policy for up/down seems like an overkill for me.

See example usage below.

No module connected:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high

Like I mentioned before, this is the default on Mellanox systems so this
new attribute allows user space to query the default policy.

Change to a different policy:

# ethtool --set-module swp11 power-mode-policy high-on-up

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up

After a module was connected:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up
power-mode low

# ip link set dev swp11 up

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up
low-power high

# ip link set dev swp11 down

# ethtool --set-module swp11 power-mode-policy low

# ip link set dev swp11 up

$ ethtool swp11
...
Link detected: no (Cable issue, Module is in low power mode)

I'm quite happy with the above. Might change a few things as I implement
it, but you get the gist. WDYT?
Jakub Kicinski Aug. 11, 2021, 8:30 p.m. UTC | #19
On Wed, 11 Aug 2021 22:37:53 +0300 Ido Schimmel wrote:
> On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote:
> > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:  
> > > Oh, so if we set low-power true the carrier will never show up?
> > > I thought Andrew suggested the setting is only taken into account 
> > > when netdev is down.  
> > 
> > Yes, that was my intention. If this low power mode also applies when
> > the interface is admin up, it sounds like a foot gun. ip link show
> > gives you no idea why the carrier is down, and people will assume the
> > cable or peer is broken. We at least need a new flag, LOWER_DISABLED
> > or similar to give the poor user some chance to figure out what is
> > going on.  
> 
> The canonical way to report such errors is via LINKSTATE_GET and I will
> add an extended sub-state describing the problem.
>  
> > To me, this setting should only apply when the link is admin down.  
> 
> I don't want to bake such an assumption into the kernel, but I have a
> suggestion that resolves the issue.
> 
> We currently have a single attribute that encodes the desired state on
> SET messages and the operational state on GET_REPLY messages
> (ETHTOOL_A_MODULE_LOW_POWER_ENABLED):
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> low-power true
> 
> It is not defined very well when a module is not connected despite being
> a very interesting use case. We really need to have two attributes. The
> first one describing the power mode policy and the second one describing
> the operational power mode which is only reported when a module is
> plugged in.
> 
> For the policy we can have these values:
> 
> 1. low: Always transition the module to low power mode
> 2. high: Always transition the module to high power mode
> 3. high-on-up: Transition the module to high power mode when a port
> using it is administratively up. Otherwise, low
> 
> A different policy for up/down seems like an overkill for me.
> 
> See example usage below.
> 
> No module connected:
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high
> 
> Like I mentioned before, this is the default on Mellanox systems so this
> new attribute allows user space to query the default policy.
> 
> Change to a different policy:
> 
> # ethtool --set-module swp11 power-mode-policy high-on-up
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> 
> After a module was connected:
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> power-mode low
> 
> # ip link set dev swp11 up
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> low-power high
> 
> # ip link set dev swp11 down
> 
> # ethtool --set-module swp11 power-mode-policy low
> 
> # ip link set dev swp11 up
> 
> $ ethtool swp11
> ...
> Link detected: no (Cable issue, Module is in low power mode)
> 
> I'm quite happy with the above. Might change a few things as I implement
> it, but you get the gist. WDYT?

Isn't the "low-power" attr just duplicating the relevant bits from -m?
Andrew Lunn Aug. 11, 2021, 8:42 p.m. UTC | #20
> For the policy we can have these values:
> 
> 1. low: Always transition the module to low power mode
> 2. high: Always transition the module to high power mode
> 3. high-on-up: Transition the module to high power mode when a port
> using it is administratively up. Otherwise, low
> 
> A different policy for up/down seems like an overkill for me.

O.K. The current kernel SFP driver is going to default to high-on-up,
which is what kernel driven copper PHYs also do.

> After a module was connected:
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> power-mode low
> 
> # ip link set dev swp11 up
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> low-power high
> 
> # ip link set dev swp11 down

You missed a show here. I expect it to be:

> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> power-mode low

since it is now down.

I suppose we should consider the bigger picture. Is this feature
limited to just SFP modules, or does it make sense to any other sort
of network technology? CAN, bluetooth, 5G, IPoAC?

   Andrew
Andrew Lunn Aug. 11, 2021, 8:57 p.m. UTC | #21
> Isn't the "low-power" attr just duplicating the relevant bits from -m?

Do all SFPs report it in the dump? I'm thinking GPON, 1G modules with
a TX_ENABLE pin? INF-8074 does not specify a bit in the 'EEPROM' data
to indicate the status. So you need to know the state of the GPIO
driving the TX_ENABLE pin.

   Andrew
Ido Schimmel Aug. 11, 2021, 9:04 p.m. UTC | #22
On Wed, Aug 11, 2021 at 01:30:06PM -0700, Jakub Kicinski wrote:
> Isn't the "low-power" attr just duplicating the relevant bits from -m?

If low power is forced by hardware, then it depends on the assertion /
de-assertion of the hardware signal which is obviously not part of the
EEPROM dump.

I knew this would come up, so I mentioned it in the commit message:

"The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space."
diff mbox series

Patch

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index c86628e6a235..07eac5bc9cfc 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -179,7 +179,7 @@  according to message purpose:
 
 Userspace to kernel:
 
-  ===================================== ================================
+  ===================================== =================================
   ``ETHTOOL_MSG_STRSET_GET``            get string set
   ``ETHTOOL_MSG_LINKINFO_GET``          get link settings
   ``ETHTOOL_MSG_LINKINFO_SET``          set link settings
@@ -213,7 +213,9 @@  Userspace to kernel:
   ``ETHTOOL_MSG_MODULE_EEPROM_GET``     read SFP module EEPROM
   ``ETHTOOL_MSG_STATS_GET``             get standard statistics
   ``ETHTOOL_MSG_PHC_VCLOCKS_GET``       get PHC virtual clocks info
-  ===================================== ================================
+  ``ETHTOOL_MSG_MODULE_SET``            set transceiver module parameters
+  ``ETHTOOL_MSG_MODULE_GET``            get transceiver module parameters
+  ===================================== =================================
 
 Kernel to userspace:
 
@@ -252,6 +254,7 @@  Kernel to userspace:
   ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY``  read SFP module EEPROM
   ``ETHTOOL_MSG_STATS_GET_REPLY``          standard statistics
   ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY``    PHC virtual clocks info
+  ``ETHTOOL_MSG_MODULE_GET_REPLY``         transceiver module parameters
   ======================================== =================================
 
 ``GET`` requests are sent by userspace applications to retrieve device
@@ -1498,6 +1501,52 @@  Kernel response contents:
   ``ETHTOOL_A_PHC_VCLOCKS_INDEX``       s32     PHC index array
   ====================================  ======  ==========================
 
+MODULE_GET
+==========
+
+Gets transceiver module parameters.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``            nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  reply header
+  ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED``  bool    low power mode is enabled
+  ======================================  ======  ==========================
+
+The optional ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` attribute encodes whether
+low power mode is forced by the host.
+
+MODULE_SET
+==========
+
+Sets transceiver module parameters.
+
+Request contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  request header
+  ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED``  bool    low power mode is enabled
+  ======================================  ======  ==========================
+
+When set, the optional ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` attribute is used
+to force the module to stay in low power mode or force it back into low power
+mode. When cleared, capable modules can transition to high power mode.
+
+To avoid changes to the operational state of the device, low power mode can
+only be set when the device is administratively down.
+
+For SFF-8636 modules, low power mode is forced by the host according to table
+6-10 in revision 2.10a of the specification.
+
+For CMIS modules, low power mode is forced by the host according to table 6-12
+in revision 5.0 of the specification.
+
 Request translation
 ===================
 
@@ -1597,4 +1646,6 @@  are netlink only.
   n/a                                 ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``
   n/a                                 ``ETHTOOL_MSG_TUNNEL_INFO_GET``
   n/a                                 ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_GET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_SET``
   =================================== =====================================
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 4711b96dae0c..04286debdcdc 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -570,6 +570,10 @@  struct ethtool_module_eeprom {
  * @get_eth_ctrl_stats: Query some of the IEEE 802.3 MAC Ctrl statistics.
  * @get_rmon_stats: Query some of the RMON (RFC 2819) statistics.
  *	Set %ranges to a pointer to zero-terminated array of byte ranges.
+ * @get_module_low_power: Get the low power mode status of the plug-in module
+ *	used by the network device.
+ * @set_module_low_power: Set the low power mode status of the plug-in module
+ *	used by the network device.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -689,6 +693,11 @@  struct ethtool_ops {
 	void	(*get_rmon_stats)(struct net_device *dev,
 				  struct ethtool_rmon_stats *rmon_stats,
 				  const struct ethtool_rmon_hist_range **ranges);
+	int	(*get_module_low_power)(struct net_device *dev,
+					bool *p_low_power,
+					struct netlink_ext_ack *extack);
+	int	(*set_module_low_power)(struct net_device *dev, bool low_power,
+					struct netlink_ext_ack *extack);
 };
 
 int ethtool_check_ops(const struct ethtool_ops *ops);
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index b3b93710eff7..72fb821f3928 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -47,6 +47,8 @@  enum {
 	ETHTOOL_MSG_MODULE_EEPROM_GET,
 	ETHTOOL_MSG_STATS_GET,
 	ETHTOOL_MSG_PHC_VCLOCKS_GET,
+	ETHTOOL_MSG_MODULE_GET,
+	ETHTOOL_MSG_MODULE_SET,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_USER_CNT,
@@ -90,6 +92,8 @@  enum {
 	ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY,
 	ETHTOOL_MSG_STATS_GET_REPLY,
 	ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY,
+	ETHTOOL_MSG_MODULE_GET_REPLY,
+	ETHTOOL_MSG_MODULE_NTF,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_KERNEL_CNT,
@@ -831,6 +835,18 @@  enum {
 	ETHTOOL_A_STATS_RMON_MAX = (__ETHTOOL_A_STATS_RMON_CNT - 1)
 };
 
+/* MODULE */
+
+enum {
+	ETHTOOL_A_MODULE_UNSPEC,
+	ETHTOOL_A_MODULE_HEADER,		/* nest - _A_HEADER_* */
+	ETHTOOL_A_MODULE_LOW_POWER_ENABLED,	/* u8 */
+
+	/* add new constants above here */
+	__ETHTOOL_A_MODULE_CNT,
+	ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1)
+};
+
 /* generic netlink info */
 #define ETHTOOL_GENL_NAME "ethtool"
 #define ETHTOOL_GENL_VERSION 1
diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile
index 0a19470efbfb..b76432e70e6b 100644
--- a/net/ethtool/Makefile
+++ b/net/ethtool/Makefile
@@ -7,4 +7,4 @@  obj-$(CONFIG_ETHTOOL_NETLINK)	+= ethtool_nl.o
 ethtool_nl-y	:= netlink.o bitset.o strset.o linkinfo.o linkmodes.o \
 		   linkstate.o debug.o wol.o features.o privflags.o rings.o \
 		   channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \
-		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o
+		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o
diff --git a/net/ethtool/module.c b/net/ethtool/module.c
new file mode 100644
index 000000000000..947f2188d725
--- /dev/null
+++ b/net/ethtool/module.c
@@ -0,0 +1,184 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/ethtool.h>
+
+#include "netlink.h"
+#include "common.h"
+#include "bitset.h"
+
+struct module_req_info {
+	struct ethnl_req_info		base;
+};
+
+struct module_reply_data {
+	struct ethnl_reply_data		base;
+	u8 low_power:1,
+	   low_power_valid:1;
+};
+
+#define MODULE_REPDATA(__reply_base) \
+	container_of(__reply_base, struct module_reply_data, base)
+
+/* MODULE_GET */
+
+const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1] = {
+	[ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+};
+
+static int module_get_low_power(struct net_device *dev,
+				struct module_reply_data *data,
+				struct netlink_ext_ack *extack)
+{
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	bool low_power;
+	int ret;
+
+	if (!ops->get_module_low_power)
+		return 0;
+
+	ret = ops->get_module_low_power(dev, &low_power, extack);
+	if (ret < 0)
+		return ret;
+
+	data->low_power = low_power;
+	data->low_power_valid = true;
+
+	return 0;
+}
+
+static int module_prepare_data(const struct ethnl_req_info *req_base,
+			       struct ethnl_reply_data *reply_base,
+			       struct genl_info *info)
+{
+	struct module_reply_data *data = MODULE_REPDATA(reply_base);
+	struct netlink_ext_ack *extack = info ? info->extack : NULL;
+	struct net_device *dev = reply_base->dev;
+	int ret;
+
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		return ret;
+
+	ret = module_get_low_power(dev, data, extack);
+	if (ret < 0)
+		goto out_complete;
+
+out_complete:
+	ethnl_ops_complete(dev);
+	return ret;
+}
+
+static int module_reply_size(const struct ethnl_req_info *req_base,
+			     const struct ethnl_reply_data *reply_base)
+{
+	struct module_reply_data *data = MODULE_REPDATA(reply_base);
+	int len = 0;
+
+	if (data->low_power_valid)
+		len += nla_total_size(sizeof(u8)); /* _MODULE_LOW_POWER_ENABLED */
+
+	return len;
+}
+
+static int module_fill_reply(struct sk_buff *skb,
+			     const struct ethnl_req_info *req_base,
+			     const struct ethnl_reply_data *reply_base)
+{
+	const struct module_reply_data *data = MODULE_REPDATA(reply_base);
+
+	if (data->low_power_valid &&
+	    nla_put_u8(skb, ETHTOOL_A_MODULE_LOW_POWER_ENABLED,
+		       data->low_power))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+const struct ethnl_request_ops ethnl_module_request_ops = {
+	.request_cmd		= ETHTOOL_MSG_MODULE_GET,
+	.reply_cmd		= ETHTOOL_MSG_MODULE_GET_REPLY,
+	.hdr_attr		= ETHTOOL_A_MODULE_HEADER,
+	.req_info_size		= sizeof(struct module_req_info),
+	.reply_data_size	= sizeof(struct module_reply_data),
+
+	.prepare_data		= module_prepare_data,
+	.reply_size		= module_reply_size,
+	.fill_reply		= module_fill_reply,
+};
+
+/* MODULE_SET */
+
+const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1] = {
+	[ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+	[ETHTOOL_A_MODULE_LOW_POWER_ENABLED] = NLA_POLICY_MAX(NLA_U8, 1),
+};
+
+static int module_set_low_power(struct net_device *dev, struct nlattr **tb,
+				bool *p_mod, struct netlink_ext_ack *extack)
+{
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	bool low_power_new, low_power;
+	int ret;
+
+	if (!tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED])
+		return 0;
+
+	if (!ops->get_module_low_power || !ops->set_module_low_power) {
+		NL_SET_ERR_MSG_ATTR(extack,
+				    tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED],
+				    "Setting low power mode is not supported by this device");
+		return -EOPNOTSUPP;
+	}
+
+	if (netif_running(dev)) {
+		NL_SET_ERR_MSG_ATTR(extack,
+				    tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED],
+				    "Cannot set low power mode when port is administratively up");
+		return -EINVAL;
+	}
+
+	low_power_new = !!nla_get_u8(tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED]);
+	ret = ops->get_module_low_power(dev, &low_power, extack);
+	if (ret < 0)
+		return ret;
+	*p_mod = low_power_new != low_power;
+
+	return ops->set_module_low_power(dev, low_power_new, extack);
+}
+
+int ethnl_set_module(struct sk_buff *skb, struct genl_info *info)
+{
+	struct ethnl_req_info req_info = {};
+	struct nlattr **tb = info->attrs;
+	struct net_device *dev;
+	bool mod = false;
+	int ret;
+
+	ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MODULE_HEADER],
+					 genl_info_net(info), info->extack,
+					 true);
+	if (ret < 0)
+		return ret;
+	dev = req_info.dev;
+
+	rtnl_lock();
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		goto out_rtnl;
+
+	ret = module_set_low_power(dev, tb, &mod, info->extack);
+	if (ret < 0)
+		goto out_ops;
+
+	if (!mod)
+		goto out_ops;
+
+	ethtool_notify(dev, ETHTOOL_MSG_MODULE_NTF, NULL);
+
+out_ops:
+	ethnl_ops_complete(dev);
+out_rtnl:
+	rtnl_unlock();
+	dev_put(dev);
+	return ret;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 1797a0a90019..38b44c0291b1 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -282,6 +282,7 @@  ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = {
 	[ETHTOOL_MSG_MODULE_EEPROM_GET]	= &ethnl_module_eeprom_request_ops,
 	[ETHTOOL_MSG_STATS_GET]		= &ethnl_stats_request_ops,
 	[ETHTOOL_MSG_PHC_VCLOCKS_GET]	= &ethnl_phc_vclocks_request_ops,
+	[ETHTOOL_MSG_MODULE_GET]	= &ethnl_module_request_ops,
 };
 
 static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb)
@@ -593,6 +594,7 @@  ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = {
 	[ETHTOOL_MSG_PAUSE_NTF]		= &ethnl_pause_request_ops,
 	[ETHTOOL_MSG_EEE_NTF]		= &ethnl_eee_request_ops,
 	[ETHTOOL_MSG_FEC_NTF]		= &ethnl_fec_request_ops,
+	[ETHTOOL_MSG_MODULE_NTF]	= &ethnl_module_request_ops,
 };
 
 /* default notification handler */
@@ -686,6 +688,7 @@  static const ethnl_notify_handler_t ethnl_notify_handlers[] = {
 	[ETHTOOL_MSG_PAUSE_NTF]		= ethnl_default_notify,
 	[ETHTOOL_MSG_EEE_NTF]		= ethnl_default_notify,
 	[ETHTOOL_MSG_FEC_NTF]		= ethnl_default_notify,
+	[ETHTOOL_MSG_MODULE_NTF]	= ethnl_default_notify,
 };
 
 void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data)
@@ -999,6 +1002,22 @@  static const struct genl_ops ethtool_genl_ops[] = {
 		.policy = ethnl_phc_vclocks_get_policy,
 		.maxattr = ARRAY_SIZE(ethnl_phc_vclocks_get_policy) - 1,
 	},
+	{
+		.cmd	= ETHTOOL_MSG_MODULE_GET,
+		.doit	= ethnl_default_doit,
+		.start	= ethnl_default_start,
+		.dumpit	= ethnl_default_dumpit,
+		.done	= ethnl_default_done,
+		.policy = ethnl_module_get_policy,
+		.maxattr = ARRAY_SIZE(ethnl_module_get_policy) - 1,
+	},
+	{
+		.cmd	= ETHTOOL_MSG_MODULE_SET,
+		.flags	= GENL_UNS_ADMIN_PERM,
+		.doit	= ethnl_set_module,
+		.policy = ethnl_module_set_policy,
+		.maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1,
+	},
 };
 
 static const struct genl_multicast_group ethtool_nl_mcgrps[] = {
diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
index 077aac3929a8..cf0fcbfe3c5c 100644
--- a/net/ethtool/netlink.h
+++ b/net/ethtool/netlink.h
@@ -337,6 +337,7 @@  extern const struct ethnl_request_ops ethnl_fec_request_ops;
 extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops;
 extern const struct ethnl_request_ops ethnl_stats_request_ops;
 extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops;
+extern const struct ethnl_request_ops ethnl_module_request_ops;
 
 extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1];
 extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1];
@@ -373,6 +374,8 @@  extern const struct nla_policy ethnl_fec_set_policy[ETHTOOL_A_FEC_AUTO + 1];
 extern const struct nla_policy ethnl_module_eeprom_get_policy[ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS + 1];
 extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1];
 extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1];
+extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1];
+extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1];
 
 int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info);
 int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info);
@@ -391,6 +394,7 @@  int ethnl_tunnel_info_doit(struct sk_buff *skb, struct genl_info *info);
 int ethnl_tunnel_info_start(struct netlink_callback *cb);
 int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
 int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info);
+int ethnl_set_module(struct sk_buff *skb, struct genl_info *info);
 
 extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN];
 extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];