Message ID | 20210809102152.719961-2-idosch@idosch.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | ethtool: Add ability to control transceiver modules | expand |
On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote: > From: Ido Schimmel <idosch@nvidia.com> > > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver > modules parameters and retrieve their status. Hi Ido I've not read all the patchset yet, but i like the general direction. > The first parameter to control is the low power mode of the module. It > is only relevant for paged memory modules, as flat memory modules always > operate in low power mode. > > When a paged memory module is in low power mode, its power consumption > is reduced to the minimum, the management interface towards the host is > available and the data path is deactivated. > > User space can choose to put modules that are not currently in use in > low power mode and transition them to high power mode before putting the > associated ports administratively up. > > Transitioning into low power mode means loss of carrier, so error is > returned when the netdev is administratively up. However, i don't get this use case. With copper PHYs, putting the link administratively down results in a call into phylib and into the driver to down the link. This effectively puts the PHY into a low power mode. The management interface, as defined by C22 and C45 remain available, but the data path is disabled. For a 1G PHY, this can save a few watts. For SFPs managed by phylink and the kernal SFP driver, the exact same happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus still works, but the data path is disabled. So i would expect a driver using firmware, not Linux code to manage SFPs, to just do this on link down. Why do we need user space involved? Andrew
On Mon, Aug 09, 2021 at 04:28:32PM +0200, Andrew Lunn wrote: > On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote: > > From: Ido Schimmel <idosch@nvidia.com> > > > > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and > > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver > > modules parameters and retrieve their status. > > Hi Ido > > I've not read all the patchset yet, but i like the general direction. > > > The first parameter to control is the low power mode of the module. It > > is only relevant for paged memory modules, as flat memory modules always > > operate in low power mode. > > > > When a paged memory module is in low power mode, its power consumption > > is reduced to the minimum, the management interface towards the host is > > available and the data path is deactivated. > > > > User space can choose to put modules that are not currently in use in > > low power mode and transition them to high power mode before putting the > > associated ports administratively up. > > > > Transitioning into low power mode means loss of carrier, so error is > > returned when the netdev is administratively up. > > However, i don't get this use case. With copper PHYs, putting the link > administratively down results in a call into phylib and into the > driver to down the link. This effectively puts the PHY into a low > power mode. The management interface, as defined by C22 and C45 remain > available, but the data path is disabled. For a 1G PHY, this can save > a few watts. > > For SFPs managed by phylink and the kernal SFP driver, the exact same > happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus > still works, but the data path is disabled. > > So i would expect a driver using firmware, not Linux code to manage > SFPs, to just do this on link down. Why do we need user space > involved? The transition from low power to high power can take a few seconds with QSFP/QSFP-DD and it's likely to only get longer with future / more complex modules. Therefore, to reduce link-up time, the firmware automatically transitions modules to high power mode. There is obviously a trade-off here between power consumption and link-up time. My understanding is that Mellanox is not the only vendor favoring shorter link-up times as users have the ability to control the low power mode of the modules in other implementations. Regarding "why do we need user space involved?", by default, it does not need to be involved (the system works without this API), but if it wants to reduce the power consumption by setting unused modules to low power mode, then it will need to use this API.
> The transition from low power to high power can take a few seconds with > QSFP/QSFP-DD and it's likely to only get longer with future / more > complex modules. Therefore, to reduce link-up time, the firmware > automatically transitions modules to high power mode. > > There is obviously a trade-off here between power consumption and > link-up time. My understanding is that Mellanox is not the only vendor > favoring shorter link-up times as users have the ability to control the > low power mode of the modules in other implementations. > > Regarding "why do we need user space involved?", by default, it does not > need to be involved (the system works without this API), but if it wants > to reduce the power consumption by setting unused modules to low power > mode, then it will need to use this API. O.K. Thanks for the better explanation. Some of this should go into the commit message. I suggest it gets a different name and semantics, to avoid confusion. I think we should consider this the default power mode for when the link is administratively down, rather than direct control over the modules power mode. The driver should transition the module to this setting on link down, be it high power or low power. That saves a lot of complexity, since i assume you currently need a udev script or something which sets it to low power mode on link down, where as you can avoid this be configuring the default and let the driver do it. I also wonder if a hierarchy is needed? You can set the default for the switch, and then override is per module? I _guess_ most users will decide at a switch level they want to save power and pay the penalty over longer link up times. But then we have the question, is it an ethtool option, or a devlink parameter? Andrew
On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > The transition from low power to high power can take a few seconds with > > QSFP/QSFP-DD and it's likely to only get longer with future / more > > complex modules. Therefore, to reduce link-up time, the firmware > > automatically transitions modules to high power mode. > > > > There is obviously a trade-off here between power consumption and > > link-up time. My understanding is that Mellanox is not the only vendor > > favoring shorter link-up times as users have the ability to control the > > low power mode of the modules in other implementations. > > > > Regarding "why do we need user space involved?", by default, it does not > > need to be involved (the system works without this API), but if it wants > > to reduce the power consumption by setting unused modules to low power > > mode, then it will need to use this API. > > O.K. Thanks for the better explanation. Some of this should go into > the commit message. > > I suggest it gets a different name and semantics, to avoid > confusion. I think we should consider this the default power mode for > when the link is administratively down, rather than direct control > over the modules power mode. The driver should transition the module > to this setting on link down, be it high power or low power. That > saves a lot of complexity, since i assume you currently need a udev > script or something which sets it to low power mode on link down, > where as you can avoid this be configuring the default and let the > driver do it. Good point. And actually NICs have similar knobs, exposed via ethtool priv flags today. Intel NICs for example. Maybe we should create a "really power the port down policy" API? Jake do you know what the use cases for Intel are? Are they SFP, MAC, or NC-SI related? > I also wonder if a hierarchy is needed? You can set the default for > the switch, and then override is per module? I _guess_ most users will > decide at a switch level they want to save power and pay the penalty > over longer link up times. But then we have the question, is it an > ethtool option, or a devlink parameter?
On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > The transition from low power to high power can take a few seconds with > > > QSFP/QSFP-DD and it's likely to only get longer with future / more > > > complex modules. Therefore, to reduce link-up time, the firmware > > > automatically transitions modules to high power mode. > > > > > > There is obviously a trade-off here between power consumption and > > > link-up time. My understanding is that Mellanox is not the only vendor > > > favoring shorter link-up times as users have the ability to control the > > > low power mode of the modules in other implementations. > > > > > > Regarding "why do we need user space involved?", by default, it does not > > > need to be involved (the system works without this API), but if it wants > > > to reduce the power consumption by setting unused modules to low power > > > mode, then it will need to use this API. > > > > O.K. Thanks for the better explanation. Some of this should go into > > the commit message. > > > > I suggest it gets a different name and semantics, to avoid > > confusion. I think we should consider this the default power mode for > > when the link is administratively down, rather than direct control > > over the modules power mode. The driver should transition the module > > to this setting on link down, be it high power or low power. That > > saves a lot of complexity, since i assume you currently need a udev > > script or something which sets it to low power mode on link down, > > where as you can avoid this be configuring the default and let the > > driver do it. > > Good point. And actually NICs have similar knobs, exposed via ethtool > priv flags today. Intel NICs for example. Maybe we should create a > "really power the port down policy" API? See below about Intel. I'm not sure it's the same thing... I'm against adding a vague "really power the port down policy" API. The API proposed in the patch is well-defined, its implementation is documented in standards, its implications are clear and we offer APIs that give user space full observability into its operation. A vague API means that it is going to be abused and user space will get different results over different implementations. After reading the *commit messages* about the private flags, I'm not sure what the flags really do, what is their true motivation, implications or how do I get observability into their operation. I'm not too hopeful about the user documentation. Also, like I mentioned in the cover letter, given the complexity of these modules and as they become more common, it is likely that we will need to extend the API to control more parameters and expose more diagnostic information. I would really like to keep it clean and contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over different APIs. > > Jake do you know what the use cases for Intel are? Are they SFP, MAC, > or NC-SI related? I went through all the Intel drivers that implement these operations and I believe you are talking about these commits: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 There isn't too much information about the motivation, but maybe it has something to do with multi-host controllers where you want to prevent one host from taking the physical link down for all the other hosts sharing it? I remember such issues with mlx5. > > > I also wonder if a hierarchy is needed? You can set the default for > > the switch, and then override is per module? I _guess_ most users will > > decide at a switch level they want to save power and pay the penalty > > over longer link up times. But then we have the question, is it an > > ethtool option, or a devlink parameter?
> -----Original Message----- > From: Jakub Kicinski <kuba@kernel.org> > Sent: Tuesday, August 10, 2021 7:00 AM > To: Andrew Lunn <andrew@lunn.ch>; Keller, Jacob E <jacob.e.keller@intel.com> > Cc: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org; > davem@davemloft.net; mkubecek@suse.cz; pali@kernel.org; > vadimp@nvidia.com; mlxsw@nvidia.com; Ido Schimmel <idosch@nvidia.com> > Subject: Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver > modules' low power mode > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > The transition from low power to high power can take a few seconds with > > > QSFP/QSFP-DD and it's likely to only get longer with future / more > > > complex modules. Therefore, to reduce link-up time, the firmware > > > automatically transitions modules to high power mode. > > > > > > There is obviously a trade-off here between power consumption and > > > link-up time. My understanding is that Mellanox is not the only vendor > > > favoring shorter link-up times as users have the ability to control the > > > low power mode of the modules in other implementations. > > > > > > Regarding "why do we need user space involved?", by default, it does not > > > need to be involved (the system works without this API), but if it wants > > > to reduce the power consumption by setting unused modules to low power > > > mode, then it will need to use this API. > > > > O.K. Thanks for the better explanation. Some of this should go into > > the commit message. > > > > I suggest it gets a different name and semantics, to avoid > > confusion. I think we should consider this the default power mode for > > when the link is administratively down, rather than direct control > > over the modules power mode. The driver should transition the module > > to this setting on link down, be it high power or low power. That > > saves a lot of complexity, since i assume you currently need a udev > > script or something which sets it to low power mode on link down, > > where as you can avoid this be configuring the default and let the > > driver do it. > > Good point. And actually NICs have similar knobs, exposed via ethtool > priv flags today. Intel NICs for example. Maybe we should create a > "really power the port down policy" API? > > Jake do you know what the use cases for Intel are? Are they SFP, MAC, > or NC-SI related? Offhand I don't know. I think we have some requirements documents I can look up. I'll try to get back to you soon if I can find any further information. (Yes, I wish the commit messages gave stronger motivation too...) Thanks, Jake > > > I also wonder if a hierarchy is needed? You can set the default for > > the switch, and then override is per module? I _guess_ most users will > > decide at a switch level they want to save power and pay the penalty > > over longer link up times. But then we have the question, is it an > > ethtool option, or a devlink parameter?
On 8/10/2021 1:46 PM, Ido Schimmel wrote: > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: >> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: >>>> The transition from low power to high power can take a few seconds with >>>> QSFP/QSFP-DD and it's likely to only get longer with future / more >>>> complex modules. Therefore, to reduce link-up time, the firmware >>>> automatically transitions modules to high power mode. >>>> >>>> There is obviously a trade-off here between power consumption and >>>> link-up time. My understanding is that Mellanox is not the only vendor >>>> favoring shorter link-up times as users have the ability to control the >>>> low power mode of the modules in other implementations. >>>> >>>> Regarding "why do we need user space involved?", by default, it does not >>>> need to be involved (the system works without this API), but if it wants >>>> to reduce the power consumption by setting unused modules to low power >>>> mode, then it will need to use this API. >>> >>> O.K. Thanks for the better explanation. Some of this should go into >>> the commit message. >>> >>> I suggest it gets a different name and semantics, to avoid >>> confusion. I think we should consider this the default power mode for >>> when the link is administratively down, rather than direct control >>> over the modules power mode. The driver should transition the module >>> to this setting on link down, be it high power or low power. That >>> saves a lot of complexity, since i assume you currently need a udev >>> script or something which sets it to low power mode on link down, >>> where as you can avoid this be configuring the default and let the >>> driver do it. >> >> Good point. And actually NICs have similar knobs, exposed via ethtool >> priv flags today. Intel NICs for example. Maybe we should create a >> "really power the port down policy" API? > > See below about Intel. I'm not sure it's the same thing... > > I'm against adding a vague "really power the port down policy" API. The > API proposed in the patch is well-defined, its implementation is > documented in standards, its implications are clear and we offer APIs > that give user space full observability into its operation. > > A vague API means that it is going to be abused and user space will get > different results over different implementations. After reading the > *commit messages* about the private flags, I'm not sure what the flags > really do, what is their true motivation, implications or how do I get > observability into their operation. I'm not too hopeful about the user > documentation. > > Also, like I mentioned in the cover letter, given the complexity of > these modules and as they become more common, it is likely that we will > need to extend the API to control more parameters and expose more > diagnostic information. I would really like to keep it clean and > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over > different APIs. > >> >> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >> or NC-SI related? > > I went through all the Intel drivers that implement these operations and > I believe you are talking about these commits: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 > > There isn't too much information about the motivation, but maybe it has > something to do with multi-host controllers where you want to prevent > one host from taking the physical link down for all the other hosts > sharing it? I remember such issues with mlx5. > Ok, I found some more information here. The primary motivation of the changes in the i40e and ice drivers is from customer requests asking to have the link go down when the port is administratively disabled. This is because if the link is down then the switch on the other side will see the port not having link and will stop trying to send traffic to it. As far as I can tell, the reason its a flag is because some users wanted the behavior the other way. I'm not sure it's really related to the behavior here. For what it's worth, I'm in favor of containing things like this into ethtool as well. Thanks, Jake
On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote: > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > O.K. Thanks for the better explanation. Some of this should go into > > > the commit message. > > > > > > I suggest it gets a different name and semantics, to avoid > > > confusion. I think we should consider this the default power mode for > > > when the link is administratively down, rather than direct control > > > over the modules power mode. The driver should transition the module > > > to this setting on link down, be it high power or low power. That > > > saves a lot of complexity, since i assume you currently need a udev > > > script or something which sets it to low power mode on link down, > > > where as you can avoid this be configuring the default and let the > > > driver do it. > > > > Good point. And actually NICs have similar knobs, exposed via ethtool > > priv flags today. Intel NICs for example. Maybe we should create a > > "really power the port down policy" API? > > See below about Intel. I'm not sure it's the same thing... > > I'm against adding a vague "really power the port down policy" API. The > API proposed in the patch is well-defined, its implementation is > documented in standards, its implications are clear and we offer APIs > that give user space full observability into its operation. > > A vague API means that it is going to be abused and user space will get > different results over different implementations. After reading the > *commit messages* about the private flags, I'm not sure what the flags > really do, what is their true motivation, implications or how do I get > observability into their operation. I'm not too hopeful about the user > documentation. > > Also, like I mentioned in the cover letter, given the complexity of > these modules and as they become more common, it is likely that we will > need to extend the API to control more parameters and expose more > diagnostic information. I would really like to keep it clean and > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over > different APIs. The patch is well defined but it doesn't provide user with the answer to the question "why is the SFP still up if I asked it to be down?" It's good to match specs closely but Linux may need to reconcile multiple policies. IIUC if Intel decides to keep the SFP up for "other" reasons the situation will look like this: $ ethtool --show-module eth0 Module parameters for eth0: low-power true # ethtool -m eth0 Module State : 0x03 (ModuleReady) LowPwrAllowRequestHW : Off LowPwrRequestSW : Off IOW the low-power mode is a way for user to express preference to shut down/keep up the SFP, but it's not necessarily going to be the only "policy" that matters. If other policies (e.g. NC-SI) express preference to keep the interface up it will stay up, right? The LowPwrRequestSW is not directly controlled by low-power && IF_UP. What I had in mind was something along the lines of a bitmap of reasons which are allowed to keep the interface up, and for your use case the reason would be something like SFP_ALWAYS_ON, but other reasons (well defined) may also keep the ifc up. That's just to explain what I meant, if it's going to be clear to everyone that low-power != LowPwrRequestSW I'm fine either way.
On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC, > >> or NC-SI related? > > > > I went through all the Intel drivers that implement these operations and > > I believe you are talking about these commits: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 > > > > There isn't too much information about the motivation, but maybe it has > > something to do with multi-host controllers where you want to prevent > > one host from taking the physical link down for all the other hosts > > sharing it? I remember such issues with mlx5. > > > > Ok, I found some more information here. The primary motivation of the > changes in the i40e and ice drivers is from customer requests asking to > have the link go down when the port is administratively disabled. This > is because if the link is down then the switch on the other side will > see the port not having link and will stop trying to send traffic to it. > > As far as I can tell, the reason its a flag is because some users wanted > the behavior the other way. > > I'm not sure it's really related to the behavior here. > > For what it's worth, I'm in favor of containing things like this into > ethtool as well. I think the question was the inverse - why not always shut down the port if the interface is brought down?
On 8/10/2021 3:06 PM, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: >>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >>>> or NC-SI related? >>> >>> I went through all the Intel drivers that implement these operations and >>> I believe you are talking about these commits: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 >>> >>> There isn't too much information about the motivation, but maybe it has >>> something to do with multi-host controllers where you want to prevent >>> one host from taking the physical link down for all the other hosts >>> sharing it? I remember such issues with mlx5. >>> >> >> Ok, I found some more information here. The primary motivation of the >> changes in the i40e and ice drivers is from customer requests asking to >> have the link go down when the port is administratively disabled. This >> is because if the link is down then the switch on the other side will >> see the port not having link and will stop trying to send traffic to it. >> >> As far as I can tell, the reason its a flag is because some users wanted >> the behavior the other way. >> >> I'm not sure it's really related to the behavior here. >> >> For what it's worth, I'm in favor of containing things like this into >> ethtool as well. > > I think the question was the inverse - why not always shut down the > port if the interface is brought down? > That... is a better question yes. Unfortunately so far I haven't found any argument for not doing this. Only a bit about many requests to have this behavior. It might just be inertia to maintain current behavior by default...
On 8/10/2021 3:06 PM, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: >>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >>>> or NC-SI related? >>> >>> I went through all the Intel drivers that implement these operations and >>> I believe you are talking about these commits: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 >>> >>> There isn't too much information about the motivation, but maybe it has >>> something to do with multi-host controllers where you want to prevent >>> one host from taking the physical link down for all the other hosts >>> sharing it? I remember such issues with mlx5. >>> >> >> Ok, I found some more information here. The primary motivation of the >> changes in the i40e and ice drivers is from customer requests asking to >> have the link go down when the port is administratively disabled. This >> is because if the link is down then the switch on the other side will >> see the port not having link and will stop trying to send traffic to it. >> >> As far as I can tell, the reason its a flag is because some users wanted >> the behavior the other way. >> >> I'm not sure it's really related to the behavior here. >> >> For what it's worth, I'm in favor of containing things like this into >> ethtool as well. > > I think the question was the inverse - why not always shut down the > port if the interface is brought down? > So far the best I've found after digging is that forcing total shutdown causes manageability and VF traffic to stop. Other customers want this traffic to continue even when the PF port is brought down. Thanks, Jake
On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: > > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC, > > >> or NC-SI related? > > > > > > I went through all the Intel drivers that implement these operations and > > > I believe you are talking about these commits: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 > > > > > > There isn't too much information about the motivation, but maybe it has > > > something to do with multi-host controllers where you want to prevent > > > one host from taking the physical link down for all the other hosts > > > sharing it? I remember such issues with mlx5. > > > > > > > Ok, I found some more information here. The primary motivation of the > > changes in the i40e and ice drivers is from customer requests asking to > > have the link go down when the port is administratively disabled. This > > is because if the link is down then the switch on the other side will > > see the port not having link and will stop trying to send traffic to it. > > > > As far as I can tell, the reason its a flag is because some users wanted > > the behavior the other way. > > > > I'm not sure it's really related to the behavior here. > > I think the question was the inverse - why not always shut down the > port if the interface is brought down? Humm. Something does not seem right here. I would assume that when you administratively configure the link down, the SERDES in the MAC would stop sending anything. So the module has nothing to send. The link peer SERDES then looses sync, and reports that upwards as carrier lost. Does the i40e and ice leave its SERDES running when the link is configured down? Or is the switch FUBAR and does not consider SERDES loss of sync as carrier down? Ido's use case does seem to be different. The link is down. Do we want to leave the module active, probably sending a bit stream of all 0, maybe noise, or do we want to power the module down, so it does not send anything at all. Andrew
> The patch is well defined but it doesn't provide user with the answer > to the question "why is the SFP still up if I asked it to be down?" Can the user even see that the SFP is still up? Does ip link show give: 3: eth42: <BROADCAST,MULTICAST,LOWER_UP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 i.e. LOWER_UP despite DOWN? Roopa added: rtnetlink: add support for protodown reason Maybe we need the opposite as well? Andrew
On 8/10/2021 3:31 PM, Andrew Lunn wrote: > On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote: >> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: >>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >>>>> or NC-SI related? >>>> >>>> I went through all the Intel drivers that implement these operations and >>>> I believe you are talking about these commits: >>>> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 >>>> >>>> There isn't too much information about the motivation, but maybe it has >>>> something to do with multi-host controllers where you want to prevent >>>> one host from taking the physical link down for all the other hosts >>>> sharing it? I remember such issues with mlx5. >>>> >>> >>> Ok, I found some more information here. The primary motivation of the >>> changes in the i40e and ice drivers is from customer requests asking to >>> have the link go down when the port is administratively disabled. This >>> is because if the link is down then the switch on the other side will >>> see the port not having link and will stop trying to send traffic to it. >>> >>> As far as I can tell, the reason its a flag is because some users wanted >>> the behavior the other way. >>> >>> I'm not sure it's really related to the behavior here. >> >> I think the question was the inverse - why not always shut down the >> port if the interface is brought down? > > Humm. Something does not seem right here. I would assume that when you > administratively configure the link down, the SERDES in the MAC would > stop sending anything. So the module has nothing to send. The link > peer SERDES then looses sync, and reports that upwards as carrier > lost. > Right.... > Does the i40e and ice leave its SERDES running when the link is > configured down? Or is the switch FUBAR and does not consider SERDES > loss of sync as carrier down? > It's not clear to me. I tried to test with the driver, and it looks like upstream doesn't yet have the link-down-on-close merged into net-next, so I grabbed our out-of-tree driver. Interestingly, both ip link show and ethtool do not report link as up when the device is closed (ip link set enp175f0s0 down)..... So... whatever difference link-down-on-close makes we're definitely not reporting things up. I don't have a setup to confirm anything else right now unfortunately, but I suspect something is wrong with the implementation of link-down-on-close (at the very least it seems like we should still be reporting LOWER_UP.... no?) Unless there's some other weirdness like with QSFP or other multi-port-single-cable setups? I even tried adding some VFs and I see that regardless of whether link-down-on-close is set, I can see link up on the VF.... Hmmmmm. > Ido's use case does seem to be different. The link is down. Do we want > to leave the module active, probably sending a bit stream of all 0, > maybe noise, or do we want to power the module down, so it does not > send anything at all. > > Andrew > Right, I think these two cases are very different.
On Tue, Aug 10, 2021 at 03:05:44PM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote: > > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: > > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > > O.K. Thanks for the better explanation. Some of this should go into > > > > the commit message. > > > > > > > > I suggest it gets a different name and semantics, to avoid > > > > confusion. I think we should consider this the default power mode for > > > > when the link is administratively down, rather than direct control > > > > over the modules power mode. The driver should transition the module > > > > to this setting on link down, be it high power or low power. That > > > > saves a lot of complexity, since i assume you currently need a udev > > > > script or something which sets it to low power mode on link down, > > > > where as you can avoid this be configuring the default and let the > > > > driver do it. > > > > > > Good point. And actually NICs have similar knobs, exposed via ethtool > > > priv flags today. Intel NICs for example. Maybe we should create a > > > "really power the port down policy" API? > > > > See below about Intel. I'm not sure it's the same thing... > > > > I'm against adding a vague "really power the port down policy" API. The > > API proposed in the patch is well-defined, its implementation is > > documented in standards, its implications are clear and we offer APIs > > that give user space full observability into its operation. > > > > A vague API means that it is going to be abused and user space will get > > different results over different implementations. After reading the > > *commit messages* about the private flags, I'm not sure what the flags > > really do, what is their true motivation, implications or how do I get > > observability into their operation. I'm not too hopeful about the user > > documentation. > > > > Also, like I mentioned in the cover letter, given the complexity of > > these modules and as they become more common, it is likely that we will > > need to extend the API to control more parameters and expose more > > diagnostic information. I would really like to keep it clean and > > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over > > different APIs. > > The patch is well defined but it doesn't provide user with the answer > to the question "why is the SFP still up if I asked it to be down?" But you didn't ask the module to be "down", you asked the MAC. See more below. > It's good to match specs closely but Linux may need to reconcile > multiple policies. > > IIUC if Intel decides to keep the SFP up for "other" reasons the > situation will look like this: Intel did not decide to keep the module "up", it decided to keep both the MAC and the module up. If one of them was down, the peer would notice it, but it didn't (according to Jake's reply). This is a very problematic behavior as you are telling your peer that everything is fine, but in practice you are dropping all of its packets. I hit the exact same issue with mlx5 a few years ago and when I investigated the reason for this behavior I was referred to multi-host systems where you don't want one host to take down the shared link for all the rest. See: https://www.mellanox.com/sites/default/files/doc-2020/sb-externally-connected-multi-host.pdf > > $ ethtool --show-module eth0 > Module parameters for eth0: > low-power true > > # ethtool -m eth0 > Module State : 0x03 (ModuleReady) > LowPwrAllowRequestHW : Off > LowPwrRequestSW : Off This output is wrong. In Intel's case "ip link show" will report the link as down (according to Jake's reply) despite the MAC being up. On the other hand, the output of "ethtool --show-module eth0" will show that the module is not in low power mode and it will be correct. > > > IOW the low-power mode is a way for user to express preference to > shut down/keep up the SFP, Yes, it controls the module, not the MAC. If you want to get a carrier, both the module and the MAC need to be operational. See following example where swp13 and swp14 are connected to each other: $ ip link show dev swp13 127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff $ ip link show dev swp14 128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff # ip link set dev swp13 down # ethtool --set-module swp13 low-power on $ ethtool --show-module swp13 Module parameters for swp13: low-power true # ip link set dev swp13 up $ ip link show dev swp13 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff $ ip link show dev swp14 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff # ip link set dev swp13 down # ethtool --set-module swp13 low-power off $ ethtool --show-module swp13 Module parameters for swp13: low-power false # ip link set dev swp13 up $ ip link show dev swp13 127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff $ ip link show dev swp14 128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff > but it's not necessarily going to be the only "policy" that matters. > If other policies (e.g. NC-SI) express preference to keep the > interface up it will stay up, right? > > The LowPwrRequestSW is not directly controlled by low-power && IF_UP. > > What I had in mind was something along the lines of a bitmap of reasons > which are allowed to keep the interface up, and for your use case > the reason would be something like SFP_ALWAYS_ON, but other reasons > (well defined) may also keep the ifc up. > > That's just to explain what I meant, if it's going to be clear to > everyone that low-power != LowPwrRequestSW I'm fine either way. I think we mixed two different use cases here. The first is a way to make sure the link is fully operational (both MAC and module). In this case, contrary to the expected behavior, "ip link set dev eth0 down" will not take the MAC down and the peer will not lose its carrier. This is most likely motivated by special hardware designs or exotic use cases like I mentioned above. The use case for this patch is completely different. Today, when you do "ip link set dev eth0 up" you expect to gain a carrier which means that both the MAC and the module are operational. The latter can be made operational following the user request (e.g., SFP driver) or as soon as it was plugged-in, by the device's firmware. When you do "ip link set dev eth0 down" you expect the reverse to happen and this is what happens today. In implementations where the module is always operational, it stays in high power mode and continues to get warm and consume unnecessary amount of power. Some users might not be OK with that and would like more control, which is exactly what this patch is doing. This patch does not change existing behavior, the API has clear semantics and implications and user space has full observability into its operation. If in the future someone sees the need to add 'protoup', it can be done: # ip link set dev eth0 up # MAC and module are operational # ip link set dev eth0 protoup on # ip link set dev eth0 down # returns an error / ignore # ethtool --set-module eth0 low-power on # returns an error because of admin up You can obviously engineer situations that do not make any sense. Like this: # ethtool --set-module eth0 low-power on # ip link set dev eth0 up # ip link set dev eth0 protoup on The MAC is operational and you can't take it down, but you will never get a carrier because you explicitly asked the module to stay in low power mode.
On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote: > # ethtool --set-module swp13 low-power on > > $ ethtool --show-module swp13 > Module parameters for swp13: > low-power true > > # ip link set dev swp13 up > > $ ip link show dev swp13 > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff > > $ ip link show dev swp14 > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff Oh, so if we set low-power true the carrier will never show up? I thought Andrew suggested the setting is only taken into account when netdev is down. That made so much sense to me I assumed we'll just go with that. I must have misunderstood.
On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote: > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote: > > # ethtool --set-module swp13 low-power on > > > > $ ethtool --show-module swp13 > > Module parameters for swp13: > > low-power true > > > > # ip link set dev swp13 up > > > > $ ip link show dev swp13 > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff > > > > $ ip link show dev swp14 > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff > > Oh, so if we set low-power true the carrier will never show up? > I thought Andrew suggested the setting is only taken into account > when netdev is down. Yes, that was my intention. If this low power mode also applies when the interface is admin up, it sounds like a foot gun. ip link show gives you no idea why the carrier is down, and people will assume the cable or peer is broken. We at least need a new flag, LOWER_DISABLED or similar to give the poor user some chance to figure out what is going on. To me, this setting should only apply when the link is admin down. Andrew
On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote: > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote: > > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote: > > > # ethtool --set-module swp13 low-power on > > > > > > $ ethtool --show-module swp13 > > > Module parameters for swp13: > > > low-power true > > > > > > # ip link set dev swp13 up > > > > > > $ ip link show dev swp13 > > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > > link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff > > > > > > $ ip link show dev swp14 > > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > > link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff > > > > Oh, so if we set low-power true the carrier will never show up? > > I thought Andrew suggested the setting is only taken into account > > when netdev is down. > > Yes, that was my intention. If this low power mode also applies when > the interface is admin up, it sounds like a foot gun. ip link show > gives you no idea why the carrier is down, and people will assume the > cable or peer is broken. We at least need a new flag, LOWER_DISABLED > or similar to give the poor user some chance to figure out what is > going on. The canonical way to report such errors is via LINKSTATE_GET and I will add an extended sub-state describing the problem. > > To me, this setting should only apply when the link is admin down. I don't want to bake such an assumption into the kernel, but I have a suggestion that resolves the issue. We currently have a single attribute that encodes the desired state on SET messages and the operational state on GET_REPLY messages (ETHTOOL_A_MODULE_LOW_POWER_ENABLED): $ ethtool --show-module swp11 Module parameters for swp11: low-power true It is not defined very well when a module is not connected despite being a very interesting use case. We really need to have two attributes. The first one describing the power mode policy and the second one describing the operational power mode which is only reported when a module is plugged in. For the policy we can have these values: 1. low: Always transition the module to low power mode 2. high: Always transition the module to high power mode 3. high-on-up: Transition the module to high power mode when a port using it is administratively up. Otherwise, low A different policy for up/down seems like an overkill for me. See example usage below. No module connected: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high Like I mentioned before, this is the default on Mellanox systems so this new attribute allows user space to query the default policy. Change to a different policy: # ethtool --set-module swp11 power-mode-policy high-on-up $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high-on-up After a module was connected: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high-on-up power-mode low # ip link set dev swp11 up $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high-on-up low-power high # ip link set dev swp11 down # ethtool --set-module swp11 power-mode-policy low # ip link set dev swp11 up $ ethtool swp11 ... Link detected: no (Cable issue, Module is in low power mode) I'm quite happy with the above. Might change a few things as I implement it, but you get the gist. WDYT?
On Wed, 11 Aug 2021 22:37:53 +0300 Ido Schimmel wrote: > On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote: > > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote: > > > Oh, so if we set low-power true the carrier will never show up? > > > I thought Andrew suggested the setting is only taken into account > > > when netdev is down. > > > > Yes, that was my intention. If this low power mode also applies when > > the interface is admin up, it sounds like a foot gun. ip link show > > gives you no idea why the carrier is down, and people will assume the > > cable or peer is broken. We at least need a new flag, LOWER_DISABLED > > or similar to give the poor user some chance to figure out what is > > going on. > > The canonical way to report such errors is via LINKSTATE_GET and I will > add an extended sub-state describing the problem. > > > To me, this setting should only apply when the link is admin down. > > I don't want to bake such an assumption into the kernel, but I have a > suggestion that resolves the issue. > > We currently have a single attribute that encodes the desired state on > SET messages and the operational state on GET_REPLY messages > (ETHTOOL_A_MODULE_LOW_POWER_ENABLED): > > $ ethtool --show-module swp11 > Module parameters for swp11: > low-power true > > It is not defined very well when a module is not connected despite being > a very interesting use case. We really need to have two attributes. The > first one describing the power mode policy and the second one describing > the operational power mode which is only reported when a module is > plugged in. > > For the policy we can have these values: > > 1. low: Always transition the module to low power mode > 2. high: Always transition the module to high power mode > 3. high-on-up: Transition the module to high power mode when a port > using it is administratively up. Otherwise, low > > A different policy for up/down seems like an overkill for me. > > See example usage below. > > No module connected: > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high > > Like I mentioned before, this is the default on Mellanox systems so this > new attribute allows user space to query the default policy. > > Change to a different policy: > > # ethtool --set-module swp11 power-mode-policy high-on-up > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > > After a module was connected: > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > power-mode low > > # ip link set dev swp11 up > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > low-power high > > # ip link set dev swp11 down > > # ethtool --set-module swp11 power-mode-policy low > > # ip link set dev swp11 up > > $ ethtool swp11 > ... > Link detected: no (Cable issue, Module is in low power mode) > > I'm quite happy with the above. Might change a few things as I implement > it, but you get the gist. WDYT? Isn't the "low-power" attr just duplicating the relevant bits from -m?
> For the policy we can have these values: > > 1. low: Always transition the module to low power mode > 2. high: Always transition the module to high power mode > 3. high-on-up: Transition the module to high power mode when a port > using it is administratively up. Otherwise, low > > A different policy for up/down seems like an overkill for me. O.K. The current kernel SFP driver is going to default to high-on-up, which is what kernel driven copper PHYs also do. > After a module was connected: > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > power-mode low > > # ip link set dev swp11 up > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > low-power high > > # ip link set dev swp11 down You missed a show here. I expect it to be: > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > power-mode low since it is now down. I suppose we should consider the bigger picture. Is this feature limited to just SFP modules, or does it make sense to any other sort of network technology? CAN, bluetooth, 5G, IPoAC? Andrew
> Isn't the "low-power" attr just duplicating the relevant bits from -m?
Do all SFPs report it in the dump? I'm thinking GPON, 1G modules with
a TX_ENABLE pin? INF-8074 does not specify a bit in the 'EEPROM' data
to indicate the status. So you need to know the state of the GPIO
driving the TX_ENABLE pin.
Andrew
On Wed, Aug 11, 2021 at 01:30:06PM -0700, Jakub Kicinski wrote:
> Isn't the "low-power" attr just duplicating the relevant bits from -m?
If low power is forced by hardware, then it depends on the assertion /
de-assertion of the hardware signal which is obviously not part of the
EEPROM dump.
I knew this would come up, so I mentioned it in the commit message:
"The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space."
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index c86628e6a235..07eac5bc9cfc 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -179,7 +179,7 @@ according to message purpose: Userspace to kernel: - ===================================== ================================ + ===================================== ================================= ``ETHTOOL_MSG_STRSET_GET`` get string set ``ETHTOOL_MSG_LINKINFO_GET`` get link settings ``ETHTOOL_MSG_LINKINFO_SET`` set link settings @@ -213,7 +213,9 @@ Userspace to kernel: ``ETHTOOL_MSG_MODULE_EEPROM_GET`` read SFP module EEPROM ``ETHTOOL_MSG_STATS_GET`` get standard statistics ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` get PHC virtual clocks info - ===================================== ================================ + ``ETHTOOL_MSG_MODULE_SET`` set transceiver module parameters + ``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters + ===================================== ================================= Kernel to userspace: @@ -252,6 +254,7 @@ Kernel to userspace: ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY`` read SFP module EEPROM ``ETHTOOL_MSG_STATS_GET_REPLY`` standard statistics ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info + ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters ======================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -1498,6 +1501,52 @@ Kernel response contents: ``ETHTOOL_A_PHC_VCLOCKS_INDEX`` s32 PHC index array ==================================== ====== ========================== +MODULE_GET +========== + +Gets transceiver module parameters. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_MODULE_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ====================================== ====== ========================== + ``ETHTOOL_A_MODULE_HEADER`` nested reply header + ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` bool low power mode is enabled + ====================================== ====== ========================== + +The optional ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` attribute encodes whether +low power mode is forced by the host. + +MODULE_SET +========== + +Sets transceiver module parameters. + +Request contents: + + ====================================== ====== ========================== + ``ETHTOOL_A_MODULE_HEADER`` nested request header + ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` bool low power mode is enabled + ====================================== ====== ========================== + +When set, the optional ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` attribute is used +to force the module to stay in low power mode or force it back into low power +mode. When cleared, capable modules can transition to high power mode. + +To avoid changes to the operational state of the device, low power mode can +only be set when the device is administratively down. + +For SFF-8636 modules, low power mode is forced by the host according to table +6-10 in revision 2.10a of the specification. + +For CMIS modules, low power mode is forced by the host according to table 6-12 +in revision 5.0 of the specification. + Request translation =================== @@ -1597,4 +1646,6 @@ are netlink only. n/a ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT`` n/a ``ETHTOOL_MSG_TUNNEL_INFO_GET`` n/a ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` + n/a ``ETHTOOL_MSG_MODULE_GET`` + n/a ``ETHTOOL_MSG_MODULE_SET`` =================================== ===================================== diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index 4711b96dae0c..04286debdcdc 100644 --- a/include/linux/ethtool.h +++ b/include/linux/ethtool.h @@ -570,6 +570,10 @@ struct ethtool_module_eeprom { * @get_eth_ctrl_stats: Query some of the IEEE 802.3 MAC Ctrl statistics. * @get_rmon_stats: Query some of the RMON (RFC 2819) statistics. * Set %ranges to a pointer to zero-terminated array of byte ranges. + * @get_module_low_power: Get the low power mode status of the plug-in module + * used by the network device. + * @set_module_low_power: Set the low power mode status of the plug-in module + * used by the network device. * * All operations are optional (i.e. the function pointer may be set * to %NULL) and callers must take this into account. Callers must @@ -689,6 +693,11 @@ struct ethtool_ops { void (*get_rmon_stats)(struct net_device *dev, struct ethtool_rmon_stats *rmon_stats, const struct ethtool_rmon_hist_range **ranges); + int (*get_module_low_power)(struct net_device *dev, + bool *p_low_power, + struct netlink_ext_ack *extack); + int (*set_module_low_power)(struct net_device *dev, bool low_power, + struct netlink_ext_ack *extack); }; int ethtool_check_ops(const struct ethtool_ops *ops); diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index b3b93710eff7..72fb821f3928 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -47,6 +47,8 @@ enum { ETHTOOL_MSG_MODULE_EEPROM_GET, ETHTOOL_MSG_STATS_GET, ETHTOOL_MSG_PHC_VCLOCKS_GET, + ETHTOOL_MSG_MODULE_GET, + ETHTOOL_MSG_MODULE_SET, /* add new constants above here */ __ETHTOOL_MSG_USER_CNT, @@ -90,6 +92,8 @@ enum { ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY, ETHTOOL_MSG_STATS_GET_REPLY, ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY, + ETHTOOL_MSG_MODULE_GET_REPLY, + ETHTOOL_MSG_MODULE_NTF, /* add new constants above here */ __ETHTOOL_MSG_KERNEL_CNT, @@ -831,6 +835,18 @@ enum { ETHTOOL_A_STATS_RMON_MAX = (__ETHTOOL_A_STATS_RMON_CNT - 1) }; +/* MODULE */ + +enum { + ETHTOOL_A_MODULE_UNSPEC, + ETHTOOL_A_MODULE_HEADER, /* nest - _A_HEADER_* */ + ETHTOOL_A_MODULE_LOW_POWER_ENABLED, /* u8 */ + + /* add new constants above here */ + __ETHTOOL_A_MODULE_CNT, + ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1) +}; + /* generic netlink info */ #define ETHTOOL_GENL_NAME "ethtool" #define ETHTOOL_GENL_VERSION 1 diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile index 0a19470efbfb..b76432e70e6b 100644 --- a/net/ethtool/Makefile +++ b/net/ethtool/Makefile @@ -7,4 +7,4 @@ obj-$(CONFIG_ETHTOOL_NETLINK) += ethtool_nl.o ethtool_nl-y := netlink.o bitset.o strset.o linkinfo.o linkmodes.o \ linkstate.o debug.o wol.o features.o privflags.o rings.o \ channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \ - tunnels.o fec.o eeprom.o stats.o phc_vclocks.o + tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o diff --git a/net/ethtool/module.c b/net/ethtool/module.c new file mode 100644 index 000000000000..947f2188d725 --- /dev/null +++ b/net/ethtool/module.c @@ -0,0 +1,184 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include <linux/ethtool.h> + +#include "netlink.h" +#include "common.h" +#include "bitset.h" + +struct module_req_info { + struct ethnl_req_info base; +}; + +struct module_reply_data { + struct ethnl_reply_data base; + u8 low_power:1, + low_power_valid:1; +}; + +#define MODULE_REPDATA(__reply_base) \ + container_of(__reply_base, struct module_reply_data, base) + +/* MODULE_GET */ + +const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1] = { + [ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), +}; + +static int module_get_low_power(struct net_device *dev, + struct module_reply_data *data, + struct netlink_ext_ack *extack) +{ + const struct ethtool_ops *ops = dev->ethtool_ops; + bool low_power; + int ret; + + if (!ops->get_module_low_power) + return 0; + + ret = ops->get_module_low_power(dev, &low_power, extack); + if (ret < 0) + return ret; + + data->low_power = low_power; + data->low_power_valid = true; + + return 0; +} + +static int module_prepare_data(const struct ethnl_req_info *req_base, + struct ethnl_reply_data *reply_base, + struct genl_info *info) +{ + struct module_reply_data *data = MODULE_REPDATA(reply_base); + struct netlink_ext_ack *extack = info ? info->extack : NULL; + struct net_device *dev = reply_base->dev; + int ret; + + ret = ethnl_ops_begin(dev); + if (ret < 0) + return ret; + + ret = module_get_low_power(dev, data, extack); + if (ret < 0) + goto out_complete; + +out_complete: + ethnl_ops_complete(dev); + return ret; +} + +static int module_reply_size(const struct ethnl_req_info *req_base, + const struct ethnl_reply_data *reply_base) +{ + struct module_reply_data *data = MODULE_REPDATA(reply_base); + int len = 0; + + if (data->low_power_valid) + len += nla_total_size(sizeof(u8)); /* _MODULE_LOW_POWER_ENABLED */ + + return len; +} + +static int module_fill_reply(struct sk_buff *skb, + const struct ethnl_req_info *req_base, + const struct ethnl_reply_data *reply_base) +{ + const struct module_reply_data *data = MODULE_REPDATA(reply_base); + + if (data->low_power_valid && + nla_put_u8(skb, ETHTOOL_A_MODULE_LOW_POWER_ENABLED, + data->low_power)) + return -EMSGSIZE; + + return 0; +} + +const struct ethnl_request_ops ethnl_module_request_ops = { + .request_cmd = ETHTOOL_MSG_MODULE_GET, + .reply_cmd = ETHTOOL_MSG_MODULE_GET_REPLY, + .hdr_attr = ETHTOOL_A_MODULE_HEADER, + .req_info_size = sizeof(struct module_req_info), + .reply_data_size = sizeof(struct module_reply_data), + + .prepare_data = module_prepare_data, + .reply_size = module_reply_size, + .fill_reply = module_fill_reply, +}; + +/* MODULE_SET */ + +const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1] = { + [ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), + [ETHTOOL_A_MODULE_LOW_POWER_ENABLED] = NLA_POLICY_MAX(NLA_U8, 1), +}; + +static int module_set_low_power(struct net_device *dev, struct nlattr **tb, + bool *p_mod, struct netlink_ext_ack *extack) +{ + const struct ethtool_ops *ops = dev->ethtool_ops; + bool low_power_new, low_power; + int ret; + + if (!tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED]) + return 0; + + if (!ops->get_module_low_power || !ops->set_module_low_power) { + NL_SET_ERR_MSG_ATTR(extack, + tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED], + "Setting low power mode is not supported by this device"); + return -EOPNOTSUPP; + } + + if (netif_running(dev)) { + NL_SET_ERR_MSG_ATTR(extack, + tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED], + "Cannot set low power mode when port is administratively up"); + return -EINVAL; + } + + low_power_new = !!nla_get_u8(tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED]); + ret = ops->get_module_low_power(dev, &low_power, extack); + if (ret < 0) + return ret; + *p_mod = low_power_new != low_power; + + return ops->set_module_low_power(dev, low_power_new, extack); +} + +int ethnl_set_module(struct sk_buff *skb, struct genl_info *info) +{ + struct ethnl_req_info req_info = {}; + struct nlattr **tb = info->attrs; + struct net_device *dev; + bool mod = false; + int ret; + + ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MODULE_HEADER], + genl_info_net(info), info->extack, + true); + if (ret < 0) + return ret; + dev = req_info.dev; + + rtnl_lock(); + ret = ethnl_ops_begin(dev); + if (ret < 0) + goto out_rtnl; + + ret = module_set_low_power(dev, tb, &mod, info->extack); + if (ret < 0) + goto out_ops; + + if (!mod) + goto out_ops; + + ethtool_notify(dev, ETHTOOL_MSG_MODULE_NTF, NULL); + +out_ops: + ethnl_ops_complete(dev); +out_rtnl: + rtnl_unlock(); + dev_put(dev); + return ret; +} diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c index 1797a0a90019..38b44c0291b1 100644 --- a/net/ethtool/netlink.c +++ b/net/ethtool/netlink.c @@ -282,6 +282,7 @@ ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = { [ETHTOOL_MSG_MODULE_EEPROM_GET] = ðnl_module_eeprom_request_ops, [ETHTOOL_MSG_STATS_GET] = ðnl_stats_request_ops, [ETHTOOL_MSG_PHC_VCLOCKS_GET] = ðnl_phc_vclocks_request_ops, + [ETHTOOL_MSG_MODULE_GET] = ðnl_module_request_ops, }; static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb) @@ -593,6 +594,7 @@ ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = { [ETHTOOL_MSG_PAUSE_NTF] = ðnl_pause_request_ops, [ETHTOOL_MSG_EEE_NTF] = ðnl_eee_request_ops, [ETHTOOL_MSG_FEC_NTF] = ðnl_fec_request_ops, + [ETHTOOL_MSG_MODULE_NTF] = ðnl_module_request_ops, }; /* default notification handler */ @@ -686,6 +688,7 @@ static const ethnl_notify_handler_t ethnl_notify_handlers[] = { [ETHTOOL_MSG_PAUSE_NTF] = ethnl_default_notify, [ETHTOOL_MSG_EEE_NTF] = ethnl_default_notify, [ETHTOOL_MSG_FEC_NTF] = ethnl_default_notify, + [ETHTOOL_MSG_MODULE_NTF] = ethnl_default_notify, }; void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data) @@ -999,6 +1002,22 @@ static const struct genl_ops ethtool_genl_ops[] = { .policy = ethnl_phc_vclocks_get_policy, .maxattr = ARRAY_SIZE(ethnl_phc_vclocks_get_policy) - 1, }, + { + .cmd = ETHTOOL_MSG_MODULE_GET, + .doit = ethnl_default_doit, + .start = ethnl_default_start, + .dumpit = ethnl_default_dumpit, + .done = ethnl_default_done, + .policy = ethnl_module_get_policy, + .maxattr = ARRAY_SIZE(ethnl_module_get_policy) - 1, + }, + { + .cmd = ETHTOOL_MSG_MODULE_SET, + .flags = GENL_UNS_ADMIN_PERM, + .doit = ethnl_set_module, + .policy = ethnl_module_set_policy, + .maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1, + }, }; static const struct genl_multicast_group ethtool_nl_mcgrps[] = { diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h index 077aac3929a8..cf0fcbfe3c5c 100644 --- a/net/ethtool/netlink.h +++ b/net/ethtool/netlink.h @@ -337,6 +337,7 @@ extern const struct ethnl_request_ops ethnl_fec_request_ops; extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops; extern const struct ethnl_request_ops ethnl_stats_request_ops; extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops; +extern const struct ethnl_request_ops ethnl_module_request_ops; extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1]; extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1]; @@ -373,6 +374,8 @@ extern const struct nla_policy ethnl_fec_set_policy[ETHTOOL_A_FEC_AUTO + 1]; extern const struct nla_policy ethnl_module_eeprom_get_policy[ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS + 1]; extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1]; extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1]; +extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1]; +extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1]; int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info); int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info); @@ -391,6 +394,7 @@ int ethnl_tunnel_info_doit(struct sk_buff *skb, struct genl_info *info); int ethnl_tunnel_info_start(struct netlink_callback *cb); int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info); +int ethnl_set_module(struct sk_buff *skb, struct genl_info *info); extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN]; extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];