diff mbox series

[RFC] Providing a helper for PCS inband negotiation

Message ID ZGIkGmyL8yL1q1zp@shell.armlinux.org.uk (mailing list archive)
State RFC
Headers show
Series [RFC] Providing a helper for PCS inband negotiation | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Russell King (Oracle) May 15, 2023, 12:22 p.m. UTC
All,

One issue that seems to come up with PCS drivers is when should inband
negotiation be enabled. Some SFP modules have specific requirements when
it comes to whether inband AN should be enabled.

Having each network driver implement their own decision making is rather
sub-optimal as this doesn't lead to maximum inter-operability and a
consistent experience with SFP modules when used with different network
drivers.

This has been bugging me for a while, and I've had in my net-queue some
patches that introduced a simple helper that took the code in
phylink_mii_c22_pcs_config() and pulled that into an inline function:

        if (mode == MLO_AN_INBAND &&
            (interface == PHY_INTERFACE_MODE_SGMII ||
             interface == PHY_INTERFACE_MODE_QSGMII ||
             linkmode_test_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, advertising)))

but I haven't been happy with this approach. I did this when working on
the mv88e6xxx conversion to phylink-PCS, because I didn't want to
duplicate that decision making in that driver.

Having been through all PCS drivers thus far, the majority of them are
concerned with "should some form of in-band signalling be enabled or
not?" and a simple boolean would suffice for that.

However, there are two (sparx5 and lan966x) which appear to want
slightly more information than that. These have a port configuration
structure that wants to know if we are using inband mode (in other
words, phylink is in MLO_AN_INBAND mode) and whether autoneg is
enabled (the Autoneg bit in the advertising bitmap.) See
lan966x_pcs_config().

This is further decoded in lan966x_port_pcs_set() such that if we're in
MLO_AN_INBAND mode, and:
 if we're using SGMII, or
 if the number of ports is four (QSGMII, QUSGMII), or
 if we're in 1000base-X mode and Autoneg is set,
then inband mode is enabled. However, if we aren't in MLO_AN_INBAND
mode, then an "outband" mode is used (driver's terminology, but the
comments about it seem to talk about "inband" here.)

It looks like Sparx5 is pretty similar to lan966x.

So, it seems that a boolean won't do, and we at least need a tristate
outband/inband-an-disabled/inband-an-enabled indication.

However, we also have to remember that there are protocols such as
10GBASE-R which do not have any inband capability. This suggests we
actually need a quad-state - none/outband/inband-an-disabled/
inband-an-enabled. That said, we have hardly any PCS that this would
end up returning "NONE" - but there are some (e.g. mvneta and mvpp2
which always provide a PCS, even for e.g. RGMII.)

So, to cover this, I'm proposing a series of steps:

1. introduce a new helper function - phylink_pcs_neg_mode() - as per
   the patch below
2. switch phylink_mii_c22_pcs_config() to use it
3. pull that helper into phylink_mii_c22_pcs_config() callers
4. convert all pcs_config() implementations that fiddle with inband AN
   configuration to use this helper
5. finally pull the helper up into phylink and change the pcs_config()
   method prototype:

-       int (*pcs_config)(struct phylink_pcs *pcs, unsigned int mode,
-                         phy_interface_t interface,
+       int (*pcs_config)(struct phylink_pcs *pcs, phy_interface_t interface,
                          const unsigned long *advertising,
+                         unsigned int neg_mode,
                          bool permit_pause_to_mac);

There are two questions that came up while creating the patches for the
above five steps:

1. Should 10GBASE-KR be included in the SGMII et.al. case in the code?
   Any other interface modes that should be there? Obviously,
   PHYLINK_PCS_NEG_NONE is not correct for 10GBASE-KR since it does use
   inband AN. Does it make sense for the user to disable inband AN for
   10GBASE-KR? If so, maybe it should be under the 1000base-X case.

2. XLGMII.. Looking at the XPCS driver, it's unclear whether Clause 73
   AN gets used for this. A quick scan of IEEE 802.3 suggests that
   XLGMII doesn't have any support for any inband signalling, and it's
   just an intermediary protocol between the MAC (more specifically the
   RS, but for the purposes of this I'll just refer to MAC) and the
   attached PCS, and any autonegotiation happens after the XLGMII link.

   In terms of PCS, until PHY_INTERFACE_MODE_XLGMII was introduced, the
   PHY interface mode for serdes based protocols described the protocol
   used on the media side of the PCS and the next part of the system
   towards the media, but XLGMII describes the MAC-to-PCS protocol
   (which could equally be GMII, XGMII etc). For example, in 802.3, the
   stack for 1000BASE-X describes the MAC-to-PCS link as GMII, but we
   have omitted that detail, and we just call it
   PHY_INTERFACE_MODE_1000BASEX.

   PHY_INTERFACE_MODE_XLGMII has as much meaning whether the PCS uses
   media-facing inband AN as PHY_INTERFACE_MODE_GMII would do if it
   were to be used with a PCS operating in SGMII or 1000BASE-X mode.

   If XLGMII is used with a PCS that supports clause 73 AN, then there
   is obviously AN involved. However, XLGMII can also be used with a
   PCS/PMA/PMD that ends up producing 40GBASE-R which has no inband AN.

   So, I'm not sure what the behaviour of phylink_pcs_neg_mode() should
   be when faced with PHY_INTERFACE_MODE_XLGMII - does this need yet
   another state?

3. Should the pcs_link_up() method also be passed this "neg_mode" state
   and should the existing "mode" argument be dropped?

Finally, there's the obvious question whether this approach is even a
good idea.

Please let me have some thoughts.

8<===
From: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Subject: [PATCH] net: phylink: add phylink_pcs_neg_mode()

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 include/linux/phylink.h | 72 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

Comments

Andrew Lunn May 15, 2023, 1:58 p.m. UTC | #1
> 2. XLGMII.. Looking at the XPCS driver, it's unclear whether Clause 73
>    AN gets used for this. A quick scan of IEEE 802.3 suggests that
>    XLGMII doesn't have any support for any inband signalling, and it's
>    just an intermediary protocol between the MAC (more specifically the
>    RS, but for the purposes of this I'll just refer to MAC) and the
>    attached PCS, and any autonegotiation happens after the XLGMII link.

So isn't XLGMII then a generic PHY thing, not a phylink thing?

Or am i not correctly understanding how
drivers/phy/marvell/phy-*-comphy.c and
drivers/phy/microchip/*_serdev.c fit into the overall picture?

	Andrew
Russell King (Oracle) May 15, 2023, 3:34 p.m. UTC | #2
On Mon, May 15, 2023 at 03:58:09PM +0200, Andrew Lunn wrote:
> > 2. XLGMII.. Looking at the XPCS driver, it's unclear whether Clause 73
> >    AN gets used for this. A quick scan of IEEE 802.3 suggests that
> >    XLGMII doesn't have any support for any inband signalling, and it's
> >    just an intermediary protocol between the MAC (more specifically the
> >    RS, but for the purposes of this I'll just refer to MAC) and the
> >    attached PCS, and any autonegotiation happens after the XLGMII link.
> 
> So isn't XLGMII then a generic PHY thing, not a phylink thing?

Honestly, I'm really not sure. It feels more like an internal-SoC
thing. See my last diagram why I think this.

> Or am i not correctly understanding how
> drivers/phy/marvell/phy-*-comphy.c and
> drivers/phy/microchip/*_serdev.c fit into the overall picture?

In the case of serdes-based protocols with an external PHY, the general
structure we have is:

------------------------------+
  SoC                         |
+-----+   +-----+  +--------+ |          +--------------+
| MAC +---+ PCS +--+ SERDES +------------+ Ethernet PHY +---- Media
+-----+   +-----+  +--------+ |    ^     +--------------+
------------------------------+    |
                                   |
PHY_INTERFACE_MODE_xxx has referred to this bit, and this is the bit
where inband can occur.

In the case of multi-rate implementations, there can be one of many
different PCS that can be placed there, and the SERDES handles
converting the data stream from the PCS into its appropriate
electrical form, essentially covering the PMA and PMD functions.
(That same SERDES block generally can also handle PCIe and SATA.)

In the case of non-serdes based protocols, then it's essentially the
same as the above, but with the PCS and SERDES blocks removed.

For Fibre based connections:

------------------------------+
  SoC                         |
+-----+   +-----+  +--------+ |
| MAC +---+ PCS +--+ SERDES +------------ Media
+-----+   +-----+  +--------+ |    ^
------------------------------+    |
                                   |
PHY_INTERFACE_MODE_xxx has referred to this bit, and 1000BASE-X runs
negotiation on this. 10GBASE-R also used for fibre has no negotiation,
but there's still the 10GBASE-R PCS and the 802.3 PMA/PMD are subsumed
by the SERDES block.

What I think seems to be the case with XPCS is:

------------------------------+
  SoC                         |
+-----+   +-----+  +--------+ |
| MAC +---+ PCS +--+ SERDES +----------- Media
+-----+ ^ +-----+  +--------+ |
--------|---------------------+
        |
PHY_INTERFACE_MODE_xxx seems to be referring to this bit. When clause 73
negotiation is used, that happens where I've stated "media" above, and
that's involved with negotiating backplane protocols e.g. 40GBASE-KR4,
40GBASE-CR4, 25GBASE-KR, 10GBASE-KR, 1000BASE-KX etc on the bit I've
called "Media" above.

However, we can also have this for a fibre link:

------------------------------+
  SoC                         |
+-----+   +-----+  +--------+ |
| MAC +---+ PCS +--+ SERDES +----------- Fibre
+-----+ ^ +-----+  +--------+ |    ^
--------|---------------------+    |
        |                          |
     XLGMII                    40GBASE-R

Given that in this case, we'd want PHY_INTERFACE_MODE_xxx to say
40GBASE-R, using the existing PHY_INTERFACE_MODE_xxx to specify at
where I've pointed XLGMII just makes things confused... but in the
case above with clause 73 negotiation, we wouldn't have a standard
PHY_INTERFACE_MODE_xxx specifier for the external "media" side
because that's dependent on the result of the negotiation.

So... this seems to be a right can of wriggly things.
Russell King (Oracle) May 15, 2023, 3:44 p.m. UTC | #3
On Mon, May 15, 2023 at 04:34:33PM +0100, Russell King (Oracle) wrote:
> On Mon, May 15, 2023 at 03:58:09PM +0200, Andrew Lunn wrote:
> > > 2. XLGMII.. Looking at the XPCS driver, it's unclear whether Clause 73
> > >    AN gets used for this. A quick scan of IEEE 802.3 suggests that
> > >    XLGMII doesn't have any support for any inband signalling, and it's
> > >    just an intermediary protocol between the MAC (more specifically the
> > >    RS, but for the purposes of this I'll just refer to MAC) and the
> > >    attached PCS, and any autonegotiation happens after the XLGMII link.
> > 
> > So isn't XLGMII then a generic PHY thing, not a phylink thing?
> 
> Honestly, I'm really not sure. It feels more like an internal-SoC
> thing. See my last diagram why I think this.
> 
> > Or am i not correctly understanding how
> > drivers/phy/marvell/phy-*-comphy.c and
> > drivers/phy/microchip/*_serdev.c fit into the overall picture?
> 
> In the case of serdes-based protocols with an external PHY, the general
> structure we have is:
> 
> ------------------------------+
>   SoC                         |
> +-----+   +-----+  +--------+ |          +--------------+
> | MAC +---+ PCS +--+ SERDES +------------+ Ethernet PHY +---- Media
> +-----+   +-----+  +--------+ |    ^     +--------------+
> ------------------------------+    |
>                                    |
> PHY_INTERFACE_MODE_xxx has referred to this bit, and this is the bit
> where inband can occur.
> 
> In the case of multi-rate implementations, there can be one of many
> different PCS that can be placed there, and the SERDES handles
> converting the data stream from the PCS into its appropriate
> electrical form, essentially covering the PMA and PMD functions.
> (That same SERDES block generally can also handle PCIe and SATA.)
> 
> In the case of non-serdes based protocols, then it's essentially the
> same as the above, but with the PCS and SERDES blocks removed.
> 
> For Fibre based connections:
> 
> ------------------------------+
>   SoC                         |
> +-----+   +-----+  +--------+ |
> | MAC +---+ PCS +--+ SERDES +------------ Media
> +-----+   +-----+  +--------+ |    ^
> ------------------------------+    |
>                                    |
> PHY_INTERFACE_MODE_xxx has referred to this bit, and 1000BASE-X runs
> negotiation on this. 10GBASE-R also used for fibre has no negotiation,
> but there's still the 10GBASE-R PCS and the 802.3 PMA/PMD are subsumed
> by the SERDES block.
> 
> What I think seems to be the case with XPCS is:
> 
> ------------------------------+
>   SoC                         |
> +-----+   +-----+  +--------+ |
> | MAC +---+ PCS +--+ SERDES +----------- Media
> +-----+ ^ +-----+  +--------+ |
> --------|---------------------+
>         |
> PHY_INTERFACE_MODE_xxx seems to be referring to this bit. When clause 73
> negotiation is used, that happens where I've stated "media" above, and
> that's involved with negotiating backplane protocols e.g. 40GBASE-KR4,
> 40GBASE-CR4, 25GBASE-KR, 10GBASE-KR, 1000BASE-KX etc on the bit I've
> called "Media" above.
> 
> However, we can also have this for a fibre link:
> 
> ------------------------------+
>   SoC                         |
> +-----+   +-----+  +--------+ |
> | MAC +---+ PCS +--+ SERDES +----------- Fibre
> +-----+ ^ +-----+  +--------+ |    ^
> --------|---------------------+    |
>         |                          |
>      XLGMII                    40GBASE-R
> 
> Given that in this case, we'd want PHY_INTERFACE_MODE_xxx to say
> 40GBASE-R, using the existing PHY_INTERFACE_MODE_xxx to specify at
> where I've pointed XLGMII just makes things confused... but in the
> case above with clause 73 negotiation, we wouldn't have a standard
> PHY_INTERFACE_MODE_xxx specifier for the external "media" side
> because that's dependent on the result of the negotiation.
> 
> So... this seems to be a right can of wriggly things.

I should point out that in many of these cases, this may also be true:

--------+   +---------------------+
  SoC   |   |        PHY          |
+-----+ |   | +-----+  +--------+ |
| MAC +-------+ PCS +--+ SERDES +------------ Media
+-----+ | ^ | +-----+  +--------+ |    ^
--------+ | +---------------------+    |
          |                            |
        GMII                       1000BASE-X
        XGMII                       10GBASE-R
       XLGMII                       40GBASE-R

One would expect PHY_INTERFACE_MODE_xxx to refer to the first in this
case. If we decide that is the right model, then we shouldn't have
added PHY_INTERFACE_MODE_1000BASEX!

Then we get to the fun that is SGMII, which if we think the above is
the right model, we get GMII, a PHY that converts GMII to SGMII, and
then another PHY that converts SGMII to whatever the media wants.

In all of these cases, the interface mode on the left has no bearing
on whether the PHY block in the above (whether its separate or part
of the SoC) performs any kind of negotiation.
Vladimir Oltean May 15, 2023, 7:56 p.m. UTC | #4
Hello,

On Mon, May 15, 2023 at 01:22:50PM +0100, Russell King (Oracle) wrote:
> 1. Should 10GBASE-KR be included in the SGMII et.al. case in the code?
>    Any other interface modes that should be there? Obviously,
>    PHYLINK_PCS_NEG_NONE is not correct for 10GBASE-KR since it does use
>    inband AN. Does it make sense for the user to disable inband AN for
>    10GBASE-KR? If so, maybe it should be under the 1000base-X case.

What physical process (reference to IEEE 802.3 clause is fine) would be
controlled by the phylink_pcs_neg_mode() function for the so-called
10GBASE-KR phy-mode?
Russell King (Oracle) May 15, 2023, 9:45 p.m. UTC | #5
On Mon, May 15, 2023 at 10:56:16PM +0300, Vladimir Oltean wrote:
> Hello,
> 
> On Mon, May 15, 2023 at 01:22:50PM +0100, Russell King (Oracle) wrote:
> > 1. Should 10GBASE-KR be included in the SGMII et.al. case in the code?
> >    Any other interface modes that should be there? Obviously,
> >    PHYLINK_PCS_NEG_NONE is not correct for 10GBASE-KR since it does use
> >    inband AN. Does it make sense for the user to disable inband AN for
> >    10GBASE-KR? If so, maybe it should be under the 1000base-X case.
> 
> What physical process (reference to IEEE 802.3 clause is fine) would be
> controlled by the phylink_pcs_neg_mode() function for the so-called
> 10GBASE-KR phy-mode?

Clause 73.1:

"While implementation of Auto-Negotiation is mandatory for Backplane
Ethernet PHY s, the use of Auto-Negotiation is optional."

"The Auto-Negotiation function allows the devices to switch between
the various operational modes in an orderly fashion, permits
management to disable or enable the Auto-Negotiation function, and
allows management to select a specific operational mode."

"The Auto-Negotiation function also provides a parallel detection
function to allow Backplane Ethernet devices to connect to other
Backplane Ethernet devices that have Auto-Negotiation disabled and
interoperate with legacy devices that do not support Clause 73
Auto-Negotiation."

So, my reading of these statements is that the _user_ should be
able to control via ethtool whether Clause 73 negotiation is
performed on a 10GBASE-KR (or any other backplane link that
uses clause 73 negotiation.) Having extracted that from 802.3,
I now believe it should be treated the same as 1000BASE-X, and
the Autoneg bit in ethtool should determine whether Clause 73
negotiation is used for 10GBASE-KR (and any other Clause 73
using protocol.)
Vladimir Oltean May 15, 2023, 10:08 p.m. UTC | #6
On Mon, May 15, 2023 at 10:45:21PM +0100, Russell King (Oracle) wrote:
> Clause 73.1:
> 
> So, my reading of these statements is that the _user_ should be
> able to control via ethtool whether Clause 73 negotiation is
> performed on a 10GBASE-KR (or any other backplane link that
> uses clause 73 negotiation.) Having extracted that from 802.3,
> I now believe it should be treated the same as 1000BASE-X, and
> the Autoneg bit in ethtool should determine whether Clause 73
> negotiation is used for 10GBASE-KR (and any other Clause 73
> using protocol.)

Having said that copper backplane link modes should be treated "the
same" as fiber link modes w.r.t. ethtool -s autoneg, it should also be
said that there are significant differences between clause 37 and 73
autoneg too.

Clause 73 negotiates the actual use of 10GBase-KR as a SERDES protocol
through the copper backplane in favor of other "Base-K*" alternative
link modes, so it's not quite proper to say that 10GBase-KR is a clause
73 using protocol.

To me, the goals of clause 73 autoneg are much more similar to those of
the twisted pair autoneg process - clause 28, which similarly selects
between different media side protocols in the PHY, using a priority
resolution function. For those, we use phylib and the phy_device
structure. What are the merits of using phylink_pcs for copper backplanes
and not phylib?
Russell King (Oracle) May 15, 2023, 11:36 p.m. UTC | #7
On Tue, May 16, 2023 at 01:08:33AM +0300, Vladimir Oltean wrote:
> On Mon, May 15, 2023 at 10:45:21PM +0100, Russell King (Oracle) wrote:
> > Clause 73.1:
> > 
> > So, my reading of these statements is that the _user_ should be
> > able to control via ethtool whether Clause 73 negotiation is
> > performed on a 10GBASE-KR (or any other backplane link that
> > uses clause 73 negotiation.) Having extracted that from 802.3,
> > I now believe it should be treated the same as 1000BASE-X, and
> > the Autoneg bit in ethtool should determine whether Clause 73
> > negotiation is used for 10GBASE-KR (and any other Clause 73
> > using protocol.)
> 
> Having said that copper backplane link modes should be treated "the
> same" as fiber link modes w.r.t. ethtool -s autoneg, it should also be
> said that there are significant differences between clause 37 and 73
> autoneg too.
> 
> Clause 73 negotiates the actual use of 10GBase-KR as a SERDES protocol
> through the copper backplane in favor of other "Base-K*" alternative
> link modes, so it's not quite proper to say that 10GBase-KR is a clause
> 73 using protocol.

I believe it is correct to state it as such, because:

72.1: Table 72–1—Physical Layer clauses associated with the 10GBASE-KR PMD

	73—Auto-Negotiation for Backplane Ethernet              Required

Essentially, 802.3 doesn't permit 10GBASE-KR without hardware support
for Clause 73 AN (but that AN doesn't have to be enabled by management
software.)

> To me, the goals of clause 73 autoneg are much more similar to those of
> the twisted pair autoneg process - clause 28, which similarly selects
> between different media side protocols in the PHY, using a priority
> resolution function. For those, we use phylib and the phy_device
> structure. What are the merits of using phylink_pcs for copper backplanes
> and not phylib?

I agree with you on that, because not only does that fit better with
our ethernet PHY model, but it also means PHY_INTERFACE_MODE_XLGMII
makes sense.

However, by that same token, 1000BASE-X should never have been a
PHY_INTERFACE_MODE_xxx, and this should also have been treated purely
as a PHY.

Taking that still further, this means SGMII, which is 1000BASE-X but
modified for Cisco's purposes, would effectively be a media converting
PHY sat between the MAC/RS and the "real" ethernet PHY. In this case,
PHY_INTERFACE_MODE_SGMII might make sense because the "real" ethernet
PHY needs to know that.

Then there's 1000BASE-X used to connect a "real" ethernet PHY to the
MAC/RS, which means 1000BASE-X can't really be any different from
SGMII.

This all makes the whole thing extremely muddy, but this deviates away
from the original topic, because we're now into a "what should we call
a PCS" vs "what should we call a PHY" discussion. Then we'll get into
a discussion about phylib, difficulties with net_device only being
able to have one phylib device, stacked PHYs, and phylib not being
able to cope with non-MDIO based devices that we find on embedded
platforms (some which don't even offer anything that approximates the
802.3 register set, so could never be a phylib driver.)

It even impacts on the DT description, since what does "managed =
"in-band-status";" mean if we start considering 1000base-X the same
way as 1000base-T and the "PHY" protocol for 1000base-X becomes GMII.
A GMII link has no inband AN, so "managed = "in-band-status";" at
that point makes no sense.

That is definitely a can of worms I do *not* want to open with this
discussion - and much of the above has a long history and considerably
pre-dates phylink.

My original question was more around: how do we decide what we
currently have as a PCS should use inband negotiation.

For SGMII and close relatives, and 1000BASE-X it's been obvious. For
2500BASE-X less so (due to vendors coming up with it before its been
standardised.)

We have implementations using this for other protocols, so it's
a question that needs answering for these other protocols.


However, if we did want to extend this topic, then there are a number
of questions that really need to be asked is about the XPCS driver.
Such as - what does 1000BASE-KX, 10000BASE-KX4, 10000BASE-KR and
2500BASE-X have to do with USXGMII, and why are there no copper
ethtool modes listed when a USXGMII link can definitely support
being connected to a copper PHY? (See xpcs_usxgmii_features[]).

Why does XPCS think that USXGMII uses Clause 73 AN? (see the first
entry in synopsys_xpcs_compat[].)

xpcs_sgmii_features[] only mentions copper linkmodes, but as we know,
it's common for copper PHYs to also support fibre with an auto-
selection mechanism. So, 1000BASE-X is definitely possible if SGMII
is supported, so why isn't it listed.

As previously said, 1000BASE-X can be connected to a PHY that does
1000BASE-T, so why does xpcs_1000basex_features[] not mention
1000baseT_Full... there's probably more here as well.

Interestingly, xpcs_2500basex_features[] _does_ mention both
2500BASE-X and 2500BASE-T, but I think that only does because I
happened to comment on it during a review.

I think xpcs is another can of worms, but is an easier can of worms
to solve than trying to sort out that "what's an ethernet PHY"
question we seem to be heading towards (which I think would be a
mammoth task, even back when phylink didn't exist, to sort out.)
Vladimir Oltean May 16, 2023, 9 a.m. UTC | #8
On Tue, May 16, 2023 at 12:36:33AM +0100, Russell King (Oracle) wrote:
> > Clause 73 negotiates the actual use of 10GBase-KR as a SERDES protocol
> > through the copper backplane in favor of other "Base-K*" alternative
> > link modes, so it's not quite proper to say that 10GBase-KR is a clause
> > 73 using protocol.
> 
> I believe it is correct to state it as such, because:
> 
> 72.1: Table 72–1—Physical Layer clauses associated with the 10GBASE-KR PMD
> 
> 	73—Auto-Negotiation for Backplane Ethernet              Required
> 
> Essentially, 802.3 doesn't permit 10GBASE-KR without hardware support
> for Clause 73 AN (but that AN doesn't have to be enabled by management
> software.)

Just like clause 40 (PCS and PMA for 1000BASE-T) requires clause 28 AN
to be supported. But, when the autoneg process begins, the use of
10GBase-KR as a protocol over the backplane link hasn't even been yet
established, so I find it unnatural to speak of clause 73 autoneg as
something that 10GBase-KR has.

The reason why I'm insisting on this is because to me, to treat clause
73 as an in-band autoneg process of 10GBase-KR sounds like a reversal of
causality. The clause 73 link codeword has a Technology Ability field
through which 10GBase-KR, 1GBase-KX etc are advertised as supported
protocols. If C73 is an inband protocol of 10GBase-KR, what should the
local PCS advertise for its Technology Ability? Only 10GBase-KR, because
this is what is implied by treating it as an attribute of 10GBase-KR, no?
But that would be a denatured way of negotiating - advertise a single
link mode, take it or leave it. And what other inband autoneg protocols
permit, say, starting from SGMII and ending in 1000Base-X? Clause 73
can't be directly compared to what we currently mean by managed =
"in-band-status".

Not only is C37 autoneg not directly comparable to C73, but they are not
mutually exclusive, either. I would say they are more or less orthogonal.
More below.

I don't believe that toggling clause 73 autoneg based on phylink_pcs_neg_mode()
makes much sense.

> > To me, the goals of clause 73 autoneg are much more similar to those of
> > the twisted pair autoneg process - clause 28, which similarly selects
> > between different media side protocols in the PHY, using a priority
> > resolution function. For those, we use phylib and the phy_device
> > structure. What are the merits of using phylink_pcs for copper backplanes
> > and not phylib?
> 
> I agree with you on that, because not only does that fit better with
> our ethernet PHY model, but it also means PHY_INTERFACE_MODE_XLGMII
> makes sense.
> 
> However, by that same token, 1000BASE-X should never have been a
> PHY_INTERFACE_MODE_xxx, and this should also have been treated purely
> as a PHY.
> 
> Taking that still further, this means SGMII, which is 1000BASE-X but
> modified for Cisco's purposes, would effectively be a media converting
> PHY sat between the MAC/RS and the "real" ethernet PHY. In this case,
> PHY_INTERFACE_MODE_SGMII might make sense because the "real" ethernet
> PHY needs to know that.
> 
> Then there's 1000BASE-X used to connect a "real" ethernet PHY to the
> MAC/RS, which means 1000BASE-X can't really be any different from
> SGMII.
> 
> This all makes the whole thing extremely muddy, but this deviates away
> from the original topic, because we're now into a "what should we call
> a PCS" vs "what should we call a PHY" discussion. Then we'll get into
> a discussion about phylib, difficulties with net_device only being
> able to have one phylib device, stacked PHYs, and phylib not being
> able to cope with non-MDIO based devices that we find on embedded
> platforms (some which don't even offer anything that approximates the
> 802.3 register set, so could never be a phylib driver.)
> 
> It even impacts on the DT description, since what does "managed =
> "in-band-status";" mean if we start considering 1000base-X the same
> way as 1000base-T and the "PHY" protocol for 1000base-X becomes GMII.
> A GMII link has no inband AN, so "managed = "in-band-status";" at
> that point makes no sense.
> 
> That is definitely a can of worms I do *not* want to open with this
> discussion - and much of the above has a long history and considerably
> pre-dates phylink.

I don't have much of a problem accepting that not everything we have
in-tree is consistent/correct. But if you think it's too big of a can of
worms to open, okay.

> 
> My original question was more around: how do we decide what we
> currently have as a PCS should use inband negotiation.
> 
> For SGMII and close relatives, and 1000BASE-X it's been obvious. For
> 2500BASE-X less so (due to vendors coming up with it before its been
> standardised.)
> 
> We have implementations using this for other protocols, so it's
> a question that needs answering for these other protocols.
> 
> 
> However, if we did want to extend this topic, then there are a number
> of questions that really need to be asked is about the XPCS driver.
> Such as - what does 1000BASE-KX, 10000BASE-KX4, 10000BASE-KR and
> 2500BASE-X have to do with USXGMII, and why are there no copper
> ethtool modes listed when a USXGMII link can definitely support
> being connected to a copper PHY? (See xpcs_usxgmii_features[]).
> 
> Why does XPCS think that USXGMII uses Clause 73 AN? (see the first
> entry in synopsys_xpcs_compat[].)

First, in principle USXGMII and clause 73 are not mutually exclusive.

It is possible to use clause 73 to advertise 10GBase-KR as a link mode,
and that will give you link training for proper 3-tap electrical
equalization over the copper backplane.

Then, once C73 AN/LT ended and 10GBase-KR has been established, is
possible to configure the 10GBase-R PCS to enable C37 USXGMII to select
the actual data rate via symbol replication, while the SERDES lane
remains at 10GBaud. At least, the XPCS seems to permit enabling symbol
replication in conjunction with 10GBase-KR.

Cross-checking what the XPCS driver does with the XPCS databook, I think
it's this:

It doesn't use C37 autoneg, but it still uses the USXGMII replicator.
It uses C73 (standard or perhaps a vendor-specific form) to advertise
the 1000baseKX_Full, 2500baseX_Full, 10000baseKX4_Full, 10000baseKR_Full
link modes. Each of these corresponds to a bit in the SR_AN_ADV3
register.

Note that 2500baseX_Full doesn't have a "K", and for good reason -
because 802.3 Table 73–4—"Technology Ability Field encoding" doesn't
specify 2500Base-KX as a copper backplane mode that exists. But XPCS
does! To quote the databook:

| When DATA[32] is set to 1, it indicates that the local device
| supports 2.5GBASE-KX speed mode.

And since it advertises this through a code word visible to the link
partner, I would suggest this means that its C73 hardware implementation
is non-standard.

Then, there's the entire issue of what the software does.

When 1GBase-KX is established through C73 autoneg, the XPCS driver
will leave the lane in 10GBase-R mode, and enable 1:10 symbol replication
in the USXGMII block. Sure, this is a 1Gbit data rate, but it uses the
64b/66b encoding (BASE-R), and not the advertised 8b/10b encoding
(BASE-X)! So it will likely only work between 2 XPCS devices!!!

Then, there's the entire issue that the code, as it was originally
introduced, is not the same as it is now. For example, this bit in
xpcs_do_config():

	switch (compat->an_mode) {
	case DW_AN_C73:
		if (phylink_autoneg_inband(mode)) {
			ret = xpcs_config_aneg_c73(xpcs, compat);
			if (ret)
				return ret;
		}
		break;

used to look at state->an_enabled rather than phylink_autoneg_inband().
Through my idiocy, I inadvertently converted that in commit 11059740e616
("net: pcs: xpcs: convert to phylink_pcs_ops").

> 
> xpcs_sgmii_features[] only mentions copper linkmodes, but as we know,
> it's common for copper PHYs to also support fibre with an auto-
> selection mechanism. So, 1000BASE-X is definitely possible if SGMII
> is supported, so why isn't it listed.

Most likely explanation is that XPCS has never been paired up until now
to such a PHY.

> As previously said, 1000BASE-X can be connected to a PHY that does
> 1000BASE-T, so why does xpcs_1000basex_features[] not mention
> 1000baseT_Full... there's probably more here as well.
> 
> Interestingly, xpcs_2500basex_features[] _does_ mention both
> 2500BASE-X and 2500BASE-T, but I think that only does because I
> happened to comment on it during a review.
> 
> I think xpcs is another can of worms, but is an easier can of worms
> to solve than trying to sort out that "what's an ethernet PHY"
> question we seem to be heading towards (which I think would be a
> mammoth task, even back when phylink didn't exist, to sort out.)

I wasn't necessarily going to go all the way into "what's a PHY?".
I just want to clarify some terms such that we can agree what is correct
and what is not. I believe that much of what's currently in XPCS w.r.t.
C73 is not correct, partly through initial intention and partly through
blind conversions such as mine.
Russell King (Oracle) May 16, 2023, 11:49 a.m. UTC | #9
On Tue, May 16, 2023 at 12:00:09PM +0300, Vladimir Oltean wrote:
> On Tue, May 16, 2023 at 12:36:33AM +0100, Russell King (Oracle) wrote:
> > > Clause 73 negotiates the actual use of 10GBase-KR as a SERDES protocol
> > > through the copper backplane in favor of other "Base-K*" alternative
> > > link modes, so it's not quite proper to say that 10GBase-KR is a clause
> > > 73 using protocol.
> > 
> > I believe it is correct to state it as such, because:
> > 
> > 72.1: Table 72–1—Physical Layer clauses associated with the 10GBASE-KR PMD
> > 
> > 	73—Auto-Negotiation for Backplane Ethernet              Required
> > 
> > Essentially, 802.3 doesn't permit 10GBASE-KR without hardware support
> > for Clause 73 AN (but that AN doesn't have to be enabled by management
> > software.)
> 
> Just like clause 40 (PCS and PMA for 1000BASE-T) requires clause 28 AN
> to be supported. But, when the autoneg process begins, the use of
> 10GBase-KR as a protocol over the backplane link hasn't even been yet
> established, so I find it unnatural to speak of clause 73 autoneg as
> something that 10GBase-KR has.
> 
> The reason why I'm insisting on this is because to me, to treat clause
> 73 as an in-band autoneg process of 10GBase-KR sounds like a reversal of
> causality. The clause 73 link codeword has a Technology Ability field
> through which 10GBase-KR, 1GBase-KX etc are advertised as supported
> protocols. If C73 is an inband protocol of 10GBase-KR, what should the
> local PCS advertise for its Technology Ability? Only 10GBase-KR, because
> this is what is implied by treating it as an attribute of 10GBase-KR, no?

I'm not going to get hung up on this, because I don't regard the point
as being particularly important to this discussion. You are right that
Clause 73 AN selects whether e.g. 10GBASE-KR gets used, and I don't
disagree with that. We're just entering into what seems like a pointless
debate that eats up email bandwidth.

> But that would be a denatured way of negotiating - advertise a single
> link mode, take it or leave it. And what other inband autoneg protocols
> permit, say, starting from SGMII and ending in 1000Base-X? Clause 73
> can't be directly compared to what we currently mean by managed =
> "in-band-status".
> 
> Not only is C37 autoneg not directly comparable to C73, but they are not
> mutually exclusive, either. I would say they are more or less orthogonal.
> More below.
> 
> I don't believe that toggling clause 73 autoneg based on phylink_pcs_neg_mode()
> makes much sense.

I agree, which means phylink_pcs_neg_mode() needs to document this so
that people don't think it does - and that solves both of the issues
I was bringing up in my original email.

> > However, if we did want to extend this topic, then there are a number
> > of questions that really need to be asked is about the XPCS driver.
> > Such as - what does 1000BASE-KX, 10000BASE-KX4, 10000BASE-KR and
> > 2500BASE-X have to do with USXGMII, and why are there no copper
> > ethtool modes listed when a USXGMII link can definitely support
> > being connected to a copper PHY? (See xpcs_usxgmii_features[]).
> > 
> > Why does XPCS think that USXGMII uses Clause 73 AN? (see the first
> > entry in synopsys_xpcs_compat[].)
> 
> First, in principle USXGMII and clause 73 are not mutually exclusive.
> 
> It is possible to use clause 73 to advertise 10GBase-KR as a link mode,
> and that will give you link training for proper 3-tap electrical
> equalization over the copper backplane.
> 
> Then, once C73 AN/LT ended and 10GBase-KR has been established, is
> possible to configure the 10GBase-R PCS to enable C37 USXGMII to select
> the actual data rate via symbol replication, while the SERDES lane
> remains at 10GBaud. At least, the XPCS seems to permit enabling symbol
> replication in conjunction with 10GBase-KR.

My comments are against the driver as it stands today, not some
theoretical case that the hardware may support.

What I'm getting at is if the interface mode is
PHY_INTERFACE_MODE_USXGMII, then... okay... we _may_ wish to do
clause 73 negotiation advertising 10GBASE-KR and then do clause 73
for the USXGMII control word - but the driver doesn't do this as far
as I can see. If C73 AN is being used, it merely reads the C73
state and returns the resolution from that. Any speed information that
a USXGMII PHY passes back over the C37 inband signalling would be
ignored because there seems to be no provision for the USXGMII
inband signalling.

So I'm confused what the xpcs driver _actually_ does when USXGMII
mode is selected by PHY_INTERFACE_MODE_USXGMII, because looking at
the driver, it doesn't look like it's USXGMII at all.

> Then, there's the entire issue that the code, as it was originally
> introduced, is not the same as it is now. For example, this bit in
> xpcs_do_config():
> 
> 	switch (compat->an_mode) {
> 	case DW_AN_C73:
> 		if (phylink_autoneg_inband(mode)) {
> 			ret = xpcs_config_aneg_c73(xpcs, compat);
> 			if (ret)
> 				return ret;
> 		}
> 		break;
> 
> used to look at state->an_enabled rather than phylink_autoneg_inband().
> Through my idiocy, I inadvertently converted that in commit 11059740e616
> ("net: pcs: xpcs: convert to phylink_pcs_ops").

If we want to change that back to the old behaviour, that needs to
be:
		if (test_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, advertising)) {
			...
		}
		break;

but that wouldn't ever have been sufficient, even when the code was
using an_enabled, because both of these reflect the user configuration.
(an_enabled was just a proxy for this Autoneg bit). I'm going to call
both of these an "AN indicator" in the question below.

Isn't it rather perverse that the driver configures AN if this AN
indicator is set, but then does nothing if it isn't?

> > xpcs_sgmii_features[] only mentions copper linkmodes, but as we know,
> > it's common for copper PHYs to also support fibre with an auto-
> > selection mechanism. So, 1000BASE-X is definitely possible if SGMII
> > is supported, so why isn't it listed.
> 
> Most likely explanation is that XPCS has never been paired up until now
> to such a PHY.

So it's probably safe to add ETHTOOL_LINK_MODE_1000baseX_Full_BIT
there - thanks.

> > As previously said, 1000BASE-X can be connected to a PHY that does
> > 1000BASE-T, so why does xpcs_1000basex_features[] not mention
> > 1000baseT_Full... there's probably more here as well.
> > 
> > Interestingly, xpcs_2500basex_features[] _does_ mention both
> > 2500BASE-X and 2500BASE-T, but I think that only does because I
> > happened to comment on it during a review.
> > 
> > I think xpcs is another can of worms, but is an easier can of worms
> > to solve than trying to sort out that "what's an ethernet PHY"
> > question we seem to be heading towards (which I think would be a
> > mammoth task, even back when phylink didn't exist, to sort out.)
> 
> I wasn't necessarily going to go all the way into "what's a PHY?".
> I just want to clarify some terms such that we can agree what is correct
> and what is not. I believe that much of what's currently in XPCS w.r.t.
> C73 is not correct, partly through initial intention and partly through
> blind conversions such as mine.

Right, that's probably why I'm having a hard time interpreting what
this driver is doing when it comes to these modes that makes use
clause 73.

As this is the only phylink-using implementation that involves clause
73 at present, I would like to ensure that there's a clear resolution
of the expected behaviour before we get further implementations, and
preferably document what's expected.
Vladimir Oltean May 16, 2023, 2:15 p.m. UTC | #10
On Tue, May 16, 2023 at 12:49:44PM +0100, Russell King (Oracle) wrote:
> What I'm getting at is if the interface mode is
> PHY_INTERFACE_MODE_USXGMII, then... okay... we _may_ wish to do
> clause 73 negotiation advertising 10GBASE-KR and then do clause 73
                                                                  ~~
                                                                  37

> for the USXGMII control word - but the driver doesn't do this as far
> as I can see. If C73 AN is being used, it merely reads the C73
> state and returns the resolution from that. Any speed information that
> a USXGMII PHY passes back over the C37 inband signalling would be
> ignored because there seems to be no provision for the USXGMII
> inband signalling.
> 
> So I'm confused what the xpcs driver _actually_ does when USXGMII
> mode is selected by PHY_INTERFACE_MODE_USXGMII, because looking at
> the driver, it doesn't look like it's USXGMII at all.

If what you're looking for is strictly the USXGMII in-band autoneg code
word derived from C37, then the short answer is that you won't find it,
even when going back to the initial commit fcb26bd2b6ca ("net: phy: Add
Synopsys DesignWare XPCS MDIO module").

To the larger question 'what does XPCS actually do in phy-mode "usxgmii"',
I guess the simple answer looking at the aforementioned initial commit
is 'it violates the advertised coding scheme by using BASE-R instead of
BASE-X for speeds lower than 10G', and 'if you don't want the USXGMII
replicator just use phy-mode "10gbase-kr" which behaves basically the
same except it doesn't advertise the rate-adapted speeds'. Disclaimer:
I'm saying this based solely on the code and documentation.

> If we want to change that back to the old behaviour, that needs to
> be:
> 		if (test_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, advertising)) {
> 			...
> 		}
> 		break;

Ok. I should send a patch fixing this, since I introduced the behavior
change. I'll add your suggested tag.

> but that wouldn't ever have been sufficient, even when the code was
> using an_enabled, because both of these reflect the user configuration.
> (an_enabled was just a proxy for this Autoneg bit). I'm going to call
> both of these an "AN indicator" in the question below.
> 
> Isn't it rather perverse that the driver configures AN if this AN
> indicator is set, but then does nothing if it isn't?

Maybe. My copy of the databook is parameterized based on
instantiation-dependent variables, and I can't really tell what are the
out-of-reset register values for hardware I don't have, so it is hard
for me to infer what is the behavior when AN is not enabled.

> As this is the only phylink-using implementation that involves clause
> 73 at present, I would like to ensure that there's a clear resolution
> of the expected behaviour before we get further implementations, and
> preferably document what's expected.

+1
that's where I would like this to go, too.
diff mbox series

Patch

diff --git a/include/linux/phylink.h b/include/linux/phylink.h
index 8968094503d6..465248818276 100644
--- a/include/linux/phylink.h
+++ b/include/linux/phylink.h
@@ -21,6 +21,18 @@  enum {
 	MLO_AN_FIXED,	/* Fixed-link mode */
 	MLO_AN_INBAND,	/* In-band protocol */
 
+	/* PCS "negotiation" mode.
+	 *  PHYLINK_PCS_NEG_NONE - protocol has no inband capability
+	 *  PHYLINK_PCS_NEG_OUTBAND - some out of band or fixed link setting
+	 *  PHYLINK_PCS_NEG_INBAND_DISABLED - inband mode disabled, e.g.
+	 *				      1000base-X with autoneg off
+	 *  PHYLINK_PCS_NEG_INBAND_ENABLED - inband mode enabled
+	 */
+	PHYLINK_PCS_NEG_NONE = 0,
+	PHYLINK_PCS_NEG_OUTBAND,
+	PHYLINK_PCS_NEG_INBAND_DISABLED,
+	PHYLINK_PCS_NEG_INBAND_ENABLED,
+
 	/* MAC_SYM_PAUSE and MAC_ASYM_PAUSE are used when configuring our
 	 * autonegotiation advertisement. They correspond to the PAUSE and
 	 * ASM_DIR bits defined by 802.3, respectively.
@@ -79,6 +91,67 @@  static inline bool phylink_autoneg_inband(unsigned int mode)
 	return mode == MLO_AN_INBAND;
 }
 
+/**
+ * phylink_pcs_neg_mode() - helper to determine PCS inband mode
+ * @mode: one of %MLO_AN_FIXED, %MLO_AN_PHY, %MLO_AN_INBAND.
+ * @interface: interface mode to be used
+ * @advertising: adertisement ethtool link mode mask
+ *
+ * Determines the negotiation mode to be used by the PCS, and returns
+ * one of:
+ * %PHYLINK_PCS_NEG_NONE: interface mode does not support inband
+ * %PHYLINK_PCS_NEG_OUTBAND: an out of band mode (e.g. reading the PHY,
+ *   or fixed link) will be used.
+ * %PHYLINK_PCS_NEG_INBAND_DISABLED: inband mode selected but autoneg disabled
+ * %PHYLINK_PCS_NEG_INBAND_ENABLED: inband mode selected and autoneg enabled
+ */
+static inline unsigned int phylink_pcs_neg_mode(unsigned int mode,
+						phy_interface_t interface,
+						const unsigned long *advertising)
+{
+	unsigned int neg_mode;
+
+	switch (interface) {
+	case PHY_INTERFACE_MODE_SGMII:
+	case PHY_INTERFACE_MODE_QSGMII:
+	case PHY_INTERFACE_MODE_QUSGMII:
+	case PHY_INTERFACE_MODE_USXGMII:
+		/* These protocols are designed for use with a PHY which
+		 * communicates its negotiation result back to the MAC via
+		 * inband communication. Note: there exist PHYs that run
+		 * with SGMII but do not send the inband data.
+		 */
+		if (!phylink_autoneg_inband(mode))
+			neg_mode = PHYLINK_PCS_NEG_OUTBAND;
+		else
+			neg_mode = PHYLINK_PCS_NEG_INBAND_ENABLED;
+		break;
+
+	case PHY_INTERFACE_MODE_1000BASEX:
+	case PHY_INTERFACE_MODE_2500BASEX:
+		/* 1000base-X is designed for use media-side for Fibre
+		 * connections, and thus the Autoneg bit needs to be
+		 * taken into account. We also do this for 2500base-X
+		 * as well, but drivers may not support this, so may
+		 * need to override this.
+		 */
+		if (!phylink_autoneg_inband(mode))
+			neg_mode = PHYLINK_PCS_NEG_OUTBAND;
+		else if (linkmode_test_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
+					   advertising))
+			neg_mode = PHYLINK_PCS_NEG_INBAND_ENABLED;
+		else
+			neg_mode = PHYLINK_PCS_NEG_INBAND_DISABLED;
+		break;
+
+	default:
+		neg_mode = PHYLINK_PCS_NEG_NONE;
+		break;
+	}
+
+	return neg_mode;
+}
+
 /**
  * struct phylink_link_state - link state structure
  * @advertising: ethtool bitmask containing advertised link modes