[v2,net-next,6/6] docs: net: Add description of SyncE interfaces

Message ID 20211105205331.2024623-7-maciej.machnikowski@intel.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series Add RTNL interface for SyncE

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 2 maintainers not CCed: linux-doc@vger.kernel.org corbet@lwn.net
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Machnikowski, Maciej Nov. 5, 2021, 8:53 p.m. UTC
Add Documentation/networking/synce.rst describing new RTNL messages
and respective NDO ops supporting SyncE (Synchronous Ethernet).

Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
---
 Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 Documentation/networking/synce.rst

Comments

Ido Schimmel Nov. 7, 2021, 2:08 p.m. UTC | #1
On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
> Add Documentation/networking/synce.rst describing new RTNL messages
> and respective NDO ops supporting SyncE (Synchronous Ethernet).
> 
> Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
> ---
>  Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++
>  1 file changed, 117 insertions(+)
>  create mode 100644 Documentation/networking/synce.rst
> 
> diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst
> new file mode 100644
> index 000000000000..4ca41fb9a481
> --- /dev/null
> +++ b/Documentation/networking/synce.rst
> @@ -0,0 +1,117 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================
> +Synchronous Ethernet
> +====================
> +
> +Synchronous Ethernet networks use a physical layer clock to syntonize
> +the frequency across different network elements.
> +
> +A basic SyncE node, as defined in ITU-T G.8264, consists of an Ethernet
> +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks
> +and a dedicated TX clock input that is used to transmit data to other nodes.
> +
> +The SyncE capable PHY is able to recover the incoming frequency of the data
> +stream on RX lanes and redirect it (sometimes dividing it) to recovered
> +clock outputs. In SyncE PHY the TX frequency is directly dependent on the
> +input frequency - either on the PHY CLK input, or on a dedicated
> +TX clock input.
> +
> +      ┌───────────┬──────────┐
> +      │ RX        │ TX       │
> +  1   │ lanes     │ lanes    │ 1
> +  ───►├──────┐    │          ├─────►
> +  2   │      │    │          │ 2
> +  ───►├──┐   │    │          ├─────►
> +  3   │  │   │    │          │ 3
> +  ───►├─▼▼   ▼    │          ├─────►
> +      │ ──────    │          │
> +      │ \____/    │          │
> +      └──┼──┼─────┴──────────┘
> +        1│ 2│        ▲
> + RCLK out│  │        │ TX CLK in
> +         ▼  ▼        │
> +       ┌─────────────┴───┐
> +       │                 │
> +       │       EEC       │
> +       │                 │
> +       └─────────────────┘
> +
> +The EEC can synchronize its frequency to one of the synchronization inputs:
> +either clocks recovered on traffic interfaces or (in advanced deployments)
> +external frequency sources.
> +
> +Some EEC implementations can select synchronization source through
> +priority tables and synchronization status messaging and provide necessary
> +filtering and holdover capabilities.
> +
> +The following interface can be applicable to different packet network types
> +following ITU-T G.8261/G.8262 recommendations.
> +
> +Interface
> +=========
> +
> +The following RTNL messages are used to read/configure SyncE recovered
> +clocks.
> +
> +RTM_GETRCLKRANGE
> +-----------------
> +Reads the allowed pin index range for the recovered clock outputs.
> +This can be aligned to PHY outputs or to EEC inputs, whichever is
> +better for a given application.

Can you explain the difference between PHY outputs and EEC inputs? It is
not clear to me from the diagram.

How would the diagram look in a multi-port adapter where you have a
single EEC?

> +Will call the ndo_get_rclk_range function to read the allowed range
> +of output pin indexes.
> +Will call ndo_get_rclk_range to determine the allowed recovered clock
> +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
> +IFLA_RCLK_RANGE_MAX_PIN attributes

The first sentence seems to be redundant

> +
> +RTM_GETRCLKSTATE
> +-----------------
> +Read the state of recovered pins that output recovered clock from
> +a given port. The message will contain the number of assigned clocks
> +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in IFLA_RCLK_STATE_OUT_IDX
> +To support multiple recovered clock outputs from the same port, this message
> +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of
> +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes
> +listing the active output indexes.
> +This message will call the ndo_get_rclk_range to determine the allowed
> +recovered clock indexes and then will loop through them, calling
> +the ndo_get_rclk_state for each of them.

Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't
RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the
range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the
state (enabled / disabled) for all of them.
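
A minimal user-space sketch of that suggestion, assuming one record per pin.
The struct, the callback type and the function names below are hypothetical
and are not part of the posted series; the callback only stands in for
whatever the driver's ndo_get_rclk_state would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pin record; not part of the posted patch. */
struct rclk_pin_state {
	uint32_t idx;     /* pin index */
	bool     enabled; /* true if a recovered clock is redirected to it */
};

/* Hypothetical stand-in for a driver's per-pin state query. */
typedef bool (*rclk_state_fn)(uint32_t idx);

/* Demo callback: pretend only pin 1 is enabled. */
bool demo_only_pin1_enabled(uint32_t idx)
{
	return idx == 1;
}

/*
 * Report every pin in [min_pin, max_pin] together with its state,
 * instead of listing only the enabled pins. Returns the number of
 * entries written, or 0 if the range is invalid or does not fit.
 */
size_t rclk_report_all(uint32_t min_pin, uint32_t max_pin,
		       rclk_state_fn get_state,
		       struct rclk_pin_state *out, size_t out_len)
{
	size_t n = 0;
	uint32_t idx;

	assert(get_state && out);
	if (min_pin > max_pin || (size_t)(max_pin - min_pin) >= out_len)
		return 0;

	/* Loop written so that max_pin == UINT32_MAX cannot wrap around. */
	for (idx = min_pin; ; idx++) {
		out[n].idx = idx;
		out[n].enabled = get_state(idx);
		n++;
		if (idx == max_pin)
			break;
	}
	return n;
}
```

With a reply shaped like this, user space learns both the valid index range
and the current state from a single request, so a separate range message
carries no extra information.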

> +
> +RTM_SETRCLKSTATE
> +-----------------
> +Sets the redirection of the recovered clock for a given pin. This message
> +expects one attribute:
> +struct if_set_rclk_msg {
> +	__u32 ifindex; /* interface index */
> +	__u32 out_idx; /* output index (from a valid range) */
> +	__u32 flags; /* configuration flags */
> +};
> +
> +Supported flags are:
> +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> +		     if clear - the output will be disabled.

In the diagram you have two recovered clock outputs going into the EEC.
Which of them is the EEC synchronized to?

How does user space know which pins to enable?

> +
> +RTM_GETEECSTATE
> +----------------
> +Reads the state of the EEC or equivalent physical clock synchronizer.
> +This message returns the following attributes:
> +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
> +		 The states returned in this attribute are aligned to the
> +		 ITU-T G.781 and are:
> +		  IF_EEC_STATE_INVALID - state is not valid
> +		  IF_EEC_STATE_FREERUN - clock is free-running
> +		  IF_EEC_STATE_LOCKED - clock is locked to the reference,
> +		                        but the holdover memory is not valid
> +		  IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
> +		                               and holdover memory is valid
> +		  IF_EEC_STATE_HOLDOVER - clock is in holdover mode
> +State is read from the netdev calling the:
> +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
> +			 u32 *src_idx, struct netlink_ext_ack *extack);
> +
> +IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
> +		   is used for the current IFLA_EEC_STATE, i.e., the index of
> +		   the pin that the EEC is locked to.
> +
> +Will be returned only if the ndo_get_eec_src is implemented.
> \ No newline at end of file
> -- 
> 2.26.3
>
Machnikowski, Maciej Nov. 8, 2021, 8:35 a.m. UTC | #2
> -----Original Message-----
> From: Ido Schimmel <idosch@idosch.org>
> Sent: Sunday, November 7, 2021 3:09 PM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
> > +Interface
> > +=========
> > +
> > +The following RTNL messages are used to read/configure SyncE recovered
> > +clocks.
> > +
> > +RTM_GETRCLKRANGE
> > +-----------------
> > +Reads the allowed pin index range for the recovered clock outputs.
> > +This can be aligned to PHY outputs or to EEC inputs, whichever is
> > +better for a given application.
> 
> Can you explain the difference between PHY outputs and EEC inputs? It is
> no clear to me from the diagram.

The PHY is the source of frequency for the EEC: the PHY produces the reference
and the EEC synchronizes to it.

Both PHY outputs and EEC inputs are configurable. PHY outputs are usually
configured using PHY registers, and EEC inputs in the DPLL references
block.
 
> How would the diagram look in a multi-port adapter where you have a
> single EEC?

That depends. It can be a multiport PHY, in which case it will look
exactly like the one I drew. If we have multiple PHYs, their recovered
clock outputs will go to different recovered clock inputs, and each PHY's
TX clock input will be driven either from a different synchronized output
of the EEC or from a single one through a clock fan-out.

> > +Will call the ndo_get_rclk_range function to read the allowed range
> > +of output pin indexes.
> > +Will call ndo_get_rclk_range to determine the allowed recovered clock
> > +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
> > +IFLA_RCLK_RANGE_MAX_PIN attributes
> 
> The first sentence seems to be redundant
> 
> > +
> > +RTM_GETRCLKSTATE
> > +-----------------
> > +Read the state of recovered pins that output recovered clock from
> > +a given port. The message will contain the number of assigned clocks
> > +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in IFLA_RCLK_STATE_OUT_IDX
> > +To support multiple recovered clock outputs from the same port, this message
> > +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of
> > +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes
> > +listing the active output indexes.
> > +This message will call the ndo_get_rclk_range to determine the allowed
> > +recovered clock indexes and then will loop through them, calling
> > +the ndo_get_rclk_state for each of them.
> 
> Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't
> RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the
> range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the
> state (enabled / disable) for all

Great idea! Will implement it.
 
> > +
> > +RTM_SETRCLKSTATE
> > +-----------------
> > +Sets the redirection of the recovered clock for a given pin. This message
> > +expects one attribute:
> > +struct if_set_rclk_msg {
> > +	__u32 ifindex; /* interface index */
> > +	__u32 out_idx; /* output index (from a valid range)
> > +	__u32 flags; /* configuration flags */
> > +};
> > +
> > +Supported flags are:
> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> > +		     if clear - the output will be disabled.
> 
> In the diagram you have two recovered clock outputs going into the EEC.
> According to which the EEC is synchronized?

That will depend on the future DPLL configuration. For now it'll be based
on the DPLL's auto select ability and its default configuration.
 
> How does user space know which pins to enable?

That's why RTM_GETRCLKRANGE was invented, but I like the suggestion
you made above, so I will rework the code to remove the range message and
just return the indexes with an enable/disable bit for each of them. In this
case user space will just send RTM_GETRCLKSTATE to learn what
can be enabled.

> > +
> > +RTM_GETEECSTATE
> > +----------------
> > +Reads the state of the EEC or equivalent physical clock synchronizer.
> > +This message returns the following attributes:
> > +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
> > +		 The states returned in this attribute are aligned to the
> > +		 ITU-T G.781 and are:
> > +		  IF_EEC_STATE_INVALID - state is not valid
> > +		  IF_EEC_STATE_FREERUN - clock is free-running
> > +		  IF_EEC_STATE_LOCKED - clock is locked to the reference,
> > +		                        but the holdover memory is not valid
> > +		  IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
> > +		                               and holdover memory is valid
> > +		  IF_EEC_STATE_HOLDOVER - clock is in holdover mode
> > +State is read from the netdev calling the:
> > +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
> > +			 u32 *src_idx, struct netlink_ext_ack *extack);
> > +
> > +IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
> > +		   is used for the current IFLA_EEC_STATE, i.e., the index of
> > +		   the pin that the EEC is locked to.
> > +
> > +Will be returned only if the ndo_get_eec_src is implemented.
> > \ No newline at end of file
> > --
> > 2.26.3
> >
Ido Schimmel Nov. 8, 2021, 4:29 p.m. UTC | #3
On Mon, Nov 08, 2021 at 08:35:17AM +0000, Machnikowski, Maciej wrote:
> > -----Original Message-----
> > From: Ido Schimmel <idosch@idosch.org>
> > Sent: Sunday, November 7, 2021 3:09 PM
> > To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> > Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> > interfaces
> > 
> > On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
> > > +Interface
> > > +=========
> > > +
> > > +The following RTNL messages are used to read/configure SyncE recovered
> > > +clocks.
> > > +
> > > +RTM_GETRCLKRANGE
> > > +-----------------
> > > +Reads the allowed pin index range for the recovered clock outputs.
> > > +This can be aligned to PHY outputs or to EEC inputs, whichever is
> > > +better for a given application.
> > 
> > Can you explain the difference between PHY outputs and EEC inputs? It is
> > no clear to me from the diagram.
> 
> PHY is the source of frequency for the EEC, so PHY produces the reference
> And EEC synchronizes to it.
> 
> Both PHY outputs and EEC inputs are configurable. PHY outputs usually are
> configured using PHY registers, and EEC inputs in the DPLL references
> block
>  
> > How would the diagram look in a multi-port adapter where you have a
> > single EEC?
> 
> That depends. It can be either a multiport PHY - in this case it will look
> exactly like the one I drawn. In case we have multiple PHYs their recovered
> clock outputs will go to different recovered clock inputs and each PHY
> TX clock inputs will be driven from different EEC's synchronized outputs
> or from a single one through  clock fan out.
> 
> > > +Will call the ndo_get_rclk_range function to read the allowed range
> > > +of output pin indexes.
> > > +Will call ndo_get_rclk_range to determine the allowed recovered clock
> > > +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
> > > +IFLA_RCLK_RANGE_MAX_PIN attributes
> > 
> > The first sentence seems to be redundant
> > 
> > > +
> > > +RTM_GETRCLKSTATE
> > > +-----------------
> > > +Read the state of recovered pins that output recovered clock from
> > > +a given port. The message will contain the number of assigned clocks
> > > +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in IFLA_RCLK_STATE_OUT_IDX
> > > +To support multiple recovered clock outputs from the same port, this message
> > > +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of
> > > +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes
> > > +listing the active output indexes.
> > > +This message will call the ndo_get_rclk_range to determine the allowed
> > > +recovered clock indexes and then will loop through them, calling
> > > +the ndo_get_rclk_state for each of them.
> > 
> > Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't
> > RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the
> > > range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the
> > > state (enabled / disable) for all
> 
> Great idea! Will implement it.
>  
> > > +
> > > +RTM_SETRCLKSTATE
> > > +-----------------
> > > +Sets the redirection of the recovered clock for a given pin. This message
> > > +expects one attribute:
> > > +struct if_set_rclk_msg {
> > > +	__u32 ifindex; /* interface index */
> > > +	__u32 out_idx; /* output index (from a valid range)
> > > +	__u32 flags; /* configuration flags */
> > > +};
> > > +
> > > +Supported flags are:
> > > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> > > +		     if clear - the output will be disabled.
> > 
> > In the diagram you have two recovered clock outputs going into the EEC.
> > According to which the EEC is synchronized?
> 
> That will depend on the future DPLL configuration. For now it'll be based
> on the DPLL's auto select ability and its default configuration.
>  
> > How does user space know which pins to enable?
> 
> That's why the RTM_GETRCLKRANGE was invented but I like the suggestion
> you made above so will rework the code to remove the range one and
> just return the indexes with enable/disable bit for each of them. In this
> case youserspace will just send the RTM_GETRCLKSTATE to learn what
> can be enabled.

In the diagram there are multiple Rx lanes, all of which might be used
by the same port. How does user space know to differentiate between the
quality levels of the clock signal recovered from each lane / pin when
the information is transmitted on a per-port basis via ESMC messages?

The uAPI seems to be too low-level and is not compatible with Nvidia's
devices and potentially other vendors. We really just need a logical
interface that says "Synchronize the frequency of the EEC to the clock
recovered from port X". The kernel / drivers should abstract the inner
workings of the device from user space. Any reason this can't work for
ice?

I also want to re-iterate my dissatisfaction with the interface being
netdev-centric. By modelling the EEC as a standalone object we will be
able to extend it to set the source of the EEC to something other than a
netdev in the future. If we don't do it now, we will end up with two
ways to report the source of the EEC (i.e., EEC_SRC_PORT and something
else).

Other advantages of modelling the EEC as a separate object include the
ability for user space to determine the mapping between netdevs and EECs
(currently impossible) and reporting additional EEC attributes such as
SyncE clockIdentity and default SSM code. There is really no reason to
report all of this identical information via multiple netdevs.

With regards to rtnetlink vs. something else, in my suggestion the only
thing that should be reported per-netdev is the mapping between the
netdev and the EEC. Similar to the way user space determines the mapping
from netdev to PHC via ETHTOOL_GET_TS_INFO. If we go with rtnetlink,
this can be reported as a new attribute in RTM_NEWLINK, no need to add
new messages.
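
For reference, the existing netdev-to-PHC lookup mentioned above can be
sketched as follows. ETHTOOL_GET_TS_INFO, struct ethtool_ts_info and
SIOCETHTOOL are existing uAPI; the helper name netdev_to_phc is made up for
this example, and no equivalent EEC query exists in the posted series.

```c
#define _DEFAULT_SOURCE /* for struct ifreq under strict -std= modes */
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/*
 * Look up the PHC index associated with a netdev.
 * Returns 0 and stores the index in *phc_index on success (the kernel
 * reports -1 there when the device has no PHC), or -1 on any failure.
 */
int netdev_to_phc(const char *ifname, int *phc_index)
{
	struct ethtool_ts_info info;
	struct ifreq ifr;
	int fd, ret;

	assert(ifname && phc_index);

	memset(&info, 0, sizeof(info));
	memset(&ifr, 0, sizeof(ifr));
	info.cmd = ETHTOOL_GET_TS_INFO;
	/* ifr is zeroed, so the name stays NUL-terminated. */
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&info;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
	if (ret < 0)
		return -1;

	*phc_index = info.phc_index;
	return 0;
}
```

A netdev-to-EEC mapping could follow the same pattern: the netdev only
reports an index, and everything else is queried from the object that index
refers to.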
Jakub Kicinski Nov. 8, 2021, 5:03 p.m. UTC | #4
On Mon, 8 Nov 2021 18:29:50 +0200 Ido Schimmel wrote:
> I also want to re-iterate my dissatisfaction with the interface being
> netdev-centric. By modelling the EEC as a standalone object we will be
> able to extend it to set the source of the EEC to something other than a
> netdev in the future. If we don't do it now, we will end up with two
> ways to report the source of the EEC (i.e., EEC_SRC_PORT and something
> else).
> 
> Other advantages of modelling the EEC as a separate object include the
> ability for user space to determine the mapping between netdevs and EECs
> (currently impossible) and reporting additional EEC attributes such as
> SyncE clockIdentity and default SSM code. There is really no reason to
> report all of this identical information via multiple netdevs.

Indeed, I feel convinced. I believe the OCP timing card will benefit
from such an API as well. I pinged Jonathan; if he doesn't have cycles,
I'll do the typing.

What do you have in mind for the driver abstracting away pin selection?
For a standalone clock fed a PPS signal from a backplate this will be
impossible, so we may need some middle way.
Petr Machata Nov. 8, 2021, 6 p.m. UTC | #5
Maciej Machnikowski <maciej.machnikowski@intel.com> writes:

> Add Documentation/networking/synce.rst describing new RTNL messages
> and respective NDO ops supporting SyncE (Synchronous Ethernet).
>
> Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
> ---
>  Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++
>  1 file changed, 117 insertions(+)
>  create mode 100644 Documentation/networking/synce.rst
>
> diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst
> new file mode 100644
> index 000000000000..4ca41fb9a481
> --- /dev/null
> +++ b/Documentation/networking/synce.rst
> @@ -0,0 +1,117 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================
> +Synchronous Ethernet
> +====================
> +
> +Synchronous Ethernet networks use a physical layer clock to syntonize
> +the frequency across different network elements.
> +
> +A basic SyncE node, as defined in ITU-T G.8264, consists of an Ethernet
> +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks
> +and a dedicated TX clock input that is used to transmit data to other nodes.
> +
> +The SyncE capable PHY is able to recover the incoming frequency of the data
> +stream on RX lanes and redirect it (sometimes dividing it) to recovered
> +clock outputs. In SyncE PHY the TX frequency is directly dependent on the
> +input frequency - either on the PHY CLK input, or on a dedicated
> +TX clock input.
> +
> +      ┌───────────┬──────────┐
> +      │ RX        │ TX       │
> +  1   │ lanes     │ lanes    │ 1
> +  ───►├──────┐    │          ├─────►
> +  2   │      │    │          │ 2
> +  ───►├──┐   │    │          ├─────►
> +  3   │  │   │    │          │ 3
> +  ───►├─▼▼   ▼    │          ├─────►
> +      │ ──────    │          │
> +      │ \____/    │          │
> +      └──┼──┼─────┴──────────┘
> +        1│ 2│        ▲
> + RCLK out│  │        │ TX CLK in
> +         ▼  ▼        │
> +       ┌─────────────┴───┐
> +       │                 │
> +       │       EEC       │
> +       │                 │
> +       └─────────────────┘
> +
> +The EEC can synchronize its frequency to one of the synchronization inputs:
> +either clocks recovered on traffic interfaces or (in advanced deployments)
> +external frequency sources.
> +
> +Some EEC implementations can select synchronization source through
> +priority tables and synchronization status messaging and provide necessary
> +filtering and holdover capabilities.
> +
> +The following interface can be applicable to different packet network types
> +following ITU-T G.8261/G.8262 recommendations.
> +
> +Interface
> +=========
> +
> +The following RTNL messages are used to read/configure SyncE recovered
> +clocks.
> +
> +RTM_GETRCLKRANGE
> +-----------------
> +Reads the allowed pin index range for the recovered clock outputs.
> +This can be aligned to PHY outputs or to EEC inputs, whichever is
> +better for a given application.
> +Will call the ndo_get_rclk_range function to read the allowed range
> +of output pin indexes.
> +Will call ndo_get_rclk_range to determine the allowed recovered clock
> +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
> +IFLA_RCLK_RANGE_MAX_PIN attributes
> +
> +RTM_GETRCLKSTATE
> +-----------------
> +Read the state of recovered pins that output recovered clock from
> +a given port. The message will contain the number of assigned clocks
> +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in IFLA_RCLK_STATE_OUT_IDX
> +To support multiple recovered clock outputs from the same port, this message
> +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of
> +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes
> +listing the active output indexes.
> +This message will call the ndo_get_rclk_range to determine the allowed
> +recovered clock indexes and then will loop through them, calling
> +the ndo_get_rclk_state for each of them.

Let me make sure I understand the model that you propose. Specifically
from the point of view of a multi-port device, because that's my
immediate use case.

RTM_GETRCLKRANGE would report number of "pins" that matches the number
of lanes in the system. So e.g. a 32-port switch, where each port has 4
lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or
whatever.)

RTM_GETRCLKSTATE would then return some subset of those pins, depending
on which lanes actually managed to establish a connection and carry a
valid clock signal. So, say, [1, 2, 3, 4] if the first port has, e.g., a
100Gbps link established.
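
A minimal sketch of that reading of the model, assuming pins are numbered
contiguously from 1 and every port owns exactly four lanes. The constants
and helper names are hypothetical and only illustrate the 32-port example
above; nothing in the posted patch guarantees this numbering.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS      32u /* from the example above */
#define LANES_PER_PORT 4u  /* from the example above */

/* 1-based pin index for a 1-based (port, lane) pair. */
uint32_t lane_to_pin(uint32_t port, uint32_t lane)
{
	assert(port >= 1 && port <= NUM_PORTS);
	assert(lane >= 1 && lane <= LANES_PER_PORT);
	return (port - 1) * LANES_PER_PORT + lane;
}

/* Reverse mapping: which 1-based port owns a given 1-based pin. */
uint32_t pin_to_port(uint32_t pin)
{
	assert(pin >= 1 && pin <= NUM_PORTS * LANES_PER_PORT);
	return (pin - 1) / LANES_PER_PORT + 1;
}
```

Under these assumptions port 1 owns pins 1 to 4 and port 32 ends at pin 128,
which matches the [1; 128] range above.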

> +
> +RTM_SETRCLKSTATE
> +-----------------
> +Sets the redirection of the recovered clock for a given pin. This message
> +expects one attribute:
> +struct if_set_rclk_msg {
> +	__u32 ifindex; /* interface index */
> +	__u32 out_idx; /* output index (from a valid range) */
> +	__u32 flags; /* configuration flags */
> +};
> +
> +Supported flags are:
> +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> +		     if clear - the output will be disabled.

OK, so here I set up the tracking. ifindex tells me which EEC to
configure, out_idx is the pin to track, flags tell me whether to set up
the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
I somehow know that lane 2 has the best clock.


If the above is broadly correct, I've got some questions.

First, what if more than one out_idx is set? What are drivers / HW meant
to do with this? What is the expected behavior?

Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope: one
reports which pins carry a clock signal, the other influences tracking.
That seems wrong. There also does not seem to be a UAPI to retrieve
the tracking settings.

Second, as a user-space client, how do I know that if ports 1 and 2 both
report pin range [A; B], that they both actually share the same
underlying EEC? Is there some sort of coordination among the drivers,
such that each pin in the system has a unique ID?

Further, how do I actually know the mapping from ports to pins? E.g. as
a user, I might know my master is behind swp1. How do I know what pins
correspond to that port? As a user-space tool author, how do I help
users to do something like "eec set clock eec0 track swp1"?

Additionally, how would things like external GPSs or 1pps be modeled? I
guess the driver would know about such interface, and would expose it as
a "pin". When the GPS signal locks, the driver starts reporting the pin
in the RCLK set. Then it is possible to set up tracking of that pin.


It seems to me it would be easier to understand, and to write user-space
tools and drivers for, a model that has EEC as an explicit first-class
object. That's where the EEC state naturally belongs, that's where the
pin range naturally belongs. Netdevs should have a reference to EEC and
pins, not present this information as if they own it. A first-class EEC
would also allow us to figure out later how to hook up PHC and EEC.
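
A rough sketch of what such a first-class object could look like, purely to
illustrate the ownership: the state names come from the patch, but their
order in the enum, both structs and both helpers are assumptions made for
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State names from the patch; the numeric order here is an assumption. */
enum if_eec_state {
	IF_EEC_STATE_INVALID,
	IF_EEC_STATE_FREERUN,
	IF_EEC_STATE_LOCKED,
	IF_EEC_STATE_LOCKED_HO_ACQ,
	IF_EEC_STATE_HOLDOVER,
};

/* Hypothetical standalone EEC: state and source live here, not in a netdev. */
struct eec {
	uint32_t id;
	enum if_eec_state state;
	int src_ifindex; /* ifindex being tracked, or -1 for none */
};

/* Hypothetical per-netdev view: only a reference to the EEC. */
struct port {
	int ifindex;
	const struct eec *eec; /* NULL if the port has no EEC */
};

/* Two ports share an EEC exactly when they reference the same object. */
bool ports_share_eec(const struct port *a, const struct port *b)
{
	assert(a && b);
	return a->eec != NULL && a->eec == b->eec;
}

/* Rough equivalent of "eec set clock eec0 track swp1". */
int eec_track_port(struct eec *eec, const struct port *p)
{
	assert(eec && p);
	if (p->eec != eec)
		return -1; /* port is not attached to this EEC */
	eec->src_ifindex = p->ifindex;
	return 0;
}
```

With this layout the question of whether two ports share an EEC is answered
by comparing references, and a non-netdev source would only need another
kind of source field on the EEC.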

> +
> +RTM_GETEECSTATE
> +----------------
> +Reads the state of the EEC or equivalent physical clock synchronizer.
> +This message returns the following attributes:
> +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
> +		 The states returned in this attribute are aligned to the
> +		 ITU-T G.781 and are:
> +		  IF_EEC_STATE_INVALID - state is not valid
> +		  IF_EEC_STATE_FREERUN - clock is free-running
> +		  IF_EEC_STATE_LOCKED - clock is locked to the reference,
> +		                        but the holdover memory is not valid
> +		  IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
> +		                               and holdover memory is valid
> +		  IF_EEC_STATE_HOLDOVER - clock is in holdover mode
> +State is read from the netdev calling the:
> +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
> +			 u32 *src_idx, struct netlink_ext_ack *extack);
> +
> +IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
> +		   is used for the current IFLA_EEC_STATE, i.e., the index of
> +		   the pin that the EEC is locked to.
> +
> +Will be returned only if the ndo_get_eec_src is implemented.
Machnikowski, Maciej Nov. 9, 2021, 10:32 a.m. UTC | #6
> -----Original Message-----
> From: Ido Schimmel <idosch@idosch.org>
> Sent: Monday, November 8, 2021 5:30 PM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> On Mon, Nov 08, 2021 at 08:35:17AM +0000, Machnikowski, Maciej wrote:
> > > -----Original Message-----
> > > From: Ido Schimmel <idosch@idosch.org>
> > > Sent: Sunday, November 7, 2021 3:09 PM
> > > To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> > > Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> > > interfaces
> > >
> > > On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
> > > > +Interface
> > > > +=========
> > > > +
> > > > +The following RTNL messages are used to read/configure SyncE recovered
> > > > +clocks.
> > > > +
> > > > +RTM_GETRCLKRANGE
> > > > +-----------------
> > > > +Reads the allowed pin index range for the recovered clock outputs.
> > > > +This can be aligned to PHY outputs or to EEC inputs, whichever is
> > > > +better for a given application.
> > >
> > > Can you explain the difference between PHY outputs and EEC inputs? It is
> > > no clear to me from the diagram.
> >
> > PHY is the source of frequency for the EEC, so PHY produces the reference
> > And EEC synchronizes to it.
> >
> > Both PHY outputs and EEC inputs are configurable. PHY outputs usually are
> > configured using PHY registers, and EEC inputs in the DPLL references
> > block
> >
> > > How would the diagram look in a multi-port adapter where you have a
> > > single EEC?
> >
> > That depends. It can be either a multiport PHY - in this case it will look
> > exactly like the one I drawn. In case we have multiple PHYs their recovered
> > clock outputs will go to different recovered clock inputs and each PHY
> > TX clock inputs will be driven from different EEC's synchronized outputs
> > or from a single one through  clock fan out.
> >
> > > > +Will call the ndo_get_rclk_range function to read the allowed range
> > > > +of output pin indexes.
> > > > +Will call ndo_get_rclk_range to determine the allowed recovered clock
> > > > +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
> > > > +IFLA_RCLK_RANGE_MAX_PIN attributes
> > >
> > > The first sentence seems to be redundant
> > >
> > > > +
> > > > +RTM_GETRCLKSTATE
> > > > +-----------------
> > > > +Read the state of recovered pins that output recovered clock from
> > > > +a given port. The message will contain the number of assigned clocks
> > > > +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in IFLA_RCLK_STATE_OUT_IDX
> > > > +To support multiple recovered clock outputs from the same port, this message
> > > > +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of
> > > > +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes
> > > > +listing the active output indexes.
> > > > +This message will call the ndo_get_rclk_range to determine the allowed
> > > > +recovered clock indexes and then will loop through them, calling
> > > > +the ndo_get_rclk_state for each of them.
> > >
> > > > Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't
> > > > RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the
> > > > range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the
> > > > state (enabled / disabled) for all
> >
> > Great idea! Will implement it.
> >
> > > > +
> > > > +RTM_SETRCLKSTATE
> > > > +-----------------
> > > > > +Sets the redirection of the recovered clock for a given pin. This message
> > > > > +expects one attribute:
> > > > > +struct if_set_rclk_msg {
> > > > > +	__u32 ifindex; /* interface index */
> > > > > +	__u32 out_idx; /* output index (from a valid range) */
> > > > > +	__u32 flags; /* configuration flags */
> > > > > +};
> > > > > +
> > > > > +Supported flags are:
> > > > > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> > > > > +		     if clear - the output will be disabled.
> > >
> > > > In the diagram you have two recovered clock outputs going into the EEC.
> > > > To which of them is the EEC synchronized?
> >
> > That will depend on the future DPLL configuration. For now it'll be based
> > on the DPLL's auto select ability and its default configuration.
> >
> > > How does user space know which pins to enable?
> >
> > That's why the RTM_GETRCLKRANGE was invented, but I like the suggestion
> > you made above, so I will rework the code to remove the range one and
> > just return the indexes with an enable/disable bit for each of them. In this
> > case userspace will just send the RTM_GETRCLKSTATE to learn what
> > can be enabled.
> 
> In the diagram there are multiple Rx lanes, all of which might be used
> by the same port. How does user space know to differentiate between the
> quality levels of the clock signal recovered from each lane / pin when
> the information is transmitted on a per-port basis via ESMC messages?

The lines represent different ports - not necessarily lanes. My bad - will fix.

> The uAPI seems to be too low-level and is not compatible with Nvidia's
> devices and potentially other vendors. We really just need a logical
> interface that says "Synchronize the frequency of the EEC to the clock
> recovered from port X". The kernel / drivers should abstract the inner
> workings of the device from user space. Any reason this can't work for
> ice?

You can build a very simple solution with just one recovered clock index and
implement exactly what you described. RTM_SETRCLKSTATE will only set the
redirection and RTM_GETRCLKSTATE will read the current HW setting of
what's enabled.
 
> I also want to re-iterate my dissatisfaction with the interface being
> netdev-centric. By modelling the EEC as a standalone object we will be
> able to extend it to set the source of the EEC to something other than a
> netdev in the future. If we don't do it now, we will end up with two
> ways to report the source of the EEC (i.e., EEC_SRC_PORT and something
> else).
> 
> Other advantages of modelling the EEC as a separate object include the
> ability for user space to determine the mapping between netdevs and EECs
> (currently impossible) and reporting additional EEC attributes such as
> SyncE clockIdentity and default SSM code. There is really no reason to
> report all of this identical information via multiple netdevs.
>
> With regards to rtnetlink vs. something else, in my suggestion the only
> thing that should be reported per-netdev is the mapping between the
> netdev and the EEC. Similar to the way user space determines the mapping
> from netdev to PHC via ETHTOOL_GET_TS_INFO. If we go with rtnetlink,
> this can be reported as a new attribute in RTM_NEWLINK, no need to add
> new messages.

Will answer that in the following mail.
Machnikowski, Maciej Nov. 9, 2021, 10:43 a.m. UTC | #7
> -----Original Message-----
> From: Petr Machata <petrm@nvidia.com>
> Sent: Monday, November 8, 2021 7:00 PM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Cc: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org;
> richardcochran@gmail.com; abyagowi@fb.com; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; davem@davemloft.net; kuba@kernel.org;
> linux-kselftest@vger.kernel.org; idosch@idosch.org; mkubecek@suse.cz;
> saeed@kernel.org; michael.chan@broadcom.com
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> 
> Maciej Machnikowski <maciej.machnikowski@intel.com> writes:
> 
> > Add Documentation/networking/synce.rst describing new RTNL messages
> > and respective NDO ops supporting SyncE (Synchronous Ethernet).
> >
> > Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
> > ---
> >  Documentation/networking/synce.rst | 117
> +++++++++++++++++++++++++++++
> >  1 file changed, 117 insertions(+)
> >  create mode 100644 Documentation/networking/synce.rst
> >
> > diff --git a/Documentation/networking/synce.rst
> b/Documentation/networking/synce.rst
> > new file mode 100644
> > index 000000000000..4ca41fb9a481
> > --- /dev/null
> > +++ b/Documentation/networking/synce.rst
> > @@ -0,0 +1,117 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +====================
> > +Synchronous Ethernet
> > +====================
> > +
> > +Synchronous Ethernet networks use a physical layer clock to syntonize
> > +the frequency across different network elements.
> > +
> > +A basic SyncE node, as defined in ITU-T G.8264, consists of an Ethernet
> > +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks
> > +and a dedicated TX clock input that is used to transmit data to other nodes.
> > +
> > +The SyncE capable PHY is able to recover the incoming frequency of the data
> > +stream on RX lanes and redirect it (sometimes dividing it) to recovered
> > +clock outputs. In SyncE PHY the TX frequency is directly dependent on the
> > +input frequency - either on the PHY CLK input, or on a dedicated
> > +TX clock input.
> > +
> > +      ┌───────────┬──────────┐
> > +      │ RX        │ TX       │
> > +  1   │ lanes     │ lanes    │ 1
> > +  ───►├──────┐    │          ├─────►
> > +  2   │      │    │          │ 2
> > +  ───►├──┐   │    │          ├─────►
> > +  3   │  │   │    │          │ 3
> > +  ───►├─▼▼   ▼    │          ├─────►
> > +      │ ──────    │          │
> > +      │ \____/    │          │
> > +      └──┼──┼─────┴──────────┘
> > +        1│ 2│        ▲
> > + RCLK out│  │        │ TX CLK in
> > +         ▼  ▼        │
> > +       ┌─────────────┴───┐
> > +       │                 │
> > +       │       EEC       │
> > +       │                 │
> > +       └─────────────────┘
> > +
> > +The EEC can synchronize its frequency to one of the synchronization inputs:
> > +either clocks recovered on traffic interfaces or (in advanced deployments)
> > +external frequency sources.
> > +
> > +Some EEC implementations can select the synchronization source through
> > +priority tables and synchronization status messaging and provide the necessary
> > +filtering and holdover capabilities.
> > +
> > +The following interface can be applicable to different packet network types
> > +following ITU-T G.8261/G.8262 recommendations.
> > +
> > +Interface
> > +=========
> > +
> > +The following RTNL messages are used to read/configure SyncE recovered
> > +clocks.
> > +
> > +RTM_GETRCLKRANGE
> > +-----------------
> > +Reads the allowed pin index range for the recovered clock outputs.
> > +This can be aligned to PHY outputs or to EEC inputs, whichever is
> > +better for a given application.
> > +Will call the ndo_get_rclk_range function to read the allowed range
> > +of output pin indexes.
> > +Will call ndo_get_rclk_range to determine the allowed recovered clock
> > +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
> > +IFLA_RCLK_RANGE_MAX_PIN attributes
> > +
> > +RTM_GETRCLKSTATE
> > +-----------------
> > +Read the state of recovered pins that output recovered clock from
> > +a given port. The message will contain the number of assigned clocks
> > +(IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX.
> > +To support multiple recovered clock outputs from the same port, this message
> > +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of
> > +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes
> > +listing the active output indexes.
> > +This message will call the ndo_get_rclk_range to determine the allowed
> > +recovered clock indexes and then will loop through them, calling
> > +the ndo_get_rclk_state for each of them.
> 
> Let me make sure I understand the model that you propose. Specifically
> from the point of view of a multi-port device, because that's my
> immediate use case.
> 
> RTM_GETRCLKRANGE would report number of "pins" that matches the number
> of lanes in the system. So e.g. a 32-port switch, where each port has 4
> lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or
> whatever.)
> 
> RTM_GETRCLKSTATE would then return some subset of those pins, depending
> on which lanes actually managed to establish a connection and carry a
> valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a
> 100Gbps established.
> 

Those 2 will be merged into a single RTM_GETRCLKSTATE that will report
the state of all available pins for a given port.

Also lanes here should really be ports - will fix in next revision.

But the logic will be:
Call RTM_GETRCLKSTATE. It will return the list of pins and their state
for a given port. Once you have read the pin states, you send RTM_SETRCLKSTATE
to enable the redirection to a given RCLK output from the PHY. If your DPLL/EEC
is configured to accept it automatically, that is all you need to do; you then
wait for the right state of the EEC (locked/locked with HO).
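
As a rough illustration of that flow (not code from the patch series), the
sketch below picks a pin to enable from a per-pin state list and then checks
for a locked EEC. The struct rclk_pin_state record and both helper names are
hypothetical, and the numeric values of the enum are assumed; only the
IF_EEC_STATE_* names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State names follow the patch; the numeric values are assumed. */
enum if_eec_state {
	IF_EEC_STATE_INVALID,
	IF_EEC_STATE_FREERUN,
	IF_EEC_STATE_LOCKED,
	IF_EEC_STATE_LOCKED_HO_ACQ,
	IF_EEC_STATE_HOLDOVER,
};

/* Hypothetical per-pin record, as a merged RTM_GETRCLKSTATE might report it. */
struct rclk_pin_state {
	uint32_t out_idx;
	bool enabled;
};

/* Return the out_idx of the first pin that is not yet enabled, or -1. */
int first_disabled_pin(const struct rclk_pin_state *pins, size_t n)
{
	assert(pins != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!pins[i].enabled)
			return (int)pins[i].out_idx;
	return -1;
}

/* True once the EEC reports either of the two locked states. */
bool eec_is_locked(enum if_eec_state s)
{
	return s == IF_EEC_STATE_LOCKED || s == IF_EEC_STATE_LOCKED_HO_ACQ;
}
```

A caller would enable the returned pin with RTM_SETRCLKSTATE and then poll the
EEC state until eec_is_locked() holds.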

> > +
> > +RTM_SETRCLKSTATE
> > +-----------------
> > +Sets the redirection of the recovered clock for a given pin. This message
> > +expects one attribute:
> > +struct if_set_rclk_msg {
> > +	__u32 ifindex; /* interface index */
> > +	__u32 out_idx; /* output index (from a valid range) */
> > +	__u32 flags; /* configuration flags */
> > +};
> > +
> > +Supported flags are:
> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> > +		     if clear - the output will be disabled.
> 
> OK, so here I set up the tracking. ifindex tells me which EEC to
> configure, out_idx is the pin to track, flags tell me whether to set up
> the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
> I somehow know that lane 2 has the best clock.

It's bound to ifindex to know which PHY port you interact with. It has nothing to
do with the EEC yet.
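
To make the request concrete, here is a minimal sketch of filling the message
for one PHY port. The struct layout is copied from the patch (with uint32_t
standing in for __u32); the bit value of SET_RCLK_FLAGS_ENA and the helper
make_set_rclk_msg() are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout copied from the patch, with uint32_t standing in for __u32. */
struct if_set_rclk_msg {
	uint32_t ifindex; /* interface index */
	uint32_t out_idx; /* output index (from a valid range) */
	uint32_t flags;   /* configuration flags */
};

/* Assumed bit value; the thread does not show the real definition. */
#define SET_RCLK_FLAGS_ENA (1U << 0)

/* Fill a request that enables or disables one recovered clock output. */
struct if_set_rclk_msg make_set_rclk_msg(uint32_t ifindex, uint32_t out_idx,
					 bool enable)
{
	struct if_set_rclk_msg msg = {
		.ifindex = ifindex,
		.out_idx = out_idx,
		.flags = enable ? SET_RCLK_FLAGS_ENA : 0,
	};

	assert(ifindex != 0); /* ifindex 0 never names a real netdev */
	return msg;
}
```

Nothing in the message names an EEC: it only selects a port (ifindex) and one
of that port's recovered clock outputs (out_idx).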
 
> If the above is broadly correct, I've got some questions.
> 
> First, what if more than one out_idx is set? What are drivers / HW meant
> to do with this? What is the expected behavior?

Expected behavior is deployment specific. You can use different PHY recovered
clock outputs to implement active/passive mode of clock failover.

> Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope: one
> reports which pins carry a clock signal, the other influences tracking.
> That seems wrong. There also does not seem to be a UAPI to retrieve
> the tracking settings.

They don't. GET reads the redirection state and SET sets it - nothing more,
nothing less. In ICE we use EEC pin indexes so that the model translates more
easily to the one we will have once the DPLL subsystem is supported.

> Second, as a user-space client, how do I know that if ports 1 and 2 both
> report pin range [A; B], that they both actually share the same
> underlying EEC? Is there some sort of coordination among the drivers,
> such that each pin in the system has a unique ID?

For now you can't know, as we don't have an EEC subsystem. But that can be
solved temporarily with a config file.

> Further, how do I actually know the mapping from ports to pins? E.g. as
> a user, I might know my master is behind swp1. How do I know what pins
> correspond to that port? As a user-space tool author, how do I help
> users to do something like "eec set clock eec0 track swp1"?

That's why the driver needs to be smart there and return indexes properly.

> Additionally, how would things like external GPSs or 1pps be modeled? I
> guess the driver would know about such interface, and would expose it as
> a "pin". When the GPS signal locks, the driver starts reporting the pin
> in the RCLK set. Then it is possible to set up tracking of that pin.

That won't be enabled before we get the DPLL subsystem ready.
 
> It seems to me it would be easier to understand, and to write user-space
> tools and drivers for, a model that has EEC as an explicit first-class
> object. That's where the EEC state naturally belongs, that's where the
> pin range naturally belongs. Netdevs should have a reference to EEC and
> pins, not present this information as if they own it. A first-class EEC
> would also allow to later figure out how to hook up PHC and EEC.

We have the userspace tool, but can’t upstream it until we define kernel
interfaces. It's a catch-22 :(

Regards
Maciek

> > +
> > +RTM_GETEECSTATE
> > +----------------
> > +Reads the state of the EEC or equivalent physical clock synchronizer.
> > +This message returns the following attributes:
> > +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
> > +		 The states returned in this attribute are aligned to the
> > +		 ITU-T G.781 and are:
> > +		  IF_EEC_STATE_INVALID - state is not valid
> > +		  IF_EEC_STATE_FREERUN - clock is free-running
> > +		  IF_EEC_STATE_LOCKED - clock is locked to the reference,
> > +		                        but the holdover memory is not valid
> > +		  IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
> > +		                               and holdover memory is valid
> > +		  IF_EEC_STATE_HOLDOVER - clock is in holdover mode
> > +State is read from the netdev calling the:
> > +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
> > +			 u32 *src_idx, struct netlink_ext_ack *extack);
> > +
> > +IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
> > +		   is used for the current IFLA_EEC_STATE, i.e., the index of
> > +		   the pin that the EEC is locked to.
> > +
> > +Will be returned only if the ndo_get_eec_src is implemented.
Machnikowski, Maciej Nov. 9, 2021, 10:50 a.m. UTC | #8
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Monday, November 8, 2021 6:03 PM
> To: Ido Schimmel <idosch@idosch.org>
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> On Mon, 8 Nov 2021 18:29:50 +0200 Ido Schimmel wrote:
> > I also want to re-iterate my dissatisfaction with the interface being
> > netdev-centric. By modelling the EEC as a standalone object we will be
> > able to extend it to set the source of the EEC to something other than a
> > netdev in the future. If we don't do it now, we will end up with two
> > ways to report the source of the EEC (i.e., EEC_SRC_PORT and something
> > else).
> >
> > Other advantages of modelling the EEC as a separate object include the
> > ability for user space to determine the mapping between netdevs and EECs
> > (currently impossible) and reporting additional EEC attributes such as
> > SyncE clockIdentity and default SSM code. There is really no reason to
> > report all of this identical information via multiple netdevs.
> 
> Indeed, I feel convinced. I believe the OCP timing card will benefit
> from such API as well. I pinged Jonathan if he doesn't have cycles
> I'll do the typing.
> 
> What do you have in mind for driver abstracting away pin selection?
> For a standalone clock fed PPS signal from a backplate this will be
> impossible, so we may need some middle way.

Me too! Yet it'll take a lot of time to implement. My thinking was to
implement the simplest usable EEC state that is applicable to all
solutions (like 1GBaseT, which doesn't always require an external DPLL to
enable SyncE), keep an option to return the state for netdev-specific use
cases, and easily enable the new path when it's available. We can just check
whether the driver is connected to a DPLL in the future DPLL subsystem and
reroute the GET_EECSTATE call there.

We can also fix the mapping by adding the DPLL_IDX attribute.

The DPLL subsystem will require a very flexible pin model, as there is a lot
to configure inside the DPLL to enable many use cases.
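
A minimal sketch of the rerouting idea, purely illustrative: every name below
(struct eec_port, DPLL_IDX_NONE, eec_get_state() and the demo callbacks) is
hypothetical and is not part of the patch series or of any existing kernel
interface. It only shows a state query falling back to the per-netdev path
when no DPLL index is attached.

```c
#include <assert.h>
#include <stddef.h>

/* State names follow the patch; the numeric values are assumed. */
enum if_eec_state {
	IF_EEC_STATE_INVALID,
	IF_EEC_STATE_FREERUN,
	IF_EEC_STATE_LOCKED,
	IF_EEC_STATE_LOCKED_HO_ACQ,
	IF_EEC_STATE_HOLDOVER,
};

#define DPLL_IDX_NONE (-1) /* hypothetical "no DPLL attached" marker */

typedef enum if_eec_state (*eec_state_fn)(int idx);

struct eec_port {
	int ifindex;
	int dpll_idx;            /* DPLL_IDX_NONE until a DPLL is attached */
	eec_state_fn netdev_get; /* per-netdev path, always present */
	eec_state_fn dpll_get;   /* future DPLL path, may be NULL */
};

/* Use the DPLL path when one is attached, otherwise ask the netdev. */
enum if_eec_state eec_get_state(const struct eec_port *p)
{
	assert(p != NULL && p->netdev_get != NULL);
	if (p->dpll_idx != DPLL_IDX_NONE && p->dpll_get != NULL)
		return p->dpll_get(p->dpll_idx);
	return p->netdev_get(p->ifindex);
}

/* Stand-in callbacks so the sketch can be exercised without hardware. */
enum if_eec_state demo_netdev_state(int ifindex)
{
	(void)ifindex;
	return IF_EEC_STATE_FREERUN;
}

enum if_eec_state demo_dpll_state(int dpll_idx)
{
	(void)dpll_idx;
	return IF_EEC_STATE_LOCKED;
}
```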
Petr Machata Nov. 9, 2021, 2:52 p.m. UTC | #9
Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:

>> Maciej Machnikowski <maciej.machnikowski@intel.com> writes:
>> 
>> > +====================
>> > +Synchronous Ethernet
>> > +====================
>> > +
>> > +Synchronous Ethernet networks use a physical layer clock to syntonize
>> > +the frequency across different network elements.
>> > +
>> > +A basic SyncE node, as defined in ITU-T G.8264, consists of an Ethernet
>> > +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks
>> > +and a dedicated TX clock input that is used to transmit data to other nodes.
>> > +
>> > +The SyncE capable PHY is able to recover the incoming frequency of the data
>> > +stream on RX lanes and redirect it (sometimes dividing it) to recovered
>> > +clock outputs. In SyncE PHY the TX frequency is directly dependent on the
>> > +input frequency - either on the PHY CLK input, or on a dedicated
>> > +TX clock input.
>> > +
>> > +      ┌───────────┬──────────┐
>> > +      │ RX        │ TX       │
>> > +  1   │ lanes     │ lanes    │ 1
>> > +  ───►├──────┐    │          ├─────►
>> > +  2   │      │    │          │ 2
>> > +  ───►├──┐   │    │          ├─────►
>> > +  3   │  │   │    │          │ 3
>> > +  ───►├─▼▼   ▼    │          ├─────►
>> > +      │ ──────    │          │
>> > +      │ \____/    │          │
>> > +      └──┼──┼─────┴──────────┘
>> > +        1│ 2│        ▲
>> > + RCLK out│  │        │ TX CLK in
>> > +         ▼  ▼        │
>> > +       ┌─────────────┴───┐
>> > +       │                 │
>> > +       │       EEC       │
>> > +       │                 │
>> > +       └─────────────────┘
>> > +
>> > +The EEC can synchronize its frequency to one of the synchronization
>> inputs
>> > +either clocks recovered on traffic interfaces or (in advanced deployments)
>> > +external frequency sources.
>> > +
>> > +Some EEC implementations can select synchronization source through
>> > +priority tables and synchronization status messaging and provide
>> necessary
>> > +filtering and holdover capabilities.
>> > +
>> > +The following interface can be applicable to different packet network types
>> > +following ITU-T G.8261/G.8262 recommendations.
>> > +
>> > +Interface
>> > +=========
>> > +
>> > +The following RTNL messages are used to read/configure SyncE recovered
>> > +clocks.
>> > +
>> > +RTM_GETRCLKRANGE
>> > +-----------------
>> > +Reads the allowed pin index range for the recovered clock outputs.
>> > +This can be aligned to PHY outputs or to EEC inputs, whichever is
>> > +better for a given application.
>> > +Will call the ndo_get_rclk_range function to read the allowed range
>> > +of output pin indexes.
>> > +Will call ndo_get_rclk_range to determine the allowed recovered clock
>> > +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
>> > +IFLA_RCLK_RANGE_MAX_PIN attributes
>> > +
>> > +RTM_GETRCLKSTATE
>> > +-----------------
>> > +Read the state of recovered pins that output recovered clock from
>> > +a given port. The message will contain the number of assigned clocks
>> > +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in
>> IFLA_RCLK_STATE_OUT_IDX
>> > +To support multiple recovered clock outputs from the same port, this
>> message
>> > +will return the IFLA_RCLK_STATE_COUNT attribute containing the number
>> of
>> > +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX
>> attributes
>> > +listing the active output indexes.
>> > +This message will call the ndo_get_rclk_range to determine the allowed
>> > +recovered clock indexes and then will loop through them, calling
>> > +the ndo_get_rclk_state for each of them.
>> 
>> Let me make sure I understand the model that you propose. Specifically
>> from the point of view of a multi-port device, because that's my
>> immediate use case.
>> 
>> RTM_GETRCLKRANGE would report number of "pins" that matches the
>> number
>> of lanes in the system. So e.g. a 32-port switch, where each port has 4
>> lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or
>> whatever.)
>> 
>> RTM_GETRCLKSTATE would then return some subset of those pins,
>> depending
>> on which lanes actually managed to establish a connection and carry a
>> valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a
>> 100Gbps established.
>> 
>
> Those 2 will be merged into a single RTM_GETRCLKSTATE that will report
> the state of all available pins for a given port.
>
> Also lanes here should really be ports - will fix in next revision.
>
> But the logic will be: 
> Call the RTM_GETRCLKSTATE. It will return the list of pins and their state
> for a given port. Once you read the range you will send the RTM_SETRCLKSTATE
> to enable the redirection to a given RCLK output from the PHY. If your DPLL/EEC
> is configured to accept it automatically - it's all you need to do and you need to
> wait for the right state of the EEC (locked/locked with HO).

Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.

>> > +
>> > +RTM_SETRCLKSTATE
>> > +-----------------
>> > +Sets the redirection of the recovered clock for a given pin. This message
>> > +expects one attribute:
>> > +struct if_set_rclk_msg {
>> > +	__u32 ifindex; /* interface index */
>> > +	__u32 out_idx; /* output index (from a valid range) */
>> > +	__u32 flags; /* configuration flags */
>> > +};
>> > +
>> > +Supported flags are:
>> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
>> > +		     if clear - the output will be disabled.
>> 
>> OK, so here I set up the tracking. ifindex tells me which EEC to
>> configure, out_idx is the pin to track, flags tell me whether to set up
>> the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
>> I somehow know that lane 2 has the best clock.
>
> It's bound to ifindex to know which PHY port you interact with. It has nothing to
> do with the EEC yet.

It has in the sense that I'm configuring "TX CLK in", which leads from
EEC to the port.

>> If the above is broadly correct, I've got some questions.
>> 
>> First, what if more than one out_idx is set? What are drivers / HW meant
>> to do with this? What is the expected behavior?
>
> Expected behavior is deployment specific. You can use different phy recovered
> clock outputs to implement active/passive mode of clock failover.

How? Which one is primary and which one is backup? I just have two
enabled pins...

Wouldn't failover be implementable in a userspace daemon? That would get
a notification from the system that holdover was entered, and can
reconfigure tracking to another pin based on arbitrary rules.
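
For illustration, the selection step of such a daemon could look like the
sketch below. The struct rclk_candidate record, its priority field and
pick_failover_pin() are hypothetical; no notification mechanism or policy
format is defined anywhere in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one recovered clock pin as the daemon tracks it. */
struct rclk_candidate {
	uint32_t out_idx;
	unsigned int priority; /* lower value is preferred; policy is local */
	bool signal_ok;
};

/*
 * Called after a holdover notification: return the out_idx of the best
 * usable pin other than the one currently tracked, or -1 if none exists.
 */
int pick_failover_pin(const struct rclk_candidate *c, size_t n,
		      uint32_t current_idx)
{
	const struct rclk_candidate *best = NULL;

	assert(c != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!c[i].signal_ok || c[i].out_idx == current_idx)
			continue;
		if (best == NULL || c[i].priority < best->priority)
			best = &c[i];
	}
	return best ? (int)best->out_idx : -1;
}
```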

>> Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope:
>> one
>> reports which pins carry a clock signal, the other influences tracking.
>> That seems wrong. There also does not seems to be an UAPI to retrieve
>> the tracking settings.
>
> They don't. Get reads the redirection state and SET sets it - nothing more,
> nothing less. In ICE we use EEC pin indexes so that the model translates easier
> to the one when we support DPLL subsystem.
>
>> Second, as a user-space client, how do I know that if ports 1 and 2 both
>> report pin range [A; B], that they both actually share the same
>> underlying EEC? Is there some sort of coordination among the drivers,
>> such that each pin in the system has a unique ID?
>
> For now we don't, as we don't have EEC subsystem. But that can be solved
> by a config file temporarily.

I think it would be better to model this properly from day one.

>> Further, how do I actually know the mapping from ports to pins? E.g. as
>> a user, I might know my master is behind swp1. How do I know what pins
>> correspond to that port? As a user-space tool author, how do I help
>> users to do something like "eec set clock eec0 track swp1"?
>
> That's why driver needs to be smart there and return indexes properly.

What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just
gives me a min and a max. Is there a policy about how to correlate
numbers in that range to... ifindices, netdevice names, devlink port
numbers, I don't know, something?

How do several drivers coordinate this numbering among themselves? Is
there a core kernel authority that manages pin number de/allocations?

>> Additionally, how would things like external GPSs or 1pps be modeled? I
>> guess the driver would know about such interface, and would expose it as
>> a "pin". When the GPS signal locks, the driver starts reporting the pin
>> in the RCLK set. Then it is possible to set up tracking of that pin.
>
> That won't be enabled before we get the DPLL subsystem ready.

It might prove challenging to retrofit an existing netdev-centric
interface into a more generic model. It would be better to model this
properly from day one, and OK, if we can carve out a subset of that
model to implement now, and leave the rest for later, fine. But the
current model does not strike me as having a natural migration path to
something more generic. E.g. reporting the EEC state through the
interfaces attached to that EEC... like, that will have to stay, even at
a time when it is superseded by a better interface.

>> It seems to me it would be easier to understand, and to write user-space
>> tools and drivers for, a model that has EEC as an explicit first-class
>> object. That's where the EEC state naturally belongs, that's where the
>> pin range naturally belongs. Netdevs should have a reference to EEC and
>> pins, not present this information as if they own it. A first-class EEC
>> would also allow to later figure out how to hook up PHC and EEC.
>
> We have the userspace tool, but can’t upstream it until we define
> kernel interfaces. It's a catch-22 :(

I'm sure you do, presumably you test this somehow. Still, as a potential
consumer of that interface, I will absolutely poke at it to figure out
how to use it, what it lets me do, and what won't work.

BTW, what we've done in the past in a situation like this was, here's
the current submission, here's a pointer to a GIT with more stuff we
plan to send later on, here's a pointer to a GIT with the userspace
stuff. I doubt anybody actually looks at that code, ain't nobody got
time for that, but really there's no catch 22.
Machnikowski, Maciej Nov. 9, 2021, 6:19 p.m. UTC | #10
> -----Original Message-----
> From: Petr Machata <petrm@nvidia.com>
> Sent: Tuesday, November 9, 2021 3:53 PM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> 
> Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:
> 
> >> Maciej Machnikowski <maciej.machnikowski@intel.com> writes:
> >>
> >> RTM_GETRCLKRANGE would report number of "pins" that matches the
> >> number
> >> of lanes in the system. So e.g. a 32-port switch, where each port has 4
> >> lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or
> >> whatever.)
> >>
> >> RTM_GETRCLKSTATE would then return some subset of those pins,
> >> depending
> >> on which lanes actually managed to establish a connection and carry a
> >> valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a
> >> 100Gbps established.
> >>
> >
> > Those 2 will be merged into a single RTM_GETRCLKSTATE that will report
> > the state of all available pins for a given port.
> >
> > Also lanes here should really be ports - will fix in next revision.
> >
> > But the logic will be:
> > Call the RTM_GETRCLKSTATE. It will return the list of pins and their state
> > for a given port. Once you read the range you will send the
> RTM_SETRCLKSTATE
> > to enable the redirection to a given RCLK output from the PHY. If your
> DPLL/EEC
> > is configured to accept it automatically - it's all you need to do and you need
> to
> > wait for the right state of the EEC (locked/locked with HO).
> 
> Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.

The functionality needs to be there, but the message will be gone.
 
> >> > +
> >> > +RTM_SETRCLKSTATE
> >> > +-----------------
> >> > +Sets the redirection of the recovered clock for a given pin. This
> message
> >> > +expects one attribute:
> >> > +struct if_set_rclk_msg {
> >> > +	__u32 ifindex; /* interface index */
> >> > +	__u32 out_idx; /* output index (from a valid range) */
> >> > +	__u32 flags; /* configuration flags */
> >> > +};
> >> > +
> >> > +Supported flags are:
> >> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be
> enabled,
> >> > +		     if clear - the output will be disabled.
> >>
> >> OK, so here I set up the tracking. ifindex tells me which EEC to
> >> configure, out_idx is the pin to track, flags tell me whether to set up
> >> the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
> >> I somehow know that lane 2 has the best clock.
> >
> > It's bound to ifindex to know which PHY port you interact with. It has
> nothing to
> > do with the EEC yet.
> 
> It has in the sense that I'm configuring "TX CLK in", which leads from
> EEC to the port.

At this stage we only enable the recovered clock. EEC may or may not use it
depending on many additional factors.

> >> If the above is broadly correct, I've got some questions.
> >>
> >> First, what if more than one out_idx is set? What are drivers / HW meant
> >> to do with this? What is the expected behavior?
> >
> > Expected behavior is deployment specific. You can use different phy
> recovered
> > clock outputs to implement active/passive mode of clock failover.
> 
> How? Which one is primary and which one is backup? I just have two
> enabled pins...

With this API you only have ports and pins and set up the redirection.
The EEC part is out of the picture and will be part of the DPLL subsystem.

> Wouldn't failover be implementable in a userspace daemon? That would get
> a notification from the system that holdover was entered, and can
> reconfigure tracking to another pin based on arbitrary rules.

Not necessarily. You can deploy the QL-disabled mode and rely on the
local DPLL configuration to manage the switching. In that mode you're
not passing the quality level downstream, so you only need to know if you
have a source.

> >> Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope:
> >> one
> >> reports which pins carry a clock signal, the other influences tracking.
> >> That seems wrong. There also does not seems to be an UAPI to retrieve
> >> the tracking settings.
> >
> > They don't. Get reads the redirection state and SET sets it - nothing more,
> > nothing less. In ICE we use EEC pin indexes so that the model translates
> easier
> > to the one when we support DPLL subsystem.
> >
> >> Second, as a user-space client, how do I know that if ports 1 and 2 both
> >> report pin range [A; B], that they both actually share the same
> >> underlying EEC? Is there some sort of coordination among the drivers,
> >> such that each pin in the system has a unique ID?
> >
> > For now we don't, as we don't have EEC subsystem. But that can be solved
> > by a config file temporarily.
> 
> I think it would be better to model this properly from day one.

I want to propose the simplest API that will work for the simplest device and
follow that with the userspace tool that will help everyone understand
what we need in the DPLL subsystem. Otherwise it'll be hard to explain the
requirements. The only change will be the addition of the DPLL index.
 
> >> Further, how do I actually know the mapping from ports to pins? E.g. as
> >> a user, I might know my master is behind swp1. How do I know what pins
> >> correspond to that port? As a user-space tool author, how do I help
> >> users to do something like "eec set clock eec0 track swp1"?
> >
> > That's why driver needs to be smart there and return indexes properly.
> 
> What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just
> gives me a min and a max. Is there a policy about how to correlate
> numbers in that range to... ifindices, netdevice names, devlink port
> numbers, I don't know, something?

The driver needs to know the underlying HW and report those ranges
correctly.

> How do several drivers coordinate this numbering among themselves? Is
> there a core kernel authority that manages pin number de/allocations?

I believe the goal is to create something similar to the ptp subsystem.
The driver will need to configure the relationship during initialization and the
OS will manage the indexes.
 
> >> Additionally, how would things like external GPSs or 1pps be modeled? I
> >> guess the driver would know about such interface, and would expose it as
> >> a "pin". When the GPS signal locks, the driver starts reporting the pin
> >> in the RCLK set. Then it is possible to set up tracking of that pin.
> >
> > That won't be enabled before we get the DPLL subsystem ready.
> 
> It might prove challenging to retrofit an existing netdev-centric
> interface into a more generic model. It would be better to model this
> properly from day one, and OK, if we can carve out a subset of that
> model to implement now, and leave the rest for later, fine. But the
> current model does not strike me as having a natural migration path to
> something more generic. E.g. reporting the EEC state through the
> interfaces attached to that EEC... like, that will have to stay, even at
> a time when it is superseded by a better interface.

The recovered clock API will not change - only EEC_STATE is in question.
We can either redirect the call to the DPLL subsystem, or just add the DPLL IDX
into that call and return it.

> >> It seems to me it would be easier to understand, and to write user-space
> >> tools and drivers for, a model that has EEC as an explicit first-class
> >> object. That's where the EEC state naturally belongs, that's where the
> >> pin range naturally belongs. Netdevs should have a reference to EEC and
> >> pins, not present this information as if they own it. A first-class EEC
> >> would also allow to later figure out how to hook up PHC and EEC.
> >
> > We have the userspace tool, but can’t upstream it until we define
> > kernel Interfaces. It's paragraph 22 :(
> 
> I'm sure you do, presumably you test this somehow. Still, as a potential
> consumer of that interface, I will absolutely poke at it to figure out
> how to use it, what it lets me to do, and what won't work.

That's why I now want to enable only very basic functionality that will not go
away anytime soon: mapping between a port and a recovered clock (as in
take my clock and output it on the first PHY's recovered clock output)
and checking the state of the clock.

> BTW, what we've done in the past in a situation like this was, here's
> the current submission, here's a pointer to a GIT with more stuff we
> plan to send later on, here's a pointer to a GIT with the userspace
> stuff. I doubt anybody actually looks at that code, ain't nobody got
> time for that, but really there's no catch 22.

Unfortunately, the userspace part of it will be a part of linuxptp and we can't
upstream it partially before we get those basics defined here. More
advanced functionality will be grown organically, as I also have a limited
view of SyncE and am not an expert on switches.
Petr Machata Nov. 10, 2021, 10:27 a.m. UTC | #11
Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:

>> Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.
>
> The functionality needs to be there, but the message will be gone.

Gotcha.

>> >> > +RTM_SETRCLKSTATE
>> >> > +-----------------
>> >> > +Sets the redirection of the recovered clock for a given pin. This message
>> >> > +expects one attribute:
>> >> > +struct if_set_rclk_msg {
>> >> > +	__u32 ifindex; /* interface index */
>> >> > +	__u32 out_idx; /* output index (from a valid range) */
>> >> > +	__u32 flags; /* configuration flags */
>> >> > +};
>> >> > +
>> >> > +Supported flags are:
>> >> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
>> >> > +		     if clear - the output will be disabled.
>> >>
>> >> OK, so here I set up the tracking. ifindex tells me which EEC to
>> >> configure, out_idx is the pin to track, flags tell me whether to set up
>> >> the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
>> >> I somehow know that lane 2 has the best clock.
>> >
>> > It's bound to ifindex to know which PHY port you interact with. It
>> > has nothing to do with the EEC yet.
>>
>> It has in the sense that I'm configuring "TX CLK in", which leads
>> from EEC to the port.
>
> At this stage we only enable the recovered clock. EEC may or may not
> use it depending on many additional factors.
>
>> >> If the above is broadly correct, I've got some questions.
>> >>
>> >> First, what if more than one out_idx is set? What are drivers / HW
>> >> meant to do with this? What is the expected behavior?
>> >
>> > Expected behavior is deployment specific. You can use different phy
>> > recovered clock outputs to implement active/passive mode of clock
>> > failover.
>>
>> How? Which one is primary and which one is backup? I just have two
>> enabled pins...
>
> With this API you only have ports and pins and set up the redirection.

Wait, so how do I do failover? Which of the set pins is primary and
which is backup? Should the backup be sticky, i.e. do primary and backup
switch roles after primary goes into holdover? It looks like there are a
number of policy decisions that would be best served by a userspace
tool.

> The EEC part is out of picture and will be part of DPLL subsystem.

So about that. I don't think it's contentious to claim that you need to
communicate EEC state somehow. This proposal does that through a netdev
object. After the DPLL subsystem comes along, that will necessarily
provide the same information, and the netdev interface will become
redundant, but we will need to keep it around.

That is a strong indication that a first-class DPLL object should be
part of the initial submission.

>> Wouldn't failover be implementable in a userspace daemon? That would get
>> a notification from the system that holdover was entered, and can
>> reconfigure tracking to another pin based on arbitrary rules.
>
> Not necessarily. You can deploy the QL-disabled mode and rely on the
> local DPLL configuration to manage the switching. In that mode you're
> not passing the quality level downstream, so you only need to know if you
> have a source.

The daemon can reconfigure tracking to another pin based on _arbitrary_
rules. They don't have to involve QL in any way. Can be round-robin,
FIFO, random choice... IMO it's better than just enabling a bunch of
pins and not providing any guidance as to the policy.
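
To make the "arbitrary rules" point concrete, here is a minimal sketch of one such rule, round-robin, as a userspace daemon might apply it on a holdover notification. The function name and the idea of a flat list of candidate pins are assumptions made for illustration; nothing like this exists in the proposed series.

```c
#include <stddef.h>

/*
 * Hypothetical policy helper for a userspace failover daemon: given the
 * list of candidate pins and the pin currently tracked, return the next
 * candidate in round-robin order. Returns -1 if there are no candidates.
 */
static int next_pin_round_robin(const int *pins, size_t n, int current)
{
	if (n == 0)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (pins[i] == current)
			return pins[(i + 1) % n];
	}

	/* The current pin is not a candidate: start from the first one. */
	return pins[0];
}
```

The daemon would call this when it learns that holdover was entered and then reconfigure tracking to the returned pin; any other rule (FIFO, random, QL-based) could be swapped in behind the same signature.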

>> >> Second, as a user-space client, how do I know that if ports 1 and
>> >> 2 both report pin range [A; B], that they both actually share the
>> >> same underlying EEC? Is there some sort of coordination among the
>> >> drivers, such that each pin in the system has a unique ID?
>> >
>> > For now we don't, as we don't have EEC subsystem. But that can be
>> > solved by a config file temporarily.
>>
>> I think it would be better to model this properly from day one.
>
> I want to propose the simplest API that will work for the simplest
> device, follow that with the userspace tool that will help everyone
> understand what we need in the DPLL subsystem, otherwise it'll be hard
> to explain the requirements. The only change will be the addition of
> the DPLL index.

That would be fine if there were a migration path to the more complete
API. But once the DPLL object is introduced, even the APIs that are superseded
by the DPLL APIs will need to stay in as baggage.

>> >> Further, how do I actually know the mapping from ports to pins?
>> >> E.g. as a user, I might know my master is behind swp1. How do I
>> >> know what pins correspond to that port? As a user-space tool
>> >> author, how do I help users to do something like "eec set clock
>> >> eec0 track swp1"?
>> >
>> > That's why driver needs to be smart there and return indexes
>> > properly.
>>
>> What do you mean, properly? Up there you have RTM_GETRCLKRANGE that
>> just gives me a min and a max. Is there a policy about how to
>> correlate numbers in that range to... ifindices, netdevice names,
>> devlink port numbers, I don't know, something?
>
> The driver needs to know the underlying HW and report those ranges
> correctly.

How do I know _as a user_ though? As a user I want to be able to say
something like "eec set dev swp1 track dev swp2". But the "eec" tool has
no way of knowing how to set that up.

>> How do several drivers coordinate this numbering among themselves? Is
>> there a core kernel authority that manages pin number de/allocations?
>
> I believe the goal is to create something similar to the ptp
> subsystem. The driver will need to configure the relationship during
> initialization and the OS will manage the indexes.

Can you point at the index management code, please?

>> >> Additionally, how would things like external GPSs or 1pps be
>> >> modeled? I guess the driver would know about such interface, and
>> >> would expose it as a "pin". When the GPS signal locks, the driver
>> >> starts reporting the pin in the RCLK set. Then it is possible to
>> >> set up tracking of that pin.
>> >
>> > That won't be enabled before we get the DPLL subsystem ready.
>>
>> It might prove challenging to retrofit an existing netdev-centric
>> interface into a more generic model. It would be better to model this
>> properly from day one, and OK, if we can carve out a subset of that
>> model to implement now, and leave the rest for later, fine. But the
>> current model does not strike me as having a natural migration path to
>> something more generic. E.g. reporting the EEC state through the
>> interfaces attached to that EEC... like, that will have to stay, even at
>> a time when it is superseded by a better interface.
>
> The recovered clock API will not change - only EEC_STATE is in
> question. We can either redirect the call to the DPLL subsystem, or
> just add the DPLL IDX Into that call and return it.

It would be better to have a first-class DPLL object, however vestigial,
in the initial submission.

>> >> It seems to me it would be easier to understand, and to write
>> >> user-space tools and drivers for, a model that has EEC as an
>> >> explicit first-class object. That's where the EEC state naturally
>> >> belongs, that's where the pin range naturally belongs. Netdevs
>> >> should have a reference to EEC and pins, not present this
>> >> information as if they own it. A first-class EEC would also allow
>> >> to later figure out how to hook up PHC and EEC.
>> >
>> > We have the userspace tool, but can’t upstream it until we define
>> > kernel Interfaces. It's paragraph 22 :(
>>
>> I'm sure you do, presumably you test this somehow. Still, as a
>> potential consumer of that interface, I will absolutely poke at it to
>> figure out how to use it, what it lets me to do, and what won't work.
>
> That's why now I want to enable very basic functionality that will not
> go away anytime soon.

The issue is that the APIs won't go away any time soon either. That's
why people object to your proposal so strongly. Because we won't be able
to fix this later, and we _already_ see shortcomings now.

> Mapping between port and recovered clock (as in take my clock and
> output on the first PHY's recovered clock output) and checking the
> state of the clock.

Where is that mapping? I see a per-netdev call for a list of pins that
carry RCLK, and the state as well. I don't see a way to distinguish
which is which in any way.

>> BTW, what we've done in the past in a situation like this was, here's
>> the current submission, here's a pointer to a GIT with more stuff we
>> plan to send later on, here's a pointer to a GIT with the userspace
>> stuff. I doubt anybody actually looks at that code, ain't nobody got
>> time for that, but really there's no catch 22.
>
> Unfortunately, the userspace of it will be a part of linuxptp and we
> can't upstream it partially before we get those basics defined here.

Just push it to github or wherever?

> More advanced functionality will be grown organically, as I also have
> a limited view of SyncE and am not expert on switches.

We are growing it organically _right now_. I am strongly advocating an
organic growth in the direction of a first-class DPLL object.
Machnikowski, Maciej Nov. 10, 2021, 11:19 a.m. UTC | #12
> -----Original Message-----
> From: Petr Machata <petrm@nvidia.com>
> Sent: Wednesday, November 10, 2021 11:27 AM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> 
> Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:
> 
> >> Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.
> >
> > The functionality needs to be there, but the message will be gone.
> 
> Gotcha.
> 
> >> >> > +RTM_SETRCLKSTATE
> >> >> > +-----------------
> >> >> > +Sets the redirection of the recovered clock for a given pin. This message
> >> >> > +expects one attribute:
> >> >> > +struct if_set_rclk_msg {
> >> >> > +	__u32 ifindex; /* interface index */
> >> >> > +	__u32 out_idx; /* output index (from a valid range) */
> >> >> > +	__u32 flags; /* configuration flags */
> >> >> > +};
> >> >> > +
> >> >> > +Supported flags are:
> >> >> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
> >> >> > +		     if clear - the output will be disabled.
> >> >>
> >> >> OK, so here I set up the tracking. ifindex tells me which EEC to
> >> >> configure, out_idx is the pin to track, flags tell me whether to set up
> >> >> the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
> >> >> I somehow know that lane 2 has the best clock.
> >> >
> >> > It's bound to ifindex to know which PHY port you interact with. It
> >> > has nothing to do with the EEC yet.
> >>
> >> It has in the sense that I'm configuring "TX CLK in", which leads
> >> from EEC to the port.
> >
> > At this stage we only enable the recovered clock. EEC may or may not
> > use it depending on many additional factors.
> >
> >> >> If the above is broadly correct, I've got some questions.
> >> >>
> >> >> First, what if more than one out_idx is set? What are drivers / HW
> >> >> meant to do with this? What is the expected behavior?
> >> >
> >> > Expected behavior is deployment specific. You can use different phy
> >> > recovered clock outputs to implement active/passive mode of clock
> >> > failover.
> >>
> >> How? Which one is primary and which one is backup? I just have two
> >> enabled pins...
> >
> > With this API you only have ports and pins and set up the redirection.
> 
> Wait, so how do I do failover? Which of the set pins in primary and
> which is backup? Should the backup be sticky, i.e. do primary and backup
> switch roles after primary goes into holdover? It looks like there are a
> number of policy decisions that would be best served by a userspace
> tool.

The clock priority is configured in the SEC/EEC/DPLL. The recovered clock API
only configures the redirections (i.e. which clocks will be available to the
DPLL as references). In some DPLLs the fallback is automatic as long as a
secondary clock is available when the primary goes away. A userspace tool
can preconfigure that before the failure occurs.
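
As a rough illustration of that split, the sketch below models a DPLL that picks its reference by priority among the recovered clocks that have been redirected to it and falls back automatically when the primary disappears. The structure and function are hypothetical and exist only for this example; real DPLLs implement the selection in hardware or firmware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one DPLL reference input; not a kernel structure. */
struct dpll_ref {
	int  pin;       /* recovered clock output feeding this reference */
	int  priority;  /* lower value = preferred, configured in the DPLL */
	bool enabled;   /* redirection enabled through the recovered clock API */
	bool valid;     /* signal currently present on the pin */
};

/* Return the pin of the best usable reference, or -1 if none (holdover). */
static int dpll_select_ref(const struct dpll_ref *refs, size_t n)
{
	const struct dpll_ref *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!refs[i].enabled || !refs[i].valid)
			continue;
		if (!best || refs[i].priority < best->priority)
			best = &refs[i];
	}
	return best ? best->pin : -1;
}
```

In this model the recovered clock API only ever toggles `enabled`; `priority` belongs to the DPLL configuration.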

> > The EEC part is out of picture and will be part of DPLL subsystem.
> 
> So about that. I don't think it's contentious to claim that you need to
> communicate EEC state somehow. This proposal does that through a netdev
> object. After the DPLL subsystem comes along, that will necessarily
> provide the same information, and the netdev interface will become
> redundant, but we will need to keep it around.
> 
> That is a strong indication that a first-class DPLL object should be
> part of the initial submission.

That's why only a bare minimum is proposed in this patch - reading the state
and which signal is used as a reference.

> >> Wouldn't failover be implementable in a userspace daemon? That would
> >> get a notification from the system that holdover was entered, and can
> >> reconfigure tracking to another pin based on arbitrary rules.
> >
> > Not necessarily. You can deploy the QL-disabled mode and rely on the
> > local DPLL configuration to manage the switching. In that mode you're
> > not passing the quality level downstream, so you only need to know if you
> > have a source.
> 
> The daemon can reconfigure tracking to another pin based on _arbitrary_
> rules. They don't have to involve QL in any way. Can be round-robin,
> FIFO, random choice... IMO it's better than just enabling a bunch of
> pins and not providing any guidance as to the policy.

This is how the API works now. You can enable the clock on output N with
RTM_SETRCLKSTATE. The selection can't be random/round-robin, but it is
deployment specific. If in your setup you only have one link to a synchronous
network, you'll always use it as your frequency reference.
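
A minimal sketch of the payload for such a request, using the struct layout quoted above. The numeric value of SET_RCLK_FLAGS_ENA is not shown anywhere in this thread, so bit 0 below is only a placeholder, and rclk_request() is a hypothetical helper, not part of the series.

```c
#include <stdint.h>
#include <string.h>

/* Layout as quoted in this thread, with __u32 spelled as uint32_t so the
 * example builds in plain userspace without kernel headers. */
struct if_set_rclk_msg {
	uint32_t ifindex;  /* interface index */
	uint32_t out_idx;  /* output index (from a valid range) */
	uint32_t flags;    /* configuration flags */
};

/* ASSUMPTION: placeholder value; the real definition lives in the patch. */
#define SET_RCLK_FLAGS_ENA (1U << 0)

/* Hypothetical helper: build the payload that enables (or disables) the
 * recovered clock of port `ifindex` on output `out_idx`. */
static struct if_set_rclk_msg rclk_request(uint32_t ifindex, uint32_t out_idx,
					   int enable)
{
	struct if_set_rclk_msg msg;

	memset(&msg, 0, sizeof(msg));
	msg.ifindex = ifindex;
	msg.out_idx = out_idx;
	if (enable)
		msg.flags |= SET_RCLK_FLAGS_ENA;
	return msg;
}
```

The struct would then be carried in an RTM_SETRCLKSTATE message; the netlink framing is left out here.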

> >> >> Second, as a user-space client, how do I know that if ports 1 and
> >> >> 2 both report pin range [A; B], that they both actually share the
> >> >> same underlying EEC? Is there some sort of coordination among the
> >> >> drivers, such that each pin in the system has a unique ID?
> >> >
> >> > For now we don't, as we don't have EEC subsystem. But that can be
> >> > solved by a config file temporarily.
> >>
> >> I think it would be better to model this properly from day one.
> >
> > I want to propose the simplest API that will work for the simplest
> > device, follow that with the userspace tool that will help everyone
> > understand what we need in the DPLL subsystem, otherwise it'll be hard
> > to explain the requirements. The only change will be the addition of
> > the DPLL index.
> 
> That would be fine if there were a migration path to the more complete
> API. But as DPLL object is introduced, even the APIs that are superseded
> by the DPLL APIs will need to stay in as a baggage.

The migration paths are:
A) when the DPLL API is there check if the DPLL object is linked to the given netdev
     in the rtnl_eec_state_get - if it is - get the state from the DPLL object there
or
B) return the DPLL index linked to the given netdev and fail the rtnl_eec_state_get
     so that the userspace tool will need to switch to the new API

Also, rtnl_eec_state_get won't become obsolete in all cases once we get the DPLL
subsystem, as there are solutions where the SyncE DPLL is embedded in the PHY,
in which case rtnl_eec_state_get will return all needed information without
the need to create a separate DPLL object.

The DPLL object makes sense for advanced SyncE DPLLs that provide additional
functionality, such as external reference/output pins.
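
Option A could look roughly like the sketch below: the state getter prefers a linked DPLL object and otherwise falls back to the driver callback, which also covers the PHY-embedded case. Every type, enum value and function name here is a hypothetical stand-in chosen for the example, not the kernel's.

```c
#include <stddef.h>

/* Hypothetical state values, for illustration only. */
enum eec_state { EEC_INVALID, EEC_FREERUN, EEC_LOCKED, EEC_HOLDOVER };

/* Hypothetical stand-in for a future first-class DPLL object. */
struct dpll_obj {
	enum eec_state state;
};

/* Hypothetical stand-in for the netdev side of the API. */
struct netdev_model {
	struct dpll_obj *dpll;                  /* NULL if no DPLL object is linked */
	enum eec_state (*drv_get_state)(void);  /* driver callback, netdev-only API */
};

/* Example driver callback for a PHY with an embedded SyncE DPLL. */
static enum eec_state phy_embedded_state(void)
{
	return EEC_LOCKED;
}

/* Option A: use the linked DPLL object if present, else ask the driver. */
static enum eec_state eec_state_get(const struct netdev_model *dev)
{
	if (dev->dpll)
		return dev->dpll->state;
	if (dev->drv_get_state)
		return dev->drv_get_state();
	return EEC_INVALID;
}
```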

> >> >> Further, how do I actually know the mapping from ports to pins?
> >> >> E.g. as a user, I might know my master is behind swp1. How do I
> >> >> know what pins correspond to that port? As a user-space tool
> >> >> author, how do I help users to do something like "eec set clock
> >> >> eec0 track swp1"?
> >> >
> >> > That's why driver needs to be smart there and return indexes
> >> > properly.
> >>
> >> What do you mean, properly? Up there you have RTM_GETRCLKRANGE that
> >> just gives me a min and a max. Is there a policy about how to
> >> correlate numbers in that range to... ifindices, netdevice names,
> >> devlink port numbers, I don't know, something?
> >
> > The driver needs to know the underlying HW and report those ranges
> > correctly.
> 
> How do I know _as a user_ though? As a user I want to be able to say
> something like "eec set dev swp1 track dev swp2". But the "eec" tool has
> no way of knowing how to set that up.

There's no such flexibility. It's more like timing pins in the PTP subsystem - we
expose the API to control them, but it's up to the final user to decide how 
to use them.

If we index the PHY outputs in the same way as the DPLL subsystem will see
them in the references part it should be sufficient to make sense out of them.
 
> >> How do several drivers coordinate this numbering among themselves? Is
> >> there a core kernel authority that manages pin number de/allocations?
> >
> > I believe the goal is to create something similar to the ptp
> > subsystem. The driver will need to configure the relationship during
> > initialization and the OS will manage the indexes.
> 
> Can you point at the index management code, please?

Look for the ptp_clock_register function in the kernel - it owns the registration
of the ptp clock to the subsystem.

> >> >> Additionally, how would things like external GPSs or 1pps be
> >> >> modeled? I guess the driver would know about such interface, and
> >> >> would expose it as a "pin". When the GPS signal locks, the driver
> >> >> starts reporting the pin in the RCLK set. Then it is possible to
> >> >> set up tracking of that pin.
> >> >
> >> > That won't be enabled before we get the DPLL subsystem ready.
> >>
> >> It might prove challenging to retrofit an existing netdev-centric
> >> interface into a more generic model. It would be better to model this
> >> properly from day one, and OK, if we can carve out a subset of that
> >> model to implement now, and leave the rest for later, fine. But the
> >> current model does not strike me as having a natural migration path to
> >> something more generic. E.g. reporting the EEC state through the
> >> interfaces attached to that EEC... like, that will have to stay, even at
> >> a time when it is superseded by a better interface.
> >
> > The recovered clock API will not change - only EEC_STATE is in
> > question. We can either redirect the call to the DPLL subsystem, or
> > just add the DPLL IDX Into that call and return it.
> 
> It would be better to have a first-class DPLL object, however vestigial,
> in the initial submission.

As stated above - the DPLL subsystem won't render EEC state useless.

> >> >> It seems to me it would be easier to understand, and to write
> >> >> user-space tools and drivers for, a model that has EEC as an
> >> >> explicit first-class object. That's where the EEC state naturally
> >> >> belongs, that's where the pin range naturally belongs. Netdevs
> >> >> should have a reference to EEC and pins, not present this
> >> >> information as if they own it. A first-class EEC would also allow
> >> >> to later figure out how to hook up PHC and EEC.
> >> >
> >> > We have the userspace tool, but can’t upstream it until we define
> >> > kernel Interfaces. It's paragraph 22 :(
> >>
> >> I'm sure you do, presumably you test this somehow. Still, as a
> >> potential consumer of that interface, I will absolutely poke at it to
> >> figure out how to use it, what it lets me to do, and what won't work.
> >
> > That's why now I want to enable very basic functionality that will not
> > go away anytime soon.
> 
> The issue is that the APIs won't go away any time soon either. That's
> why people object to your proposal so strongly. Because we won't be able
> to fix this later, and we _already_ see shortcomings now.
> 
> > Mapping between port and recovered clock (as in take my clock and
> > output on the first PHY's recovered clock output) and checking the
> > state of the clock.
> 
> Where is that mapping? I see a per-netdev call for a list of pins that
> carry RCLK, and the state as well. I don't see a way to distinguish
> which is which in any way.
> 
> >> BTW, what we've done in the past in a situation like this was, here's
> >> the current submission, here's a pointer to a GIT with more stuff we
> >> plan to send later on, here's a pointer to a GIT with the userspace
> >> stuff. I doubt anybody actually looks at that code, ain't nobody got
> >> time for that, but really there's no catch 22.
> >
> > Unfortunately, the userspace of it will be a part of linuxptp and we
> > can't upstream it partially before we get those basics defined here.
> 
> Just push it to github or whereever?
> 
> > More advanced functionality will be grown organically, as I also have
> > a limited view of SyncE and am not expert on switches.
> 
> We are growing it organically _right now_. I am strongly advocating an
> organic growth in the direction of a first-class DPLL object.

If it helps - I can separate the PHY RCLK control patches and leave EEC state
under review.
Petr Machata Nov. 10, 2021, 3:15 p.m. UTC | #13
>> >> >> First, what if more than one out_idx is set? What are drivers / HW
>> >> >> meant to do with this? What is the expected behavior?
>> >> >
>> >> > Expected behavior is deployment specific. You can use different phy
>> >> > recovered clock outputs to implement active/passive mode of clock
>> >> > failover.
>> >>
>> >> How? Which one is primary and which one is backup? I just have two
>> >> enabled pins...
>> >
>> > With this API you only have ports and pins and set up the redirection.
>> 
>> Wait, so how do I do failover? Which of the set pins in primary and
>> which is backup? Should the backup be sticky, i.e. do primary and backup
>> switch roles after primary goes into holdover? It looks like there are a
>> number of policy decisions that would be best served by a userspace
>> tool.
>
> The clock priority is configured in the SEC/EEC/DPLL. Recovered clock API
> only configures the redirections (aka. Which clocks will be available to the
> DPLL as references). In some DPLLs the fallback is automatic as long as
> secondary clock is available when the primary goes away. Userspace tool
> can preconfigure that before the failure occurs.

OK, I see. It looks like this priority list implies which pins need to
be enabled. That makes the netdev interface redundant.

>> > The EEC part is out of picture and will be part of DPLL subsystem.
>> 
>> So about that. I don't think it's contentious to claim that you need to
>> communicate EEC state somehow. This proposal does that through a netdev
>> object. After the DPLL subsystem comes along, that will necessarily
>> provide the same information, and the netdev interface will become
>> redundant, but we will need to keep it around.
>> 
>> That is a strong indication that a first-class DPLL object should be
>> part of the initial submission.
>
> That's why only a bare minimum is proposed in this patch - reading the state
> and which signal is used as a reference.

The proposal includes APIs that we know _right now_ will be historical
baggage by the time the DPLL object is added. That does not constitute a
bare minimum.

>> >> >> Second, as a user-space client, how do I know that if ports 1 and
>> >> >> 2 both report pin range [A; B], that they both actually share the
>> >> >> same underlying EEC? Is there some sort of coordination among the
>> >> >> drivers, such that each pin in the system has a unique ID?
>> >> >
>> >> > For now we don't, as we don't have EEC subsystem. But that can be
>> >> > solved by a config file temporarily.
>> >>
>> >> I think it would be better to model this properly from day one.
>> >
>> > I want to propose the simplest API that will work for the simplest
>> > device, follow that with the userspace tool that will help everyone
>> > understand what we need in the DPLL subsystem, otherwise it'll be hard
>> > to explain the requirements. The only change will be the addition of
>> > the DPLL index.
>> 
>> That would be fine if there were a migration path to the more complete
>> API. But as DPLL object is introduced, even the APIs that are superseded
>> by the DPLL APIs will need to stay in as a baggage.
>
> The migration paths are:
> A) when the DPLL API is there check if the DPLL object is linked to the given netdev
>      in the rtnl_eec_state_get - if it is - get the state from the DPLL object there
> or
> B) return the DPLL index linked to the given netdev and fail the rtnl_eec_state_get
>      so that the userspace tool will need to switch to the new API

Well, we call B) an API breakage, and it won't fly. That API is there to
stay, and operate like it operates now.

That leaves us with A), where the API becomes a redundant wart that we
can never get rid of.

> Also the rtnl_eec_state_get won't get obsolete in all cases once we get the DPLL
> subsystem, as there are solutions where SyncE DPLL is embedded in the PHY
> in which case the rtnl_eec_state_get will return all needed information without
> the need to create a separate DPLL object.

So the NIC or PHY driver will register the object. Easy peasy.

Allowing the interface to go through a netdev sometimes, and through a
dedicated object other times, just makes everybody's life harder. It's
two cases that need to be handled in user documentation, in scripts, in
UAPI clients, when reviewing kernel code.

This is a "hysterical raisins" sort of baggage, except we see up front
that's where it goes.

> The DPLL object makes sense for advanced SyncE DPLLs that provide
> additional functionality, such as external reference/output pins.

That does not need to be the case.

>> >> >> Further, how do I actually know the mapping from ports to pins?
>> >> >> E.g. as a user, I might know my master is behind swp1. How do I
>> >> >> know what pins correspond to that port? As a user-space tool
>> >> >> author, how do I help users to do something like "eec set clock
>> >> >> eec0 track swp1"?
>> >> >
>> >> > That's why driver needs to be smart there and return indexes
>> >> > properly.
>> >>
>> >> What do you mean, properly? Up there you have RTM_GETRCLKRANGE that
>> >> just gives me a min and a max. Is there a policy about how to
>> >> correlate numbers in that range to... ifindices, netdevice names,
>> >> devlink port numbers, I don't know, something?
>> >
>> > The driver needs to know the underlying HW and report those ranges
>> > correctly.
>> 
>> How do I know _as a user_ though? As a user I want to be able to say
>> something like "eec set dev swp1 track dev swp2". But the "eec" tool has
>> no way of knowing how to set that up.
>
> There's no such flexibility. It's more like timing pins in the PTP subsystem - we
> expose the API to control them, but it's up to the final user to decide how 
> to use them.

As a user, say I know the signal coming from swp1 is frequency-locked.
How can I instruct the switch ASIC to propagate that signal to the other
ports? Well, I go through swp2..swpN, and issue RTM_SETRCLKSTATE or
whatever, with flags indicating I set up tracking, and pin number...
what exactly? How do I know which pin carries clock recovered from swp1?

> If we index the PHY outputs in the same way as the DPLL subsystem will
> see them in the references part it should be sufficient to make sense
> out of them.

What do you mean by indexing PHY outputs? Where are those indexed?

>> >> How do several drivers coordinate this numbering among themselves?
>> >> Is there a core kernel authority that manages pin number
>> >> de/allocations?
>> >
>> > I believe the goal is to create something similar to the ptp
>> > subsystem. The driver will need to configure the relationship
>> > during initialization and the OS will manage the indexes.
>> 
>> Can you point at the index management code, please?
>
> Look for the ptp_clock_register function in the kernel - it owns the
> registration of the ptp clock to the subsystem.

But I'm talking about the SyncE code.

>> >> >> Additionally, how would things like external GPSs or 1pps be
>> >> >> modeled? I guess the driver would know about such interface, and
>> >> >> would expose it as a "pin". When the GPS signal locks, the driver
>> >> >> starts reporting the pin in the RCLK set. Then it is possible to
>> >> >> set up tracking of that pin.
>> >> >
>> >> > That won't be enabled before we get the DPLL subsystem ready.
>> >>
>> >> It might prove challenging to retrofit an existing netdev-centric
>> >> interface into a more generic model. It would be better to model this
>> >> properly from day one, and OK, if we can carve out a subset of that
>> >> model to implement now, and leave the rest for later, fine. But the
>> >> current model does not strike me as having a natural migration path to
>> >> something more generic. E.g. reporting the EEC state through the
>> >> interfaces attached to that EEC... like, that will have to stay, even at
>> >> a time when it is superseded by a better interface.
>> >
>> > The recovered clock API will not change - only EEC_STATE is in
>> > question. We can either redirect the call to the DPLL subsystem, or
>> > just add the DPLL IDX Into that call and return it.
>> 
>> It would be better to have a first-class DPLL object, however vestigial,
>> in the initial submission.
>
> As stated above - DPLL subsystem won't render EEC state useless.

Of course not, the state is still important. But it will render the API
useless, and worse, an extra baggage everyone needs to know about and
support.

>> > More advanced functionality will be grown organically, as I also have
>> > a limited view of SyncE and am not expert on switches.
>> 
>> We are growing it organically _right now_. I am strongly advocating an
>> organic growth in the direction of a first-class DPLL object.
>
> If it helps - I can separate the PHY RCLK control patches and leave EEC state
> under review

Not sure what you mean by that.
Machnikowski, Maciej Nov. 10, 2021, 3:50 p.m. UTC | #14
> -----Original Message-----
> From: Petr Machata <petrm@nvidia.com>
> Sent: Wednesday, November 10, 2021 4:15 PM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> 
> >> >> >> First, what if more than one out_idx is set? What are drivers / HW
> >> >> >> meant to do with this? What is the expected behavior?
> >> >> >
> >> >> > Expected behavior is deployment specific. You can use different phy
> >> >> > recovered clock outputs to implement active/passive mode of clock
> >> >> > failover.
> >> >>
> >> >> How? Which one is primary and which one is backup? I just have two
> >> >> enabled pins...
> >> >
> >> > With this API you only have ports and pins and set up the redirection.
> >>
> >> Wait, so how do I do failover? Which of the set pins is primary and
> >> which is backup? Should the backup be sticky, i.e. do primary and backup
> >> switch roles after primary goes into holdover? It looks like there are a
> >> number of policy decisions that would be best served by a userspace
> >> tool.
> >
> > The clock priority is configured in the SEC/EEC/DPLL. Recovered clock API
> > only configures the redirections (aka. Which clocks will be available to the
> > DPLL as references). In some DPLLs the fallback is automatic as long as
> > secondary clock is available when the primary goes away. Userspace tool
> > can preconfigure that before the failure occurs.
> 
> OK, I see. It looks like this priority list implies which pins need to
> be enabled. That makes the netdev interface redundant.

Netdev owns the PHY, so it needs to enable/disable clock from a given
port/lane - other than that it's the EEC's task. Technically - those subsystems
are separate.

> >> > The EEC part is out of picture and will be part of DPLL subsystem.
> >>
> >> So about that. I don't think it's contentious to claim that you need to
> >> communicate EEC state somehow. This proposal does that through a
> netdev
> >> object. After the DPLL subsystem comes along, that will necessarily
> >> provide the same information, and the netdev interface will become
> >> redundant, but we will need to keep it around.
> >>
> >> That is a strong indication that a first-class DPLL object should be
> >> part of the initial submission.
> >
> > That's why only a bare minimum is proposed in this patch - reading the
> state
> > and which signal is used as a reference.
> 
> The proposal includes APIs that we know _right now_ will be historical
> baggage by the time the DPLL object is added. That does not constitute
> bare minimum.
> 
> >> >> >> Second, as a user-space client, how do I know that if ports 1 and
> >> >> >> 2 both report pin range [A; B], that they both actually share the
> >> >> >> same underlying EEC? Is there some sort of coordination among the
> >> >> >> drivers, such that each pin in the system has a unique ID?
> >> >> >
> >> >> > For now we don't, as we don't have EEC subsystem. But that can be
> >> >> > solved by a config file temporarily.
> >> >>
> >> >> I think it would be better to model this properly from day one.
> >> >
> >> > I want to propose the simplest API that will work for the simplest
> >> > device, follow that with the userspace tool that will help everyone
> >> > understand what we need in the DPLL subsystem, otherwise it'll be hard
> >> > to explain the requirements. The only change will be the addition of
> >> > the DPLL index.
> >>
> >> That would be fine if there were a migration path to the more complete
> >> API. But as DPLL object is introduced, even the APIs that are superseded
> >> by the DPLL APIs will need to stay in as a baggage.
> >
> > The migration paths are:
> > A) when the DPLL API is there check if the DPLL object is linked to the given
> netdev
> >      in the rtnl_eec_state_get - if it is - get the state from the DPLL object
> there
> > or
> > B) return the DPLL index linked to the given netdev and fail the
> rtnl_eec_state_get
> >      so that the userspace tool will need to switch to the new API
> 
> Well, we call B) an API breakage, and it won't fly. That API is there to
> stay, and operate like it operates now.
> 
> That leaves us with A), where the API becomes a redundant wart that we
> can never get rid of.
> 
> > Also the rtnl_eec_state_get won't get obsolete in all cases once we get the
> DPLL
> > subsystem, as there are solutions where SyncE DPLL is embedded in the
> PHY
> > in which case the rtnl_eec_state_get will return all needed information
> without
> > the need to create a separate DPLL object.
> 
> So the NIC or PHY driver will register the object. Easy peasy.
> 
> Allowing the interface to go through a netdev sometimes, and through a
> dedicated object other times, just makes everybody's life harder. It's
> two cases that need to be handled in user documentation, in scripts, in
> UAPI clients, when reviewing kernel code.
> 
> This is a "hysterical raisins" sort of baggage, except we see up front
> that's where it goes.
> 
> > The DPLL object makes sense for advanced SyncE DPLLs that provide
> > additional functionality, such as external reference/output pins.
> 
> That does not need to be the case.
> 
> >> >> >> Further, how do I actually know the mapping from ports to pins?
> >> >> >> E.g. as a user, I might know my master is behind swp1. How do I
> >> >> >> know what pins correspond to that port? As a user-space tool
> >> >> >> author, how do I help users to do something like "eec set clock
> >> >> >> eec0 track swp1"?
> >> >> >
> >> >> > That's why driver needs to be smart there and return indexes
> >> >> > properly.
> >> >>
> >> >> What do you mean, properly? Up there you have
> RTM_GETRCLKRANGE
> >> that
> >> >> just gives me a min and a max. Is there a policy about how to
> >> >> correlate numbers in that range to... ifindices, netdevice names,
> >> >> devlink port numbers, I don't know, something?
> >> >
> >> > The driver needs to know the underlying HW and report those ranges
> >> > correctly.
> >>
> >> How do I know _as a user_ though? As a user I want to be able to say
> >> something like "eec set dev swp1 track dev swp2". But the "eec" tool has
> >> no way of knowing how to set that up.
> >
> > There's no such flexibility. It's more like timing pins in the PTP subsystem -
> we
> > expose the API to control them, but it's up to the final user to decide how
> > to use them.
> 
> As a user, say I know the signal coming from swp1 is frequency-locked.
> How can I instruct the switch ASIC to propagate that signal to the other
> ports? Well, I go through swp2..swpN, and issue RTM_SETRCLKSTATE or
> whatever, with flags indicating I set up tracking, and pin number...
> what exactly? How do I know which pin carries clock recovered from swp1?

You send the RTM_SETRCLKSTATE to the port that has the best reference
clock available.
If you want to know which pin carries the clock you simply send the
RTM_GETRCLKSTATE and it'll return the list of possible outputs with the flags
saying which of them are enabled (see the newer revision)

> > If we index the PHY outputs in the same way as the DPLL subsystem will
> > see them in the references part it should be sufficient to make sense
> > out of them.
> 
> What do you mean by indexing PHY outputs? Where are those indexed?

That's what ndo_get_rclk_range does. It returns allowed range of pins for a given
netdev.
 
> >> >> How do several drivers coordinate this numbering among themselves?
> >> >> Is there a core kernel authority that manages pin number
> >> >> de/allocations?
> >> >
> >> > I believe the goal is to create something similar to the ptp
> >> > subsystem. The driver will need to configure the relationship
> >> > during initialization and the OS will manage the indexes.
> >>
> >> Can you point at the index management code, please?
> >
> > Look for the ptp_clock_register function in the kernel - it owns the
> > registration of the ptp clock to the subsystem.
> 
> But I'm talking about the SyncE code.

PHY pins are indexed as the driver wishes, as they are board specific. 
You can index PHY pins 1,2,3 or 3,4,5 - whichever makes sense for 
a given application, as they are local for a netdev.
I would suggest returning numbers that are tightly coupled to the EEC
when that's known to make guessing game easier, but that's not mandatory.

> >> >> >> Additionally, how would things like external GPSs or 1pps be
> >> >> >> modeled? I guess the driver would know about such interface, and
> >> >> >> would expose it as a "pin". When the GPS signal locks, the driver
> >> >> >> starts reporting the pin in the RCLK set. Then it is possible to
> >> >> >> set up tracking of that pin.
> >> >> >
> >> >> > That won't be enabled before we get the DPLL subsystem ready.
> >> >>
> >> >> It might prove challenging to retrofit an existing netdev-centric
> >> >> interface into a more generic model. It would be better to model this
> >> >> properly from day one, and OK, if we can carve out a subset of that
> >> >> model to implement now, and leave the rest for later, fine. But the
> >> >> current model does not strike me as having a natural migration path to
> >> >> something more generic. E.g. reporting the EEC state through the
> >> >> interfaces attached to that EEC... like, that will have to stay, even at
> >> >> a time when it is superseded by a better interface.
> >> >
> >> > The recovered clock API will not change - only EEC_STATE is in
> >> > question. We can either redirect the call to the DPLL subsystem, or
> >> > just add the DPLL IDX Into that call and return it.
> >>
> >> It would be better to have a first-class DPLL object, however vestigial,
> >> in the initial submission.
> >
> > As stated above - DPLL subsystem won't render EEC state useless.
> 
> Of course not, the state is still important. But it will render the API
> useless, and worse, an extra baggage everyone needs to know about and
> support.
> 
> >> > More advanced functionality will be grown organically, as I also have
> >> > a limited view of SyncE and am not expert on switches.
> >>
> >> We are growing it organically _right now_. I am strongly advocating an
> >> organic growth in the direction of a first-class DPLL object.
> >
> > If it helps - I can separate the PHY RCLK control patches and leave EEC state
> > under review
> 
> Not sure what you mean by that.

Commit RTM_GETRCLKSTATE and RTM_SETRCLKSTATE now, wait with 
RTM_GETEECSTATE  till we clarify further direction of the DPLL subsystem
Petr Machata Nov. 10, 2021, 9:05 p.m. UTC | #15
Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:

>> >> Wait, so how do I do failover? Which of the set pins is primary and
>> >> which is backup? Should the backup be sticky, i.e. do primary and backup
>> >> switch roles after primary goes into holdover? It looks like there are a
>> >> number of policy decisions that would be best served by a userspace
>> >> tool.
>> >
>> > The clock priority is configured in the SEC/EEC/DPLL. Recovered clock API
>> > only configures the redirections (aka. Which clocks will be available to the
>> > DPLL as references). In some DPLLs the fallback is automatic as long as
>> > secondary clock is available when the primary goes away. Userspace tool
>> > can preconfigure that before the failure occurs.
>> 
>> OK, I see. It looks like this priority list implies which pins need to
>> be enabled. That makes the netdev interface redundant.
>
> Netdev owns the PHY, so it needs to enable/disable clock from a given
> port/lane - other than that it's the EEC's task. Technically - those subsystems
> are separate.

So why is the UAPI conflating the two?

>> As a user, say I know the signal coming from swp1 is frequency-locked.
>> How can I instruct the switch ASIC to propagate that signal to the other
>> ports? Well, I go through swp2..swpN, and issue RTM_SETRCLKSTATE or
>> whatever, with flags indicating I set up tracking, and pin number...
>> what exactly? How do I know which pin carries clock recovered from swp1?
>
> You send the RTM_SETRCLKSTATE to the port that has the best reference
> clock available.
> If you want to know which pin carries the clock you simply send the
> RTM_GETRCLKSTATE and it'll return the list of possible outputs with the flags
> saying which of them are enabled (see the newer revision)

As a user I would really prefer to have a pin reference reported
somewhere at the netdev / phy / somewhere. Similarly to how a netdev can
reference a PHC. But whatever, I won't split hairs over this, this is
actually one aspect that is easy to add later.

>> >> > More advanced functionality will be grown organically, as I also have
>> >> > a limited view of SyncE and am not expert on switches.
>> >>
>> >> We are growing it organically _right now_. I am strongly advocating an
>> >> organic growth in the direction of a first-class DPLL object.
>> >
>> > If it helps - I can separate the PHY RCLK control patches and leave EEC state
>> > under review
>> 
>> Not sure what you mean by that.
>
> Commit RTM_GETRCLKSTATE and RTM_SETRCLKSTATE now, wait with 
> RTM_GETEECSTATE  till we clarify further direction of the DPLL subsystem

It's not just state though. There is another oddity that I am not sure
is intentional. The proposed UAPI allows me to set up fairly general
frequency bridging. In a device with a bunch of ports, it would allow me
to set up, say, swp1 to track RCLK from swp2, then swp3 from swp4, etc.
But what will be the EEC state in that case?
Machnikowski, Maciej Nov. 15, 2021, 10:12 a.m. UTC | #16
> -----Original Message-----
> From: Petr Machata <petrm@nvidia.com>
> Sent: Wednesday, November 10, 2021 10:06 PM
> To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
> Cc: Petr Machata <petrm@nvidia.com>; netdev@vger.kernel.org; intel-
> wired-lan@lists.osuosl.org; richardcochran@gmail.com; abyagowi@fb.com;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; davem@davemloft.net;
> kuba@kernel.org; linux-kselftest@vger.kernel.org; idosch@idosch.org;
> mkubecek@suse.cz; saeed@kernel.org; michael.chan@broadcom.com
> Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE
> interfaces
> 
> 
> Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:
> 
> >> >> Wait, so how do I do failover? Which of the set pins is primary and
> >> >> which is backup? Should the backup be sticky, i.e. do primary and
> backup
> >> >> switch roles after primary goes into holdover? It looks like there are a
> >> >> number of policy decisions that would be best served by a userspace
> >> >> tool.
> >> >
> >> > The clock priority is configured in the SEC/EEC/DPLL. Recovered clock API
> >> > only configures the redirections (aka. Which clocks will be available to
> the
> >> > DPLL as references). In some DPLLs the fallback is automatic as long as
> >> > secondary clock is available when the primary goes away. Userspace
> tool
> >> > can preconfigure that before the failure occurs.
> >>
> >> OK, I see. It looks like this priority list implies which pins need to
> >> be enabled. That makes the netdev interface redundant.
> >
> > Netdev owns the PHY, so it needs to enable/disable clock from a given
> > port/lane - other than that it's the EEC's task. Technically - those subsystems
> > are separate.
> 
> So why is the UAPI conflating the two?

Because EEC can be a separate external device, but also can be integrated
inside the netdev. In the second case it makes more sense to just return
the state from a netdev 
 
> >> As a user, say I know the signal coming from swp1 is frequency-locked.
> >> How can I instruct the switch ASIC to propagate that signal to the other
> >> ports? Well, I go through swp2..swpN, and issue RTM_SETRCLKSTATE or
> >> whatever, with flags indicating I set up tracking, and pin number...
> >> what exactly? How do I know which pin carries clock recovered from
> swp1?
> >
> > You send the RTM_SETRCLKSTATE to the port that has the best reference
> > clock available.
> > If you want to know which pin carries the clock you simply send the
> > RTM_GETRCLKSTATE and it'll return the list of possible outputs with the
> flags
> > saying which of them are enabled (see the newer revision)
> 
> As a user I would really prefer to have a pin reference reported
> somewhere at the netdev / phy / somewhere. Similarly to how a netdev can
> reference a PHC. But whatever, I won't split hairs over this, this is
> actually one aspect that is easy to add later.

I believe the best way would be to use sysfs entry for that (and provide a basic
control using it as well). But first we need the UAPI defined.
 
> >> >> > More advanced functionality will be grown organically, as I also have
> >> >> > a limited view of SyncE and am not expert on switches.
> >> >>
> >> >> We are growing it organically _right now_. I am strongly advocating an
> >> >> organic growth in the direction of a first-class DPLL object.
> >> >
> >> > If it helps - I can separate the PHY RCLK control patches and leave EEC
> state
> >> > under review
> >>
> >> Not sure what you mean by that.
> >
> > Commit RTM_GETRCLKSTATE and RTM_SETRCLKSTATE now, wait with
> > RTM_GETEECSTATE  till we clarify further direction of the DPLL subsystem
> 
> It's not just state though. There is another oddity that I am not sure
> is intentional. The proposed UAPI allows me to set up fairly general
> frequency bridging. In a device with a bunch of ports, it would allow me
> to set up, say, swp1 to track RCLK from swp2, then swp3 from swp4, etc.
> But what will be the EEC state in that case?

Yes. GET/SET UAPI is exactly there to configure that bridging. All it does
is to set up the recovered frequency on physical frequency output pins
of the phy/integrated device. In case DPLL is embedded the pins may be 
internal to the device and not exposed externally. It doesn't allow creation
of the tracking maps, as that's usually not the case in SyncE appliances.
In typical ones you recover the clock from a single port and then use that 
clock on all other ports.
The EEC state will depend on the signal quality and the configuration.
When the clock is enabled and is valid the EEC will tune its internal frequency
and report locked/Locked HO Acquired state.

I can remove the word STATE from the name and change it to RTM_{GET,SET}RCLK
if "state" is confusing there.
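
A minimal userspace sketch of the typical flow described above (pick the one port with the best reference and enable its recovered clock output). The struct mirrors if_set_rclk_msg from the proposed synce.rst; the numeric value of SET_RCLK_FLAGS_ENA, the helper rclk_fill_request() and the assumption that the pin range is inclusive are illustrative only and are not taken from the series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit value; the numeric definition is not shown in this thread. */
#define SET_RCLK_FLAGS_ENA (1U << 0)

/* Userspace mirror of struct if_set_rclk_msg from the proposed synce.rst. */
struct if_set_rclk_msg {
	uint32_t ifindex; /* interface index */
	uint32_t out_idx; /* output index (from a valid range) */
	uint32_t flags;   /* configuration flags */
};

/*
 * Hypothetical helper: fill a request that enables or disables one
 * recovered clock output of a port. min_pin and max_pin stand for the
 * values a prior RTM_GETRCLKRANGE query would report; the range is
 * assumed to be inclusive. Returns 0 on success and -1 if out_idx is
 * outside [min_pin, max_pin].
 */
static int rclk_fill_request(struct if_set_rclk_msg *msg, uint32_t ifindex,
			     uint32_t out_idx, uint32_t min_pin,
			     uint32_t max_pin, int enable)
{
	assert(msg != NULL);

	if (out_idx < min_pin || out_idx > max_pin)
		return -1;

	memset(msg, 0, sizeof(*msg));
	msg->ifindex = ifindex;
	msg->out_idx = out_idx;
	msg->flags = enable ? SET_RCLK_FLAGS_ENA : 0;
	return 0;
}
```

The filled structure would then be carried in an RTM_SETRCLKSTATE request for that single port; the other ports take their TX clock from the EEC and need no request of their own in this model.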
Jakub Kicinski Nov. 15, 2021, 9:42 p.m. UTC | #17
On Mon, 15 Nov 2021 10:12:25 +0000 Machnikowski, Maciej wrote:
> > > Netdev owns the PHY, so it needs to enable/disable clock from a given
> > > port/lane - other than that it's the EEC's task. Technically - those subsystems
> > > are separate.  
> > 
> > So why is the UAPI conflating the two?  
> 
> Because EEC can be a separate external device, but also can be integrated
> inside the netdev. In the second case it makes more sense to just return
> the state from a netdev 

I mentioned that we are in a need of such API to Vadim who, among other
things, works on the OCP Timecard. He indicated interest in developing
the separate netlink interface for "DPLLs" (the timecard is just an
atomic clock + GPS, no netdev to hang from). Let's wait for Vadim's work
to materialize and build on top of that.

Patch

diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst
new file mode 100644
index 000000000000..4ca41fb9a481
--- /dev/null
+++ b/Documentation/networking/synce.rst
@@ -0,0 +1,117 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Synchronous Ethernet
+====================
+
+Synchronous Ethernet networks use a physical layer clock to syntonize
+the frequency across different network elements.
+
+A basic SyncE node, as defined in ITU-T G.8264, consists of an Ethernet
+Equipment Clock (EEC) and a PHY that has dedicated recovered clock outputs
+and a dedicated TX clock input that is used to transmit data to other nodes.
+
+A SyncE-capable PHY is able to recover the incoming frequency of the data
+stream on RX lanes and redirect it (sometimes dividing it) to recovered
+clock outputs. In a SyncE PHY the TX frequency is directly dependent on the
+input frequency - either on the PHY CLK input, or on a dedicated
+TX clock input.
+
+      ┌───────────┬──────────┐
+      │ RX        │ TX       │
+  1   │ lanes     │ lanes    │ 1
+  ───►├──────┐    │          ├─────►
+  2   │      │    │          │ 2
+  ───►├──┐   │    │          ├─────►
+  3   │  │   │    │          │ 3
+  ───►├─▼▼   ▼    │          ├─────►
+      │ ──────    │          │
+      │ \____/    │          │
+      └──┼──┼─────┴──────────┘
+        1│ 2│        ▲
+ RCLK out│  │        │ TX CLK in
+         ▼  ▼        │
+       ┌─────────────┴───┐
+       │                 │
+       │       EEC       │
+       │                 │
+       └─────────────────┘
+
+The EEC can synchronize its frequency to one of the synchronization inputs:
+either clocks recovered on traffic interfaces or (in advanced deployments)
+external frequency sources.
+
+Some EEC implementations can select the synchronization source through
+priority tables and synchronization status messaging, and provide the
+necessary filtering and holdover capabilities.
+
+The following interface can be applicable to different packet network types
+following ITU-T G.8261/G.8262 recommendations.
+
+Interface
+=========
+
+The following RTNL messages are used to read/configure SyncE recovered
+clocks.
+
+RTM_GETRCLKRANGE
+-----------------
+Reads the allowed pin index range for the recovered clock outputs.
+The range can be aligned to PHY outputs or to EEC inputs, whichever is
+better for a given application.
+
+This message calls the ndo_get_rclk_range function to read the allowed
+range of output pin indexes and returns it in two attributes:
+IFLA_RCLK_RANGE_MIN_PIN (the lowest allowed pin index) and
+IFLA_RCLK_RANGE_MAX_PIN (the highest allowed pin index).
+
+RTM_GETRCLKSTATE
+-----------------
+Reads the state of the pins that output the recovered clock from
+a given port.
+To support multiple recovered clock outputs from the same port, this message
+returns two kinds of attributes: IFLA_RCLK_STATE_COUNT, containing the
+number of active recovered clock outputs (N), and N
+IFLA_RCLK_STATE_OUT_IDX attributes listing the active output indexes.
+
+To build the reply, this message calls ndo_get_rclk_range to determine
+the allowed recovered clock indexes and then loops through them,
+calling ndo_get_rclk_state for each of them.
+
+RTM_SETRCLKSTATE
+-----------------
+Sets the redirection of the recovered clock for a given pin. This message
+expects one attribute:
+struct if_set_rclk_msg {
+	__u32 ifindex; /* interface index */
+	__u32 out_idx; /* output index (from a valid range) */
+	__u32 flags; /* configuration flags */
+};
+
+Supported flags are:
+SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
+		     if clear - the output will be disabled.
+
+RTM_GETEECSTATE
+----------------
+Reads the state of the EEC or equivalent physical clock synchronizer.
+This message returns the following attributes:
+IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
+		 The states returned in this attribute are aligned to the
+		 ITU-T G.781 and are:
+		  IF_EEC_STATE_INVALID - state is not valid
+		  IF_EEC_STATE_FREERUN - clock is free-running
+		  IF_EEC_STATE_LOCKED - clock is locked to the reference,
+		                        but the holdover memory is not valid
+		  IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
+		                               and holdover memory is valid
+		  IF_EEC_STATE_HOLDOVER - clock is in holdover mode
+The state is read from the netdev by calling:
+int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
+			 u32 *src_idx, struct netlink_ext_ack *extack);
+
+IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
+		   is used for the current IFLA_EEC_STATE, i.e., the index of
+		   the pin that the EEC is locked to.
+
+Will be returned only if the ndo_get_eec_src is implemented.
\ No newline at end of file
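
The RTM_GETRCLKSTATE flow described in the document (query the allowed range, then query each pin) can be sketched in plain C as below. This is only an illustration under stated assumptions: struct rclk_ops and its callback signatures are hypothetical stand-ins for ndo_get_rclk_range and ndo_get_rclk_state, the range is assumed to be inclusive, and the demo_* functions model a toy device rather than any real driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver callbacks named in the document. */
struct rclk_ops {
	int (*get_rclk_range)(void *dev, uint32_t *min_idx, uint32_t *max_idx);
	int (*get_rclk_state)(void *dev, uint32_t out_idx, int *enabled);
};

/*
 * Walk the allowed pin range and collect the indexes of enabled outputs.
 * Returns the number of enabled outputs (the N reported in
 * IFLA_RCLK_STATE_COUNT), or a non-zero error passed up from a callback.
 * At most cap indexes are stored in out.
 */
static int rclk_collect_active(const struct rclk_ops *ops, void *dev,
			       uint32_t *out, size_t cap)
{
	uint32_t min_idx, max_idx, idx;
	int count = 0, enabled, err;

	assert(ops && ops->get_rclk_range && ops->get_rclk_state);

	err = ops->get_rclk_range(dev, &min_idx, &max_idx);
	if (err)
		return err;
	if (min_idx > max_idx)
		return -1;

	/* Break on idx == max_idx so that max_idx == UINT32_MAX cannot wrap. */
	for (idx = min_idx; ; idx++) {
		enabled = 0;
		err = ops->get_rclk_state(dev, idx, &enabled);
		if (err)
			return err;
		if (enabled) {
			if ((size_t)count < cap)
				out[count] = idx;
			count++;
		}
		if (idx == max_idx)
			break;
	}
	return count;
}

/* Toy device for illustration: pins 1..3 exist, pins 1 and 3 are enabled. */
static int demo_range(void *dev, uint32_t *min_idx, uint32_t *max_idx)
{
	(void)dev;
	*min_idx = 1;
	*max_idx = 3;
	return 0;
}

static int demo_state(void *dev, uint32_t out_idx, int *enabled)
{
	(void)dev;
	*enabled = (out_idx == 1 || out_idx == 3);
	return 0;
}

static const struct rclk_ops demo_ops = { demo_range, demo_state };
```

Calling rclk_collect_active(&demo_ops, NULL, out, 4) on the toy device reports two active outputs and stores indexes 1 and 3, which corresponds to one IFLA_RCLK_STATE_COUNT attribute with value 2 and two IFLA_RCLK_STATE_OUT_IDX attributes.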