diff mbox series

[RFC,net-next,7/9] net: dsa: microchip: ksz9477: add hardware time stamping support

Message ID 20201019172435.4416-8-ceggers@arri.de (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series net: dsa: microchip: PTP support for KSZ956x | expand

Commit Message

Christian Eggers Oct. 19, 2020, 5:24 p.m. UTC
Add routines required for TX hardware time stamping.

The KSZ9563 only supports one step time stamping
(HWTSTAMP_TX_ONESTEP_P2P), which requires linuxptp-2.0 or later. PTP
mode is permanently enabled (changes tail tag; depends on
CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP).TX time stamps are reported via an
interrupt / device registers whilst RX time stamps are reported via an
additional tail tag.

One step TX time stamping of PDelay_Resp requires the RX time stamp from
the associated PDelay_Req message. linuxptp assumes that the RX time
stamp has already been subtracted from the PDelay_Req correction field
(as done by the TI PHYTER). linuxptp will echo back the value of the
correction field in the PDelay_Resp message.

In order to be compatible to this already established interface, the
KSZ9563 code emulates this behavior. When processing the PDelay_Resp
message, the time stamp is moved back from the correction field to the
tail tag, as the hardware doesn't support negative values on this field.
Of course, the UDP checksums (if any) have to be corrected after this
(for both directions).

The PTP hardware performs internal detection of PTP frames (likely
similar as ptp_classify_raw() and ptp_parse_header()). As these filters
cannot be disabled, the current delay mode (E2E/P2P) and the clock mode
(master/slave) must be configured via sysfs attributes. Time stamping
will only be performed on PTP packets matching the current mode
settings.

Everything has been tested on a Microchip KSZ9563 switch.

Signed-off-by: Christian Eggers <ceggers@arri.de>
---
 drivers/net/dsa/microchip/ksz9477_main.c |   9 +-
 drivers/net/dsa/microchip/ksz9477_ptp.c  | 629 +++++++++++++++++++++++
 drivers/net/dsa/microchip/ksz9477_ptp.h  |  27 +
 include/linux/dsa/ksz_common.h           |  30 ++
 net/dsa/tag_ksz.c                        | 206 +++++++-
 5 files changed, 893 insertions(+), 8 deletions(-)

Comments

Vladimir Oltean Oct. 20, 2020, 12:10 a.m. UTC | #1
On Mon, Oct 19, 2020 at 07:24:33PM +0200, Christian Eggers wrote:
> Add routines required for TX hardware time stamping.
> 
> The KSZ9563 only supports one step time stamping
> (HWTSTAMP_TX_ONESTEP_P2P), which requires linuxptp-2.0 or later. PTP
> mode is permanently enabled (changes tail tag; depends on
> CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP).TX time stamps are reported via an
> interrupt / device registers whilst RX time stamps are reported via an
> additional tail tag.
> 
> One step TX time stamping of PDelay_Resp requires the RX time stamp from
> the associated PDelay_Req message. linuxptp assumes that the RX time
> stamp has already been subtracted from the PDelay_Req correction field
> (as done by the TI PHYTER). linuxptp will echo back the value of the
> correction field in the PDelay_Resp message.
> 
> In order to be compatible to this already established interface, the
> KSZ9563 code emulates this behavior. When processing the PDelay_Resp
> message, the time stamp is moved back from the correction field to the
> tail tag, as the hardware doesn't support negative values on this field.
> Of course, the UDP checksums (if any) have to be corrected after this
> (for both directions).
> 
> The PTP hardware performs internal detection of PTP frames (likely
> similar as ptp_classify_raw() and ptp_parse_header()). As these filters
> cannot be disabled, the current delay mode (E2E/P2P) and the clock mode
> (master/slave) must be configured via sysfs attributes. Time stamping
> will only be performed on PTP packets matching the current mode
> settings.
> 
> Everything has been tested on a Microchip KSZ9563 switch.

I looked a little bit at the KSZ9563 datasheet and I'm more confused
than I was before opening it.

-----------------------------[cut here]-----------------------------
The device supports V2 (2008) of the IEEE 1588 PTP specification and can
be programmed as either an end-to-end (E2E) or peer-to-peer (P2P)
transparent clock (TC) between ports. In addition, the host port can be
programmed as either a slave or master ordinary clock (OC) port.
Ingress timestamp capture, egress timestamp recording, correction field
update with residence time and link delay, delay turn-around time
insertion, egress timestamp insertion, and checksum update are
supported.
-----------------------------[cut here]-----------------------------

So it's a 1-step transparent clock, fair enough. That works autonomously
without any sort of involvement from the operating system, you know
that, right? This is stateless functionality.

BUT, if that is the case, what do you need PTP support in the kernel
for? What profiles are you using with linuxptp? What benefit does it
bring you if you report timestamps to the operating system, for
terminated 1588 traffic? Why would you even terminate 1588 traffic on
the host CPU? I fail to understand many of the use cases that this
switch is tailored for.

Also, I know that Microchip support does a pretty bad job at giving
useful answers, and the datasheet isn't quite clear either (looks like
there's info that has been copied from other switches, like for 2-step
timestamping, then removed, and too much was removed because now nothing
is clear) so you'll have to give your best shot at explaining some
things.


Global PTP Message Config 1 Register
------------------------------------

Bit 2: Selection of P2P or E2E
1 = Peer-to-peer (P2P) transparent clock mode
0 = End-to-end (E2E) transparent clock mode

What does this bit do exactly?
Does it change the switch's behavior as an autonomous 1-step transparent
clock? Or does it have anything to do with how/which timestamps are
delivered to the CPU? The point is, why do you care to configure this?
Sysfs is not going to fly without a solid explanation, which you did not
provide here.

My understanding of E2E vs P2P TC is that an E2E TC will correct the
timestamps of Pdelay messages, while a P2P TC won't. The P2P TC must
speak proper PDelay and not forward those packets sheepishly. Which
starts to answer my question, I believe... So my comment above, that the
1-step TC functionality doesn't require any involvement from the CPU, is
only correct for E2E TC, am I right? For P2P TC, you would need the host
CPU to speak peer delay. But you wouldn't need it for anything else (the
SYNC messages would have no reason to go to the CPU, would they?). So,
again, what profile are you using with linuxptp for this one?

If my understanding is right, maybe you want to just leave the switch
operate in E2E TC mode by default, and put it into P2P TC as soon as
your .port_hwtstamp_set() method is called?


Ok, on to my next question....

Bit 1: Selection of Master or Slave
1 = Host port is PTP master ordinary clock
0 = Host port is PTP slave ordinary clock

What does this _actually_ do? Here I really have no idea. I can only
imagine that this has again to do with the 1-step TC operation, and that
it's treating the host port as a switched endpoint, and this has to do
with the port states of the P2P TC. I'm so confused by this one that I
don't even know what to ask...
Ok, let's put it differently. You bothered to add a sysfs for it, so you
must be using it for something. What are you using it for?
Christian Eggers Oct. 20, 2020, 8:39 a.m. UTC | #2
On Tuesday, 20 October 2020, 02:10:40 CEST, Vladimir Oltean wrote:
> I looked a little bit at the KSZ9563 datasheet and I'm more confused
> than I was before opening it.
> 
> -----------------------------[cut here]-----------------------------
> The device supports V2 (2008) of the IEEE 1588 PTP specification and can
> be programmed as either an end-to-end (E2E) or peer-to-peer (P2P)
> transparent clock (TC) between ports. In addition, the host port can be
> programmed as either a slave or master ordinary clock (OC) port.
> Ingress timestamp capture, egress timestamp recording, correction field
> update with residence time and link delay, delay turn-around time
> insertion, egress timestamp insertion, and checksum update are
> supported.
> -----------------------------[cut here]-----------------------------
> 
> So it's a 1-step transparent clock, fair enough. That works autonomously
> without any sort of involvement from the operating system, you know
> that, right? This is stateless functionality.
The device performs "one-step time stamping". This means that the correction 
field in Sync and PDelay_Resp messages is updated automatically. Not more. For 
operation as a switch, this would be probably enough. But I need also to set 
the switch' local PTP clock in order to generate a synchronized output signal 
for local use.

> BUT, if that is the case, what do you need PTP support in the kernel
> for?
The device cannot autonomously set it's PTP clock. This requires calculating 
the link delay between the own port and the master/peer. It looks like the 
KSZ9563 cannot perform this calculation itself.

> What profiles are you using with linuxptp?
My config is at the bottom.

> What benefit does it
> bring you if you report timestamps to the operating system, for
> terminated 1588 traffic? Why would you even terminate 1588 traffic on
> the host CPU? I fail to understand many of the use cases that this
> switch is tailored for.
In my opinion, the switch doesn't terminate 1588 traffic. It only generates 
hardware time stamps and performs update of the correction field on selected 
messages.

> Also, I know that Microchip support does a pretty bad job at giving
> useful answers, and the datasheet isn't quite clear either (looks like
> there's info that has been copied from other switches, like for 2-step
> timestamping, then removed, and too much was removed because now nothing
> is clear) so you'll have to give your best shot at explaining some
> things.
It looks like it was planned to support also 2-step time stamping. You can see 
an errata in revision B of the errata sheet [1]. Revision C dropped this 
again.

> Global PTP Message Config 1 Register
> ------------------------------------
> 
> Bit 2: Selection of P2P or E2E
> 1 = Peer-to-peer (P2P) transparent clock mode
> 0 = End-to-end (E2E) transparent clock mode
> 
> What does this bit do exactly?
> Does it change the switch's behavior as an autonomous 1-step transparent
> clock? Or does it have anything to do with how/which timestamps are
> delivered to the CPU? The point is, why do you care to configure this?
> Sysfs is not going to fly without a solid explanation, which you did not
> provide here.
In short: Clock synchronization (using linuxptp) doesn't work if this is not 
configured in sync with ptp4l.conf. Either no time stamping is performed or 
PTP messages are filtered out. If required, I can investigate this further. 

> My understanding of E2E vs P2P TC is that an E2E TC will correct the
> timestamps of Pdelay messages, while a P2P TC won't
Yes, I think so. But I am no expert for PTP.

> The P2P TC must speak proper PDelay and not forward those packets 
> sheepishly. Which starts to answer my question, I believe... So my comment 
> above, that the 1-step TC functionality doesn't require any involvement from 
> the CPU, is only correct for E2E TC, am I right? For P2P TC, you would need 
> the host CPU to speak peer delay.
The host CPU is not required for updating forwarded (switched) PTP messages. 
But it is required RX/TX time stamping of own PTP messages. All local PTP 
operations can be done on the switch, there's no work left for the MAC 
connected to the host port.

> But you wouldn't need it for anything else (the
> SYNC messages would have no reason to go to the CPU, would they?). 
The (time stamped) SYNC messages are required in order to calculate the offset 
from the master clock.

> So, again, what profile are you using with linuxptp for this one?
only a few lines left...

> If my understanding is right, maybe you want to just leave the switch
> operate in E2E TC mode by default, and put it into P2P TC as soon as
> your .port_hwtstamp_set() method is called?
Currently the TC mode has to be set manually. So the automatic mode of ptp4l 
cannot be used. Probably this could be automated depending on received PTP 
messages. But probably we cannot see these messages because these are filtered 
out.

> Ok, on to my next question....
> 
> Bit 1: Selection of Master or Slave
> 1 = Host port is PTP master ordinary clock
> 0 = Host port is PTP slave ordinary clock
> 
> What does this _actually_ do? Here I really have no idea. I can only
> imagine that this has again to do with the 1-step TC operation, and that
> it's treating the host port as a switched endpoint, and this has to do
> with the port states of the P2P TC. I'm so confused by this one that I
> don't even know what to ask...
As for the TC mode questions, also this one could be best answered by 
Microchip. If not set correctly, clock synchronization will not work. I need 
to investigate what happens here concretely. Again I suppose that time 
stamping of the associated messages will not be performed or some messages may 
be filtered out.

> Ok, let's put it differently. You bothered to add a sysfs for it, so you
> must be using it for something. What are you using it for?
If not set correctly, nothing will work. This has the major drawback, that 
ptp4l cannot determine the clock mode nor the delay measurement mode 
automatically. I will try find some answers for this.

Best regards
Christian

[global]
twoStepFlag 0
time_stamping p2p1step
#slaveOnly 1
delay_mechanism P2P
#delay_mechanism E2E
delay_filter_length 100
tx_timestamp_timeout 100
#network_transport UDPv6
#network_transport L2

[1] 
http://ww1.microchip.com/downloads/en/DeviceDoc/KSZ9563R-Errata-80000786B.pdf
Vladimir Oltean Oct. 21, 2020, 11:39 p.m. UTC | #3
On Mon, Oct 19, 2020 at 07:24:33PM +0200, Christian Eggers wrote:
> Add routines required for TX hardware time stamping.
> 
> The KSZ9563 only supports one step time stamping
> (HWTSTAMP_TX_ONESTEP_P2P), which requires linuxptp-2.0 or later. PTP
> mode is permanently enabled (changes tail tag; depends on
> CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP).TX time stamps are reported via an
> interrupt / device registers whilst RX time stamps are reported via an
> additional tail tag.
> 
> One step TX time stamping of PDelay_Resp requires the RX time stamp from
> the associated PDelay_Req message. linuxptp assumes that the RX time
> stamp has already been subtracted from the PDelay_Req correction field
> (as done by the TI PHYTER). linuxptp will echo back the value of the
> correction field in the PDelay_Resp message.
> 
> In order to be compatible to this already established interface, the
> KSZ9563 code emulates this behavior. When processing the PDelay_Resp
> message, the time stamp is moved back from the correction field to the
> tail tag, as the hardware doesn't support negative values on this field.
> Of course, the UDP checksums (if any) have to be corrected after this
> (for both directions).
> 
> The PTP hardware performs internal detection of PTP frames (likely
> similar as ptp_classify_raw() and ptp_parse_header()). As these filters
> cannot be disabled, the current delay mode (E2E/P2P) and the clock mode
> (master/slave) must be configured via sysfs attributes. Time stamping
> will only be performed on PTP packets matching the current mode
> settings.
> 
> Everything has been tested on a Microchip KSZ9563 switch.
> 
> Signed-off-by: Christian Eggers <ceggers@arri.de>
> ---

Hi Richard!

Looks like we've been discussing without you here. I had an off-list discussion
with Christian and we just realized that. The topic of this email is that we've
got new P2P one-step hardware coming in, and it's not quite API-compatible with
what's there already (you might not actually be surprised by that).

Since you weren't copied to Christian's patches, here's a patchwork link for you:
https://patchwork.ozlabs.org/project/netdev/cover/20201019172435.4416-1-ceggers@arri.de/

Short introduction by quoting from the IEEE 1588 standard, just to get everybody
on the same page.

----------------------------------[cut here]-----------------------------------

11.4.3 Peer delay mechanism operational specifications
------------------------------------------------------

The actual value of the meanPathDelay shall be measured and computed as follows
for each instance of a peer delay request-response measurement:

(a) The delay requestor, Node-A:
    (1) Prepares a Pdelay_Req message. The correctionField (see 13.3.2.7) shall
        be set to 0.
    (2) If asymmetry corrections are required, shall modify the correctionField
        per 11.6.4.
    (3) Shall set the originTimestamp to 0 or an estimate no worse than +/-1s of
        the egress timestamp, t1, of the Pdelay_Req message.
    (4) Shall send the Pdelay_Req message and generate and save timestamp t1.
(b) If the delay responder, Node-B, is a one-step clock, it shall:
    (1) Generate timestamp t2 upon receipt of the Pdelay_Req message
    (2) Prepare a Pdelay_Resp message
    (3) Copy the sequenceId field from the Pdelay_Req message to the sequenceId
        field of the Pdelay_Resp message
    (4) Copy the sourcePortIdentity field from the Pdelay_Req message to the
        requestingPortIdentity field of the Pdelay_Resp message
    (5) Copy the domainNumber field from the Pdelay_Req message to the
        domainNumber field of the Pdelay_Resp message
    (6) Copy the correctionField from the Pdelay_Req message to the
        correctionField of the Pdelay_Resp message
    (7) Then:
        (i) Set to 0 the requestReceiptTimestamp field of the Pdelay_Resp message
        (ii) Issue the Pdelay_Resp message and generate timestamp t3 upon sending
        (iii) After t3 is generated but while the Pdelay_Resp message is
              leaving the responder, shall add the turnaround time t3 - t2 to
              the correctionField of the Pdelay_Resp message and make any
              needed corrections to checksums or other content-dependent fields
              of the Pdelay_Resp message
(c) If the delay responder is a two-step clock, it shall:
    [skipping because in our case, the delay responder is a one-step clock]
(d) The delay requestor, Node-A, shall:
    (1) Generate timestamp t 4 upon receipt of the Pdelay_Resp message
    (2) If asymmetry corrections are required, modify the correctionField of
        the Pdelay_Resp message per 11.6.5
    (3) If the twoStepFlag of the received Pdelay_Resp message is FALSE,
        indicating that no Pdelay_Resp_Follow_Up message will be received,
        compute the <meanPathDelay> as:
        <meanPathDelay> = [(t4 - t1) - correctionField of Pdelay_Resp] / 2

----------------------------------[cut here]-----------------------------------

Simply put:

          |                    |
       t1 |------\ Pdelay_Req  |
          |       ------\      |
          |              ----->| t2
Clock A   |                    |     Clock B
          | Pdelay_Resp /------| t3
          |       ------       |
       t4 |<-----/             |
          |                    |

                (t4 - t1) - (t3 - t2)
meanPathDelay = ---------------------
                          2

where t3 - t2 (the "turnaround" time) is contained inside the correctionField
of the Pdelay_Resp. To reiterate, t3 is the RX timestamp of the Pdelay_Req, and
t4 is the TX timestamp of the Pdelay_Resp, both taken at the delay responder
(Clock B).

That was about it for the standard.


The hardware that Christian is configuring (consider operation as Clock B)
works this way:
(a) the ingress port takes the t2 RX timestamp of the Pdelay_Req normally
(b) on transmission of the Pdelay_Resp, the kernel must provide t2 as metadata
    together with the Pdelay_Resp frame itself (it is put in the "TX timestamp"
    field of the DSA tag)
(c) the egress port takes the t3 TX timestamp and rewrites the correctionField
    of the PTP header with the (t3 - t2) value

Straightforward, one might say?
Except for the fact that this is not how the API for peer-to-peer one-step
timestamping currently works.
We were expecting that for this switch to have a streamlined implementation for
P2P 1-step, the kernel would have an API to inject a Pdelay_Resp packet into
the stack _along_ with t2. But it doesn't.

So how _does_ that work for TI PHYTER?

As far as we understand, the PHYTER appears to autonomously mangle PTP packets
in the following way:
- subtracting t2 on RX from the correctionField of the Pdelay_Req
- adding t3 on TX to the correctionField of the Pdelay_Resp

All that linuxptp has to do is connect the wires. That's why in process_pdelay_req()
from port.c, there is:

	if (p->timestamping == TS_P2P1STEP) {
		rsp->header.correction = m->header.correction;

Otherwise said, the t2 is not provided to the kernel explicitly, but simply by
copying the correctionField from the Pdelay_Req into the Pdelay_Resp. This
correctionField is the play ground of the PHYTER, so by having linuxptp copy
that field, it simply has a path to finish off what it started. We believe that
by the time the Pdelay_Req message arrives at the host, the correctionField
will be negative, and it will only become back positive after transmitting the
Pdelay_Resp. This is also something that might be problematic for other devices.

Nevertheless, I'm not here to say that it's wrong to do what we are currently
doing for PHYTER. I mean, paragraph 11.4.3.(b).(6) clearly states that the
correctionField should be transferred anyway from the Pdelay_Req to the
Pdelay_Resp. I am not entirely sure of the reasons why it stipulates this, if
it's just to allow hardware like PHYTER to operate within spec, or if it's for
the delay requestor to be aware of the correction that was applied to his frame.
Regardless of which way it is, we expect that any 1588 stack will have to do
that, so this is not implementation-defined behavior.

What I'm here to say is that this is not how Christian's hardware works, and he
is simply emulating the kernel interface used by the PHYTER. He needs to
subtract t2 on RX from the correctionField manually in the tagger code, then
have extra code on TX to copy t2 from the correctionField into the DSA tag's TX
timestamp field. This complicates the code by quite a significant margin, and
my concern is that nobody is going to follow what's going on while
maintaining it, and why it is done in this backwards way.
We would like to know your opinion on whether we couldn't, in fact, have a more
straightforward UAPI for hardware like his. For example, expand the sendmsg()
API with a new cmsg that contains t2 specifically for P2P one-step
timestamping. Or anything else. But "straightforward" should be the key.

Thanks,
-Vladimir
Richard Cochran Oct. 22, 2020, 2:42 a.m. UTC | #4
I'm just catching up with this.

Really. Truly. Please -- Include the maintainer on CC for such patches!

In case you don't know who that is, you can always consult the MAINTAINERS file.

There you will find the following entry.

    PTP HARDWARE CLOCK SUPPORT
    M:      Richard Cochran <richardcochran@gmail.com>
    L:      netdev@vger.kernel.org
    S:      Maintained
    W:      http://linuxptp.sourceforge.net/

On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> On Mon, Oct 19, 2020 at 07:24:33PM +0200, Christian Eggers wrote:
> > The PTP hardware performs internal detection of PTP frames (likely
> > similar as ptp_classify_raw() and ptp_parse_header()). As these filters
> > cannot be disabled, the current delay mode (E2E/P2P) and the clock mode
> > (master/slave) must be configured via sysfs attributes.

This is a complete no-go.  NAK.

Richard
Richard Cochran Oct. 22, 2020, 3:02 a.m. UTC | #5
On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> So how _does_ that work for TI PHYTER?
> 
> As far as we understand, the PHYTER appears to autonomously mangle PTP packets
> in the following way:
> - subtracting t2 on RX from the correctionField of the Pdelay_Req
> - adding t3 on TX to the correctionField of the Pdelay_Resp

The Phyter does not support peer-to-peer one step.

The only driver that implements it is ptp_ines.c.

And *that* driver/HW implements it correctly.

Thanks,
Richard
Christian Eggers Oct. 22, 2020, 7:30 a.m. UTC | #6
Hi Richard,

On Thursday, 22 October 2020, 04:42:01 CEST, Richard Cochran wrote:
> I'm just catching up with this.
> 
> Really. Truly. Please -- Include the maintainer on CC for such patches!
sorry for missing you on the recipients list. I blindly trusted the output of
get_maintainer.pl.

I recently sent two other patches which may also be of interest for you. They
related to handling of SO_TIMESTAMPING on 32 bit platforms with newer C 
libraries:

https://patchwork.ozlabs.org/project/netdev/patch/20201012093542.15504-1-ceggers@arri.de/
https://patchwork.ozlabs.org/project/netdev/patch/20201012093542.15504-2-ceggers@arri.de/

> On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> > On Mon, Oct 19, 2020 at 07:24:33PM +0200, Christian Eggers wrote:
> > > The PTP hardware performs internal detection of PTP frames (likely
> > > similar as ptp_classify_raw() and ptp_parse_header()). As these filters
> > > cannot be disabled, the current delay mode (E2E/P2P) and the clock mode
> > > (master/slave) must be configured via sysfs attributes.
> 
> This is a complete no-go.  NAK.
I didn't design the hardware nor do I have access to adequate documentation.
I will try to figure out what functionality is concretely affected by these
two settings.

If I am correct, the KSZ hardware consists of two main building blocks:
1. A TC on the switch path.
2. An OC on the DSA host port.

From the data sheet, page 109, chapter 5.1.6.11
("Global PTP Message Config 1 Register"), bit 2:

> *Selection of P2P or E2E*
> 1 = Peer-to-peer (P2P) transparent clock mode
> 0 = End-to-end (E2E) transparent clock mode
So this "tcmode" sysfs entry controls the behavior of the switch' transparent
clock. Is this related in any way to the PTP API?

For the master/slave setting, the data sheet writes the following:
*Selection of Master or Slave*
1 = Host port is PTP master ordinary clock
0 = Host port is PTP slave ordinary clock

So this is really related to the OC and so to the PTP API. Setting this
manually would interfere with the BMCA. I'll check whether delay measurement
and clock synchronization can also work for all conditions (E2E/P2P, 
master/slave) without altering this value. Otherwise we might consider
the KSZ as a "Slave Only Clock (SO)".

Christian
Vladimir Oltean Oct. 22, 2020, 9:01 a.m. UTC | #7
On Wed, Oct 21, 2020 at 08:02:17PM -0700, Richard Cochran wrote:
> On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> > So how _does_ that work for TI PHYTER?
> >
> > As far as we understand, the PHYTER appears to autonomously mangle PTP packets
> > in the following way:
> > - subtracting t2 on RX from the correctionField of the Pdelay_Req
> > - adding t3 on TX to the correctionField of the Pdelay_Resp
>
> The Phyter does not support peer-to-peer one step.

Ok, that's my mistake for not double-checking, sorry.

> The only driver that implements it is ptp_ines.c.
>
> And *that* driver/HW implements it correctly.

Is there documentation available for this timestamping block? I might be
missing some data, as the kernel driver is fairly pass-through for the
TX timestamping of the Pdelay_Resp, so the hardware might just 'do the
right thing'. I believe the answer lies within the timestamper's
per-port RX FIFO. Just guessing here, but I suspect that the RX FIFO
doesn't get cleared immediately after the host queries it, and the
hardware caches the last 100 events from the pool and uses the RX
timestamps of Pdelay_Req as t2 in the process of updating the
correctionField of Pdelay_Resp on TX. Would that be correct?
Christian Eggers Oct. 22, 2020, 10:17 a.m. UTC | #8
On Thursday, 22 October 2020, 09:30:57 CEST, Christian Eggers wrote:
> On Thursday, 22 October 2020, 04:42:01 CEST, Richard Cochran wrote:
> > On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> > > On Mon, Oct 19, 2020 at 07:24:33PM +0200, Christian Eggers wrote:
> > > > The PTP hardware performs internal detection of PTP frames (likely
> > > > similar as ptp_classify_raw() and ptp_parse_header()). As these
> > > > filters
> > > > cannot be disabled, the current delay mode (E2E/P2P) and the clock
> > > > mode
> > > > (master/slave) must be configured via sysfs attributes.
> > 
> > This is a complete no-go.  NAK.
> 
> I didn't design the hardware nor do I have access to adequate documentation.
> I will try to figure out what functionality is concretely affected by these
> two settings.
I tried to study the effect of setting the ocmode bit on the KSZ either to
master or to slave. The main visible change is, that some PTP message types
are be filtered out on RX:
- in "master" mode, "Sync" messages from other nodes will not be received
(but everything else like "Announce" seem to work)
- in "slave" mode, "Delay_Req" messages from other nodes will not be received

I am not an expert for PTP, so the following is only the idea of a beginner how
this could probably be handled:

As PTP announce messages are received all the time, the BMCA should always
be able to work. The KSZ hardware needs to be set to "master" when a node
is becoming master (in order to be able to receive (and answer) Delay_Req
messages). The setting "slave" is equired when the BCMA decides not being
master anymore (in order to receive Sync messages).

Handling the transition to "master" mode could probably be done easily in the 
driver (when a Sync message is seen in TX direction by the time stamping code).
But transition to slave seems to be difficult, because the tagging driver cannot
see when the node stops being master. For user space (ptp4l), the decision for
master/slave mode could probably be done easier.

If Richard (or somebody else) decides that "mode switching" of the KSZ device
would not be appropriate, I suspect the functionality of the KSZ has to be
limited to "Slave Only Clock".

regards
Christian
Vladimir Oltean Oct. 22, 2020, 10:50 a.m. UTC | #9
On Thu, Oct 22, 2020 at 12:01:26PM +0300, Vladimir Oltean wrote:
> On Wed, Oct 21, 2020 at 08:02:17PM -0700, Richard Cochran wrote:
> > On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> > > So how _does_ that work for TI PHYTER?
> > >
> > > As far as we understand, the PHYTER appears to autonomously mangle PTP packets
> > > in the following way:
> > > - subtracting t2 on RX from the correctionField of the Pdelay_Req
> > > - adding t3 on TX to the correctionField of the Pdelay_Resp
> >
> > The Phyter does not support peer-to-peer one step.
> 
> Ok, that's my mistake for not double-checking, sorry.
> 
> > The only driver that implements it is ptp_ines.c.
> >
> > And *that* driver/HW implements it correctly.
> 
> Is there documentation available for this timestamping block? I might be
> missing some data, as the kernel driver is fairly pass-through for the
> TX timestamping of the Pdelay_Resp, so the hardware might just 'do the
> right thing'. I believe the answer lies within the timestamper's
> per-port RX FIFO. Just guessing here, but I suspect that the RX FIFO
> doesn't get cleared immediately after the host queries it, and the
> hardware caches the last 100 events from the pool and uses the RX
> timestamps of Pdelay_Req as t2 in the process of updating the
> correctionField of Pdelay_Resp on TX. Would that be correct?

Ah, no, I think I'm wrong. As per your own description here:
https://lwn.net/Articles/807970/

If the hardware supports p2p one-step, it subtracts the ingress time
stamp value from the Pdelay_Request correction field.  The user space
software stack then simply copies the correction field into the
Pdelay_Response, and on transmission the hardware adds the egress time
stamp into the correction field.

So we were correct about the behavior, just not about the target
hardware.

So, that's just not how this hardware works. What do you recommend?
Keeping a FIFO of Pdelay_Req RX timestamps, and matching them to
Pdelay_Resp messages on TX, all of that contained within tag_ksz.c?
Christian Eggers Oct. 22, 2020, 11:11 a.m. UTC | #10
Hi Vladimir,

On Thursday, 22 October 2020, 12:50:14 CEST, Vladimir Oltean wrote:
> If the hardware supports p2p one-step, it subtracts the ingress time
> stamp value from the Pdelay_Request correction field.  The user space
> software stack then simply copies the correction field into the
> Pdelay_Response, and on transmission the hardware adds the egress time
> stamp into the correction field.
> 
> So we were correct about the behavior, just not about the target
> hardware.
> 
> So, that's just not how this hardware works. What do you recommend?
> Keeping a FIFO of Pdelay_Req RX timestamps, and matching them to
> Pdelay_Resp messages on TX, all of that contained within tag_ksz.c?

after applying the RX timestamp correctly to the correction field (shifting 
the nanoseconds by 16), it seems that "moving" the timestamp back to the tail 
tag on TX is not required anymore. Keeping the RX timestamp simply in
the correction field (negative value), works fine now. So this halves the 
effort in the tag_ksz driver.

Best regards
Christian
Vladimir Oltean Oct. 22, 2020, 11:32 a.m. UTC | #11
On Thu, Oct 22, 2020 at 01:11:40PM +0200, Christian Eggers wrote:
> Hi Vladimir,
> 
> On Thursday, 22 October 2020, 12:50:14 CEST, Vladimir Oltean wrote:
> after applying the RX timestamp correctly to the correction field (shifting
> the nanoseconds by 16),

That modification should have been done anyway, since the unit of
measurement for correctionField is scaled ppb (48 bits nanoseconds, 16
bits scaled nanoseconds), and not nanoseconds.

> it seems that "moving" the timestamp back to the tail tag on TX is not
> required anymore. Keeping the RX timestamp simply in the correction
> field (negative value), works fine now. So this halves the effort in
> the tag_ksz driver.

Ok, this makes sense.
Depending on what Richard responds, it now looks like the cleanest
approach would be to move your implementation that is currently in
ksz9477_update_ptp_correction_field() into a generic function called

static inline void ptp_onestep_p2p_move_t2_to_correction(struct sk_buff *skb,
							 unsigned int ptp_type,
							 struct ptp_header *ptp_header,
							 ktime_t t2)

You should then clearly document that this function is needed for
hardware capable of one-step P2P that does not already modify the
correctionField of Pdelay_Req event messages on ingress.
Richard Cochran Oct. 22, 2020, 2:34 p.m. UTC | #12
On Thu, Oct 22, 2020 at 02:32:43PM +0300, Vladimir Oltean wrote:
> On Thu, Oct 22, 2020 at 01:11:40PM +0200, Christian Eggers wrote:
> 
> > it seems that "moving" the timestamp back to the tail tag on TX is not
> > required anymore. Keeping the RX timestamp simply in the correction
> > field (negative value), works fine now. So this halves the effort in
> > the tag_ksz driver.
> 
> Ok, this makes sense.
> Depending on what Richard responds, it now looks like the cleanest
> approach would be to move your implementation that is currently in
> ksz9477_update_ptp_correction_field() into a generic function called

+1
Vladimir Oltean Oct. 30, 2020, 6:24 p.m. UTC | #13
On Thu, Oct 22, 2020 at 12:17:48PM +0200, Christian Eggers wrote:
> I tried to study the effect of setting the ocmode bit on the KSZ either to
> master or to slave. The main visible change is, that some PTP message types
> are be filtered out on RX:
> - in "master" mode, "Sync" messages from other nodes will not be received
> (but everything else like "Announce" seem to work)
> - in "slave" mode, "Delay_Req" messages from other nodes will not be received

Could you dump the contents of your REG_PTP_MSG_CONF2 register?
Christian Eggers Nov. 1, 2020, 9:35 a.m. UTC | #14
Hi Vladimir,

On Friday, 30 October 2020, 19:24:47 CET, Vladimir Oltean wrote:
> On Thu, Oct 22, 2020 at 12:17:48PM +0200, Christian Eggers wrote:
> > I tried to study the effect of setting the ocmode bit on the KSZ either to
> > master or to slave. The main visible change is, that some PTP message
> > types
> > are be filtered out on RX:
> > - in "master" mode, "Sync" messages from other nodes will not be received
> > (but everything else like "Announce" seem to work)
> > - in "slave" mode, "Delay_Req" messages from other nodes will not be
> > received
> Could you dump the contents of your REG_PTP_MSG_CONF2 register?
runtime register value is 0x1004 (matches default value from the data sheet).
The Linux driver doesn't touch this register. Below is a dump of all PTP
related (global) registers.

regards
Christian

    KSZ9563 (Ethernet switch)

      Global

        IEEE 1588 PTP
        CLKCTRL      0002      SWFA         enabled        CLKSADJ        NOP              PTPSD        subtract
                               CLKREAD      NOP            CLKWRITE       NOP              CLKCADJ      disabled
                               EN           enabled        RESET          normal
        RTCCP        0002      PHASE        16ns
        RTCNS        17D72FF0  NANOSECONDS    399978480
        RTCS         00000023  SECONDS               35
        RTCSUBNS     00000000  RATEDIR      subtract       TEMPADJ        permanent
                               SUBNS                  0
        RTCTMPADJ    00000000  CYCLES                 0
        MSGCFG1      007D      MODEEN       enabled        IEEE802.3      enabled          UDPv4        enabled
                               UDPv6        enabled        TCMODE         P2P              OCMODE       slave
        MSGCFG2      1004      UNICASTEN    both           ALTMASTER      disabled         PRIOTX       event only
                               CHKSYNFU     disabled       CHKDLY         disabled         CHKPDLY      disabled
                               DROP         disabled       CHKDOM         disabled         IPv4CHKSUM   calc
        DOMVER       0200      VERSION      2              DOMAIN            0
        UNITIDX      00000000  GPIO_IDX     GPIO_1         TS_IDX         Unit 0           TRIG_IDX     Unit 0
Vladimir Oltean Nov. 1, 2020, 11:10 a.m. UTC | #15
On Sun, Nov 01, 2020 at 10:35:01AM +0100, Christian Eggers wrote:
> Hi Vladimir,
> 
> On Friday, 30 October 2020, 19:24:47 CET, Vladimir Oltean wrote:
> > On Thu, Oct 22, 2020 at 12:17:48PM +0200, Christian Eggers wrote:
> > > I tried to study the effect of setting the ocmode bit on the KSZ either to
> > > master or to slave. The main visible change is, that some PTP message
> > > types
> > > are be filtered out on RX:
> > > - in "master" mode, "Sync" messages from other nodes will not be received
> > > (but everything else like "Announce" seem to work)
> > > - in "slave" mode, "Delay_Req" messages from other nodes will not be
> > > received
> > Could you dump the contents of your REG_PTP_MSG_CONF2 register?
> runtime register value is 0x1004 (matches default value from the data sheet).
> The Linux driver doesn't touch this register. Below is a dump of all PTP
> related (global) registers.

So the bit 5 ("Enable Dropping of Sync/Follow_Up and Delay_Req PTP
Messages") is not set. When the PTP messages are dropped, do you know
which error counter in ethtool -S is increasing?
Christian Eggers Nov. 1, 2020, 10:14 p.m. UTC | #16
On Sunday, 1 November 2020, 12:10:08 CET, Vladimir Oltean wrote:
> On Sun, Nov 01, 2020 at 10:35:01AM +0100, Christian Eggers wrote:
> > Hi Vladimir,
> > 
> > On Friday, 30 October 2020, 19:24:47 CET, Vladimir Oltean wrote:
> > > On Thu, Oct 22, 2020 at 12:17:48PM +0200, Christian Eggers wrote:
> > > > I tried to study the effect of setting the ocmode bit on the KSZ
> > > > either to
> > > > master or to slave. The main visible change is, that some PTP message
> > > > types
> > > > are be filtered out on RX:
> > > > - in "master" mode, "Sync" messages from other nodes will not be
> > > > received
> > > > (but everything else like "Announce" seem to work)
> > > > - in "slave" mode, "Delay_Req" messages from other nodes will not be
> > > > received
> > > 
> > > Could you dump the contents of your REG_PTP_MSG_CONF2 register?
> > 
> > runtime register value is 0x1004 (matches default value from the data
> > sheet). The Linux driver doesn't touch this register. Below is a dump of
> > all PTP related (global) registers.
> 
> So the bit 5 ("Enable Dropping of Sync/Follow_Up and Delay_Req PTP
> Messages") is not set. When the PTP messages are dropped, do you know
> which error counter in ethtool -S is increasing?
I am not sure whether I understand the question. My assumption is that the
KSZ9563 simply doesn't forward specific PTP packages from the slave ports to 
the CPU port. In my imagination this happens in hardware and is not visible in
software.

I have run "ethtool -S" two times on the PTP master clock (E2E mode):
- When "ocmode" is set to master (DelayReq messages can be received)
- When "ocmode" is set to slave (DelayReq messages cannot be received)

Here is the diff output:
--- /home/root/ethtool1
+++ /home/root/ethtool2
@@ -1,8 +1,8 @@
 NIC statistics:
-     tx_packets: 1421
-     tx_bytes: 133641
-     rx_packets: 488
-     rx_bytes: 35904
+     tx_packets: 1455
+     tx_bytes: 136783
+     rx_packets: 496
+     rx_bytes: 36459
      rx_hi: 0
      rx_undersize: 0
      rx_fragments: 0
@@ -14,10 +14,10 @@
      rx_mac_ctrl: 0
      rx_pause: 0
      rx_bcast: 4
-     rx_mcast: 667
+     rx_mcast: 683
      rx_ucast: 0
      rx_64_or_less: 0
-     rx_65_127: 659
+     rx_65_127: 675
      rx_128_255: 11
      rx_256_511: 1
      rx_512_1023: 0
@@ -27,15 +27,15 @@
      tx_hi: 0
      tx_late_col: 0
      tx_pause: 0
-     tx_bcast: 713
-     tx_mcast: 1852
-     tx_ucast: 324
+     tx_bcast: 733
+     tx_mcast: 1897
+     tx_ucast: 333
      tx_deferred: 0
      tx_total_col: 0
      tx_exc_col: 0
      tx_single_col: 0
      tx_mult_col: 0
-     rx_total: 61158
-     tx_total: 307225
+     rx_total: 62577
+     tx_total: 313773
      rx_discards: 20
      tx_discards: 0

regards
Christian
Vladimir Oltean Nov. 1, 2020, 11:41 p.m. UTC | #17
On Sun, Nov 01, 2020 at 11:14:24PM +0100, Christian Eggers wrote:
> I am not sure whether I understand the question.

When the port is in standalone mode, there is no other destination for
packets except for the CPU. So if the switch filters out Sync or
Delay_Req packets depending on the operating mode, then naturally those
packets would get dropped somewhere, and I wanted to see where.

> I have run "ethtool -S" two times on the PTP master clock (E2E mode):
> - When "ocmode" is set to master (DelayReq messages can be received)
> - When "ocmode" is set to slave (DelayReq messages cannot be received)
> 
> Here is the diff output:
> --- /home/root/ethtool1
> +++ /home/root/ethtool2
> @@ -1,8 +1,8 @@
>  NIC statistics:
> -     tx_packets: 1421
> -     tx_bytes: 133641
> -     rx_packets: 488
> -     rx_bytes: 35904
> +     tx_packets: 1455
> +     tx_bytes: 136783
> +     rx_packets: 496
> +     rx_bytes: 36459
>       rx_hi: 0
>       rx_undersize: 0
>       rx_fragments: 0
> @@ -14,10 +14,10 @@
>       rx_mac_ctrl: 0
>       rx_pause: 0
>       rx_bcast: 4
> -     rx_mcast: 667
> +     rx_mcast: 683
>       rx_ucast: 0
>       rx_64_or_less: 0
> -     rx_65_127: 659
> +     rx_65_127: 675
>       rx_128_255: 11
>       rx_256_511: 1
>       rx_512_1023: 0
> @@ -27,15 +27,15 @@
>       tx_hi: 0
>       tx_late_col: 0
>       tx_pause: 0
> -     tx_bcast: 713
> -     tx_mcast: 1852
> -     tx_ucast: 324
> +     tx_bcast: 733
> +     tx_mcast: 1897
> +     tx_ucast: 333
>       tx_deferred: 0
>       tx_total_col: 0
>       tx_exc_col: 0
>       tx_single_col: 0
>       tx_mult_col: 0
> -     rx_total: 61158
> -     tx_total: 307225
> +     rx_total: 62577
> +     tx_total: 313773
>       rx_discards: 20
>       tx_discards: 0

These counters seem to be on the front-panel port, and the discard seems
to not be recorded there. But nonetheless, even if you do see the drops
in the ethtool statistics for the CPU port, it's probably going to be at
'tx_discards' which is pretty unspecific and not useful. I was thinking
that the drop reason would be more elaborate, which it looks like it
isn't.

> My assumption is that the KSZ9563 simply doesn't forward specific PTP
> packages from the slave ports to the CPU port. In my imagination this
> happens in hardware and is not visible in software.

You talked about tracking the BMCA by snooping Announce messages. I
don't think that is going to be the path forward either. Think about the
general case, where there might not even be a BMCA (like in the
automotive profile).

It almost seems to me as if the hardware is trying to be helpful by
dropping the PTP messages that the application layer would drop anyway.
Too bad that nobody seems to have told them to make this helpful
mechanism optional, because as it is, it's standing in the way more than
helping.

You know what the real problem is, with DSA you don't have the concept
of the host port being an Ordinary Clock. DSA does not treat the host
port as a switched endpoint (aka a plain net device attached to a dumb
switch, and unaware of that switch), but instead is the conduit interface
for each front-panel switch interface, which is an individually
addressable network interface in its own right. You are not supposed to
use a DSA master interface for networking directly, not for regular
networking and not for PTP. In fact, DSA-enabled ports, only the PTP
clock of the switch is usable. If you attempt to run ptp4l on the master
interface an error will be thrown back at you.

Why am I mentioning this? Because the setting that's causing trouble for
us is 'port state of the host port OC', which in the context of what I
said above is nonsense. There _is_ no host port OC. There are 2 switch
ports which can act as individual OCs, or as a BC, or as a TC. But
consider the case when the switch is a BC, with one of the front-panel
ports being a master and the other a slave. What mode are you supposed
to put the host port in, so that it receives both the Sync packets from
the slave port, and the Delay_Req packets from the master port?! It just
makes no sense to me. In principle I don't see any reason why this
switch would not be able to operate as a one-step peer delay BC.

Unless somebody from Microchip could shed some light on the design
decisions of this switch (and there are enough Microchip people copied
already), here's what I would do. I would cut my losses and just support
peer delay, no E2E delay request-response mechanism (this means you'll
need to deploy peer delay to all devices within your network, but the
gains might be worth it). Because peer delay is symmetrical (both link
partners are both requestors as well as responders), there's no help in
the world that this switch could volunteer to give you in dropping
packets on your behalf. So I expect that if you hardcode:
- the port state for the host port OC as slave, then you'd get the Sync
  messages from all ports, and the Delay_Req messages would be dropped
  but you wouldn't care about those anyway, and
- the selection of TC mode to P2P TC.

Then I would negotiate with Richard whether it's ok to add these extra
values to enum hwtstamp_rx_filters:
	HWTSTAMP_FILTER_PTP_V2_PDELAY
	HWTSTAMP_FILTER_PTP_V2_L4_PDELAY

Given the fact that you're only limiting the timestamping to Pdelay
because we don't fully understand the hardware, I don't really know
whether introducing UAPI for this one situation is justifiable. If not,
then your driver will not have a chance to reject ptp4l -E, and it will
Simply Not Work.
Vladimir Oltean Nov. 1, 2020, 11:55 p.m. UTC | #18
On Mon, Nov 02, 2020 at 01:41:49AM +0200, Vladimir Oltean wrote:
> In principle I don't see any reason why this switch would not be able
> to operate as a one-step peer delay BC.

What I meant to say was "one-step E2E BC", since I was talking about
having to receive both Sync and Delay_Req at the same time, of course.
Christian Eggers Nov. 2, 2020, 10:35 a.m. UTC | #19
On Monday, 2 November 2020, 00:41:49 CET, Vladimir Oltean wrote:
> On Sun, Nov 01, 2020 at 11:14:24PM +0100, Christian Eggers wrote:
> > My assumption is that the KSZ9563 simply doesn't forward specific PTP
> > packages from the slave ports to the CPU port. In my imagination this
> > happens in hardware and is not visible in software.
> 
> You talked about tracking the BMCA by snooping Announce messages. I
> don't think that is going to be the path forward either. Think about the
> general case, where there might not even be a BMCA (like in the
> automotive profile).
Maybe my mail from October, 22 was ambiguous. I meant that despite of the 
presence of filtering, a BCMA algorithm should be about to work (as no 
Announce messages are filtered out).

Additionally I said, that switching between "master" and "slave" mode could 
not be done automatically by the driver, as the driver could at most detect 
the presence of Sync messages (indication for master mode), but would do hard 
to detect a transition to slave mode.

I see a chance that user space (ptp4l) could configure the appropriate 
"hardware filter setup" for master/slave mode. 

> It almost seems to me as if the hardware is trying to be helpful by
> dropping the PTP messages that the application layer would drop anyway.
> Too bad that nobody seems to have told them to make this helpful
> mechanism optional, because as it is, it's standing in the way more than
> helping.
I think the same. Maybe there is some undocumented "filter disable" bit, but 
this information must come from Microchip.

> You know what the real problem is, with DSA you don't have the concept
> of the host port being an Ordinary Clock. DSA does not treat the host
> port as a switched endpoint (aka a plain net device attached to a dumb
> switch, and unaware of that switch), but instead is the conduit interface
> for each front-panel switch interface, which is an individually
> addressable network interface in its own right. You are not supposed to
> use a DSA master interface for networking directly, not for regular
> networking and not for PTP. In fact, DSA-enabled ports, only the PTP
> clock of the switch is usable. If you attempt to run ptp4l on the master
> interface an error will be thrown back at you.
> 
> Why am I mentioning this? Because the setting that's causing trouble for
> us is 'port state of the host port OC', which in the context of what I
> said above is nonsense. There _is_ no host port OC. There are 2 switch
> ports which can act as individual OCs, or as a BC, or as a TC.
But the switch has only one clock at all. I assume that the switch cannot be a 
boundary clock, only TC seems possible.

> But
> consider the case when the switch is a BC, with one of the front-panel
> ports being a master and the other a slave. What mode are you supposed
> to put the host port in, so that it receives both the Sync packets from
> the slave port, and the Delay_Req packets from the master port?! It just
> makes no sense to me. In principle I don't see any reason why this
> switch would not be able to operate as a one-step peer delay BC.
> 
> Unless somebody from Microchip could shed some light on the design
> decisions of this switch (and there are enough Microchip people copied
> already), here's what I would do. I would cut my losses and just support
> peer delay, no E2E delay request-response mechanism (this means you'll
> need to deploy peer delay to all devices within your network, but the
> gains might be worth it). Because peer delay is symmetrical (both link
> partners are both requestors as well as responders), there's no help in
> the world that this switch could volunteer to give you in dropping
> packets on your behalf. So I expect that if you hardcode:
> - the port state for the host port OC as slave, then you'd get the Sync
>   messages from all ports, and the Delay_Req messages would be dropped
>   but you wouldn't care about those anyway, and
> - the selection of TC mode to P2P TC.
When using only P2P, setting the OCMODE bit to "slave" should work.

> Then I would negotiate with Richard whether it's ok to add these extra
> values to enum hwtstamp_rx_filters:
> 	HWTSTAMP_FILTER_PTP_V2_PDELAY
> 	HWTSTAMP_FILTER_PTP_V2_L4_PDELAY
>
As said above, having "filter setups" for E2E/P2P and for MASTER/SLAVE would 
probably fit well for this kind of hardware.

> Given the fact that you're only limiting the timestamping to Pdelay
> because we don't fully understand the hardware, I don't really know
> whether introducing UAPI for this one situation is justifiable. If not,
> then your driver will not have a chance to reject ptp4l -E, and it will
> Simply Not Work.
Vladimir Oltean Nov. 2, 2020, 12:28 p.m. UTC | #20
On Mon, Nov 02, 2020 at 11:35:00AM +0100, Christian Eggers wrote:
> Maybe my mail from October, 22 was ambiguous. I meant that despite of the
> presence of filtering, a BCMA algorithm should be about to work (as no
> Announce messages are filtered out).
> 
> Additionally I said, that switching between "master" and "slave" mode could
> not be done automatically by the driver, as the driver could at most detect
> the presence of Sync messages (indication for master mode), but would do hard
> to detect a transition to slave mode.
> 
> I see a chance that user space (ptp4l) could configure the appropriate
> "hardware filter setup" for master/slave mode.

The concept that you want from user space is hard to define.
You want ptp4l to make the driver aware of the port state, which is
something that happens at runtime and is not part of the "hardware
filter setup" as you say. Then, even if ptp4l notifies the driver of
port state, E2E BC setups would still be broken because that's how the
hardware works and no user space assistance can help with that. Also,
what abstraction do you plan using for programming the PTP port state
into the kernel.

Maybe you should optimize for what you plan to use. If you need to use
an E2E profile, maybe it would be worth the effort to find an
appropriate abstraction for this port state thing. If you only use it
for testing, then maybe it would be a good idea to keep the sysfs in
your tree.

> > Why am I mentioning this? Because the setting that's causing trouble for
> > us is 'port state of the host port OC', which in the context of what I
> > said above is nonsense. There _is_ no host port OC. There are 2 switch
> > ports which can act as individual OCs, or as a BC, or as a TC.
> But the switch has only one clock at all. I assume that the switch cannot be a
> boundary clock, only TC seems possible.

Why would a P2P BC not work? It does not require more than one clock.

> As said above, having "filter setups" for E2E/P2P and for MASTER/SLAVE would
> probably fit well for this kind of hardware.

For E2E/P2P is one thing, for master/slave is a completely different
thing.
Christian Eggers Nov. 5, 2020, 8:18 p.m. UTC | #21
Hi Vladimir,

On Thursday, 22 October 2020, 13:32:43 CET, Vladimir Oltean wrote:
> On Thu, Oct 22, 2020 at 01:11:40PM +0200, Christian Eggers wrote:
> > On Thursday, 22 October 2020, 12:50:14 CEST, Vladimir Oltean wrote:
> > after applying the RX timestamp correctly to the correction field
> > (shifting
> > the nanoseconds by 16),
> 
> That modification should have been done anyway, since the unit of
> measurement for correctionField is scaled ppb (48 bits nanoseconds, 16
> bits scaled nanoseconds), and not nanoseconds.
> 
> > it seems that "moving" the timestamp back to the tail tag on TX is not
> > required anymore. Keeping the RX timestamp simply in the correction
> > field (negative value), works fine now. So this halves the effort in
> > the tag_ksz driver.

unfortunately I made a mistake when testing. Actually the timestamp *must* be
moved from the correction field (negative) to the egress tail tag.

> Ok, this makes sense.
> Depending on what Richard responds, it now looks like the cleanest
> approach would be to move your implementation that is currently in
> ksz9477_update_ptp_correction_field() into a generic function called
> 
> static inline void ptp_onestep_p2p_move_t2_to_correction(struct sk_buff
> *skb, unsigned int ptp_type,
>  struct ptp_header *ptp_header,
> ktime_t t2)
I have implemented this in ptp_classify.h. Passing t2 instead of the correction
field itself is fine for rx, but as this function is now still required for
transmit, it looks a little bit misused there (see below).

Shall I keep it as below, or revert it to passing value of the correction field
itself?

regards
Christian


static void ksz9477_xmit_timestamp(struct sk_buff *skb)
{
	struct sk_buff *clone = DSA_SKB_CB(skb)->clone;
	struct ptp_header *ptp_hdr;
	u32 tstamp_raw = 0;
	u64 correction;

	if (!clone)
		goto out_put_tag;

	/* Use cached PTP header and type from ksz9477_ptp_should_tstamp(). Note
	 * that KSZ9477_SKB_CB(clone)->ptp_header != NULL implies that this is a
	 * Pdelay_resp message.
	 */
	ptp_hdr = KSZ9477_SKB_CB(clone)->ptp_header;
	if (!ptp_hdr)
		goto out_put_tag;

	correction = get_unaligned_be64(&ptp_hdr->correction);

	/* For PDelay_Resp messages we will likely have a negative value in the
	 * correction field (see ksz9477_rcv()). The switch hardware cannot
	 * correctly update such values, so it must be moved to the time stamp
	 * field in the tail tag.
	 */
	if ((s64)correction < 0) {
		unsigned int ptp_type = KSZ9477_SKB_CB(clone)->ptp_type;
		struct timespec64 ts;
		u64 ns;

		/* Move ingress time stamp from PTP header's correction field to
		 * tail tag. Format of the correction filed is 48 bit ns + 16
		 * bit fractional ns.  Avoid shifting negative numbers.
		 */
		ns = -((s64)correction) >> 16;
		ts = ns_to_timespec64(ns);
		tstamp_raw = ((ts.tv_sec & 3) << 30) | ts.tv_nsec;

>>>		/* Set correction field to 0 (by subtracting the negative value)
>>>		 * and update UDP checksum.
>>>		 */
>>>		ptp_onestep_p2p_move_t2_to_correction(skb, ptp_type, ptp_hdr, ns_to_ktime(-ns));
	}

out_put_tag:
	put_unaligned_be32(tstamp_raw, skb_put(skb, KSZ9477_PTP_TAG_LEN));
}


Addtionally ptp_onestep_p2p_move_t2_to_correction() must be able to handle negative values:

static inline
void ptp_onestep_p2p_move_t2_to_correction(struct sk_buff *skb,
					   unsigned int type,
					   struct ptp_header *hdr,
					   ktime_t t2)
{
	u8 *ptr = skb_mac_header(skb);
	struct udphdr *uhdr = NULL;
	s64 ns = ktime_to_ns(t2);
	__be64 correction_old;
	s64 correction;

	/* previous correction value is required for checksum update. */
	memcpy(&correction_old,  &hdr->correction, sizeof(correction_old));
	correction = (s64)be64_to_cpu(correction_old);

	/* PTP correction field consists of 32 bit nanoseconds and 16 bit
	 * fractional nanoseconds.  Avoid shifting negative numbers.
	 */
>>>	if (ns >= 0)
>>>		correction -= ns << 16;
>>>	else
>>>		correction += -ns << 16;

	/* write new correction value */
	put_unaligned_be64((u64)correction, &hdr->correction);
...
}
Vladimir Oltean Nov. 10, 2020, 1:42 a.m. UTC | #22
Sorry for getting back late to you. It did not compute when I read your
email the first time around, then I let it sit for a while.

On Thu, Nov 05, 2020 at 09:18:04PM +0100, Christian Eggers wrote:
> unfortunately I made a mistake when testing. Actually the timestamp *must* be
> moved from the correction field (negative) to the egress tail tag.

That doesn't sound very good at all.

I have a simple question. Can your driver, when operating as a PTP
master clock, distribute a time in 2020 into your network? (you can
check with "phc_ctl /dev/ptp0 get")
Christian Eggers Nov. 10, 2020, 2:36 p.m. UTC | #23
On Tuesday, 10 November 2020, 02:42:34 CET, Vladimir Oltean wrote:
> Sorry for getting back late to you. It did not compute when I read your
> email the first time around, then I let it sit for a while.
> 
> On Thu, Nov 05, 2020 at 09:18:04PM +0100, Christian Eggers wrote:
> > unfortunately I made a mistake when testing. Actually the timestamp *must*
> > be moved from the correction field (negative) to the egress tail tag.
> That doesn't sound very good at all.
I think that is no drawback. It is implemented and works.

> I have a simple question. Can your driver, when operating as a PTP
> master clock, distribute a time in 2020 into your network? (you can
> check with "phc_ctl /dev/ptp0 get")
# phc_ctl /dev/ptp2 get
phc_ctl[2141.093]: clock time is 1605018671.788481458 or Tue Nov 10 14:31:11 2020

The KSZ PTP clock has 32 bit seconds (and 30 bit nanoseconds).

I hope I can send the new series tonight. Had today no 5 minutes without the telephone ringing...
Vladimir Oltean Nov. 10, 2020, 4:40 p.m. UTC | #24
On Tue, Nov 10, 2020 at 03:36:02PM +0100, Christian Eggers wrote:
> On Tuesday, 10 November 2020, 02:42:34 CET, Vladimir Oltean wrote:
> > Sorry for getting back late to you. It did not compute when I read your
> > email the first time around, then I let it sit for a while.
> >
> > On Thu, Nov 05, 2020 at 09:18:04PM +0100, Christian Eggers wrote:
> > > unfortunately I made a mistake when testing. Actually the timestamp *must*
> > > be moved from the correction field (negative) to the egress tail tag.
> > That doesn't sound very good at all.
> I think that is no drawback. It is implemented and works.

It works, but how?

The reason why I'm asking you is this. I can guess that this hardware
plays by a very simple song when performing one-step TX timestamping.

correctionField_New = correctionField_Old + (t_TX - t_Tail_Tag)

It does this because, as far as the hardware is concerned, it needs to
make no difference between event messages, be they Sync, Pdelay_Req or
Pdelay_Resp messages.

First, let's see if we agree what happens for a Sync.

          |                    |
       t1 |------\ Sync        |
          |       ------\      |
          |              ----->| t2
Master    |                    |     Slave
Clock     |                    |     Clock
          |                    |

When receiving the one-step Sync message, the Slave Clock must add the
correctionField to the originTimestamp in order to figure out what is
the t1 it should use. Usually, the correctionField would only be updated
by transparent clocks. But when using one-step TX timestamping, it is
not mandatory for master clocks to send the correctionField as zero
(something which is mandatory for two-step). So they don't.

Here's what the IEEE 1588 standard says:

-----------------------------[cut here]-----------------------------
9.5.9.3 One-step clocks
-----------------------

The originTimestamp field of the Sync message shall be an estimate no
worse than ±1 s of the <syncEventEgressTimestamp> excluding any
fractional nanoseconds.
-----------------------------[cut here]-----------------------------

In practice, the Master Clock will probably set the originTimestamp to
something more or less arbitrary by reading the current PTP time of the
NIC, then it will let the MAC rewrite the correctionField with the
precise t_TX (hardware TX timestamp) minus the value from the
originTimestamp field*.

*Actually parsing the packet at that rate, while also rewriting and
timestamping it, might be too tricky and too late, so the MAC, instead
of reading the originTimestamp from the actual Sync message, will
actually require you, the driver writer, to pass them the value of the
originTimestamp twice, this time also through the tail tag (or through
whatever other buffer metadata that hardware might use).

So this formula would do that job:

correctionField_New = correctionField_Old + (t_TX - t_Tail_Tag)

where t_TX is the MAC TX timestamp and t_Tail_Tag should be the
approximate originTimestamp of the Sync message.

I am fairly confident that this is how your hardware works, because
that's also how peer delay wants to be timestamped, it seems. It's just
that in the case of Pdelay_Resp, t_Tail_Tag must be set to t2 (the
precise timestamp of the Pdelay_Req), in order for the correctionField
for that Pdelay_Resp message to contain (t3 - t2):

          |                    |
       t1 |------\ Pdelay_Req  |
          |       ------\      |
          |              ----->| t2
Clock A   |                    |     Clock B
          | Pdelay_Resp /------| t3
          |       ------       |
       t4 |<-----/             |
          |                    |

You don't do any of that here. So what must be happening in your case is
that the originTimestamp for Sync messages is always zero, and the
correctionField provides the rest of t1, in its entirety. That's because
in your tagger, you also leave the t_Tail_Tag as zero for Sync messages
(you only touch it for peer delay). Is that ok? Well, on paper yes,
because 0 + t1 = t1 + 0 = t1 (either the correctionField or the
originTimestamp could be zero, and they would add up to the same
number), but in practice, the bit width of the originTimestamp is 64
bits, whereas the correctionField only has 48 bits for nanoseconds, the
lower 16 bits being for fixed point. So if you're going to keep your
entire t1 into the correctionField, then in practice your master clock
can only advertise a time that holds 48 bits worth of nanoseconds, i.e.
3 1/4 days worth of TAI starting with Jan 1st 1970.  Then it would wrap
around. So how come this is not what happens in your case? I don't see
you update the originTimestamp nor the t_Tail_Tag for Sync messages.

The arithmetic that is done on the correctionField should be simple
48-bit two's complement. It should not matter if the correctionField is
positive or negative. The only case I believe it would matter that the
correctionField is negative is if the arithmetic is not actually on 48
bit two's complement, but on the partial timestamps directly (i.e. 32
bit two's complement). What I'm saying is that this formula is probably
calculated in hardware in two steps:

correctionField_New = correctionField_Old + (t_TX - t_Tail_Tag)

Step 1:
t_tmp = t_TX (32 bits) - t_Tail_Tag (32 bits), using 32-bit two's complement arithmetic

Step 2:
correctionField_New (48 bits) = correctionField_Old (48 bits) + t_tmp (32 bits),
using 48-bit two's complement arithmetic, and where t_tmp is probably
treated as an unsigned integer that is zero-padded up to 48 bits

I am getting the feeling that there are still some things we're missing
in this series. I would not hurry to send a v2 until they're clear.

I hope that you're not testing PTP just between yourself and yourself.
It's really easy for bugs to cancel out.
Vladimir Oltean Nov. 10, 2020, 7:32 p.m. UTC | #25
On Tue, Nov 10, 2020 at 06:40:45PM +0200, Vladimir Oltean wrote:
> I am fairly confident that this is how your hardware works, because
> that's also how peer delay wants to be timestamped, it seems.

So I was confident and also wrong, it appears. My KSZ9477 datasheet
says:

In the host-to-switch direction, the 4-byte timestamp field is always
present when PTP is enabled, as shown in Figure 4-6. This is true for
all packets sent by the host, including IBA packets. The host uses this
field to insert the receive timestamp from PTP Pdelay_Req messages into
the Pdelay_Resp messages. For all other traffic and PTP message types,
the host should populate the timestamp field with zeros.

Hm. Does that mean that the switch updates the originTimestamp field of
the Sync frames by itself? Ok... Very interesting that they decided to
introduce a field in the tail tag for a single type of PTP message.

But something is still wrong if you need to special-case the negative
correctionField, it looks like the arithmetic is not done on the correct
number of bits, either by the driver or by the hardware.

And zeroing out the correctionField of the Pdelay_Resp on transmission,
to put that value into t_Tail_Tag? How can you squeeze a 48-bit value
into a 32-bit value without truncation?
Christian Eggers Nov. 11, 2020, 9:49 p.m. UTC | #26
Hi Vladimir,

On Tuesday, 10 November 2020, 20:32:45 CET, Vladimir Oltean wrote:
> On Tue, Nov 10, 2020 at 06:40:45PM +0200, Vladimir Oltean wrote:
> > I am fairly confident that this is how your hardware works, because
> > that's also how peer delay wants to be timestamped, it seems.
> 
> So I was confident and also wrong, it appears. My KSZ9477 datasheet
> says:
> 
> In the host-to-switch direction, the 4-byte timestamp field is always
> present when PTP is enabled, as shown in Figure 4-6. This is true for
> all packets sent by the host, including IBA packets. The host uses this
> field to insert the receive timestamp from PTP Pdelay_Req messages into
> the Pdelay_Resp messages. For all other traffic and PTP message types,
> the host should populate the timestamp field with zeros.
> 
> Hm. Does that mean that the switch updates the originTimestamp field of
> the Sync frames by itself?
IMHO this is the best solution. User space / driver do not know the exact time 
(would require slow I2C transfer). So inserting the time in hardware seems to 
be the better solution. Maybe this is what the data sheet meant with "egress 
timestamp insertion".

> Ok... Very interesting that they decided to
> introduce a field in the tail tag for a single type of PTP message.

> But something is still wrong if you need to special-case the negative
> correctionField, it looks like the arithmetic is not done on the correct
> number of bits, either by the driver or by the hardware.
> 
Maybe I found the formula which is (should) applied to the correction field 
for PDelayResp:

correction = correction_old + now + residental_delay + tx_latency - tail_tag

<correction_old>: current value from the PTP header
<now>: Time of the PTP clock when entering the switch
<residental_delay>: Switching delay
<Tx latency>: Delay between time stamping unit and wire (configurable via 
register)
<tail_tag>: Time stamp in the tail tag

The new correction value has been captured with Wireshark. For the measurement 
I simply halted the internal PTP clock and set it to zero (so it's value is 
exactly known). In the port's TX time stamp unit I got a non-zero time stamp 
anyway (between, 10 to 40 ns), so this must be the residential delay.

I tested with different values for the correction field and the tail_tag. 
Negative values are no problem, but the calculation seems no to consider all 
bits.

Measurements:
- PTP clock: 0 (frozen)
- Residential delay: variable
- TX delay: 45 ns (default)

- correction = 0xffff ffff ffff.ffff (-1.xxx ns)
correction = correction + now + residental_delay + tx_latency - tail_tag
           =          -1  + 0                 40 +         45          0
           = 84             (0000 0000 0054)
           wireshark:       (0000 0000 0054) --> correct

correction = correction + now + residental_delay + tx_latency - tail_tag
           =          -1  + 0                 32 +         45      1.000
           = -924           (FFFF FFFF FC64)
           wireshark:       (0000 FFFF FC64) --> wrong

correction = correction + now + residental_delay + tx_latency - tail_tag
           =          -1  + 0                 24 +         45   2.000.000.000 
           = -1.926.258.108 (FFFF 8D2F A244)
           wireshark:       (0000 7B9A CA44) --> wrong

- correction = 0xffff ffff 0000.0000 (-65536.0 ns)
correction = correction + now + residental_delay + tx_latency - tail_tag
           =     -65536     0                 24 +         45          0
           =     -65467     (FFFF FFFF 0045)
           wireshark:       (FFFF FFFF 0045) --> correct

correction = correction + now + residental_delay + tx_latency - tail_tag
           =     -65536   + 0                 32 +         45      1.000
           =     -66459     (FFFF FFFE FC65)
           wireshark:       (0000 FFFE FC65) --> wrong


Please note that the tail tag consist of 2 bits for seconds and 30 bit 
nanoseconds. So the value of 2.000.000.000 means 1s + 926.258.176 ns.

As you are better in 2's complement as me, you can give me some more 
combinations for testing if you need. But in the end it looks like I should 
keep T2 in the tail tag.

> And zeroing out the correctionField of the Pdelay_Resp on transmission,
> to put that value into t_Tail_Tag? How can you squeeze a 48-bit value
> into a 32-bit value without truncation?
Only the lower bits are used. As long as PDelayResp doesn't take more than 4 
seconds, this should be enough.

regards
Christian
Christian Eggers Nov. 11, 2020, 9:50 p.m. UTC | #27
Hi Vladimir,

On Tuesday, 10 November 2020, 20:32:45 CET, Vladimir Oltean wrote:
> On Tue, Nov 10, 2020 at 06:40:45PM +0200, Vladimir Oltean wrote:
> > I am fairly confident that this is how your hardware works, because
> > that's also how peer delay wants to be timestamped, it seems.
> 
> So I was confident and also wrong, it appears. My KSZ9477 datasheet
> says:
> 
> In the host-to-switch direction, the 4-byte timestamp field is always
> present when PTP is enabled, as shown in Figure 4-6. This is true for
> all packets sent by the host, including IBA packets. The host uses this
> field to insert the receive timestamp from PTP Pdelay_Req messages into
> the Pdelay_Resp messages. For all other traffic and PTP message types,
> the host should populate the timestamp field with zeros.
> 
> Hm. Does that mean that the switch updates the originTimestamp field of
> the Sync frames by itself?
IMHO this is the best solution. User space / driver do not know the exact time 
(would require slow I2C transfer). So inserting the time in hardware seems to 
be the better solution. Maybe this is what the data sheet meant with "egress 
timestamp insertion".

> Ok... Very interesting that they decided to
> introduce a field in the tail tag for a single type of PTP message.

> But something is still wrong if you need to special-case the negative
> correctionField, it looks like the arithmetic is not done on the correct
> number of bits, either by the driver or by the hardware.
> 
Maybe I found the formula which is (should) applied to the correction field 
for PDelayResp:

correction = correction_old + now + residental_delay + tx_latency - tail_tag

<correction_old>: current value from the PTP header
<now>: Time of the PTP clock when entering the switch
<residental_delay>: Switching delay
<Tx latency>: Delay between time stamping unit and wire (configurable via 
register)
<tail_tag>: Time stamp in the tail tag

The new correction value has been captured with Wireshark. For the measurement 
I simply halted the internal PTP clock and set it to zero (so it's value is 
exactly known). In the port's TX time stamp unit I got a non-zero time stamp 
anyway (between, 10 to 40 ns), so this must be the residential delay.

I tested with different values for the correction field and the tail_tag. 
Negative values are no problem, but the calculation seems no to consider all 
bits.

Measurements:
- PTP clock: 0 (frozen)
- Residential delay: variable
- TX delay: 45 ns (default)

- correction = 0xffff ffff ffff.ffff (-1.xxx ns)
correction = correction + now + residental_delay + tx_latency - tail_tag
           =          -1  + 0                 40 +         45          0
           = 84             (0000 0000 0054)
           wireshark:       (0000 0000 0054) --> correct

correction = correction + now + residental_delay + tx_latency - tail_tag
           =          -1  + 0                 32 +         45      1.000
           = -924           (FFFF FFFF FC64)
           wireshark:       (0000 FFFF FC64) --> wrong

correction = correction + now + residental_delay + tx_latency - tail_tag
           =          -1  + 0                 24 +         45   2.000.000.000 
           = -1.926.258.108 (FFFF 8D2F A244)
           wireshark:       (0000 7B9A CA44) --> wrong

- correction = 0xffff ffff 0000.0000 (-65536.0 ns)
correction = correction + now + residental_delay + tx_latency - tail_tag
           =     -65536     0                 24 +         45          0
           =     -65467     (FFFF FFFF 0045)
           wireshark:       (FFFF FFFF 0045) --> correct

correction = correction + now + residental_delay + tx_latency - tail_tag
           =     -65536   + 0                 32 +         45      1.000
           =     -66459     (FFFF FFFE FC65)
           wireshark:       (0000 FFFE FC65) --> wrong


Please note that the tail tag consist of 2 bits for seconds and 30 bit 
nanoseconds. So the value of 2.000.000.000 means 1s + 926.258.176 ns.

As you are better in 2's complement as me, you can give me some more 
combinations for testing if you need. But in the end it looks like I should 
keep T2 in the tail tag.

> And zeroing out the correctionField of the Pdelay_Resp on transmission,
> to put that value into t_Tail_Tag? How can you squeeze a 48-bit value
> into a 32-bit value without truncation?
Only the lower bits are used. As long as PDelayResp doesn't take more than 4 
seconds, this should be enough.

regards
Christian
Christian Eggers Nov. 12, 2020, 3:28 p.m. UTC | #28
Hi Vladimir,

On Tuesday, 10 November 2020, 20:32:45 CET, Vladimir Oltean wrote:
> But something is still wrong if you need to special-case the negative
> correctionField, it looks like the arithmetic is not done on the correct
> number of bits, either by the driver or by the hardware.
I got it! The problem was caused because I subtracted the "full" timestamp 
(reconstructed 64 bit seconds) from the correction field. When the KSZ updates 
the correction field of the PDelay_Resp message on tx, only the "partial" (2 
bit seconds) egress time stamp will be added. So the invalid values in the 
correction field were caused because much more seconds were subtracted on rx 
than added on tx.

> And zeroing out the correctionField of the Pdelay_Resp on transmission,
> to put that value into t_Tail_Tag? How can you squeeze a 48-bit value
> into a 32-bit value without truncation?
The answer is simple: Only the lower 32 bit (tail tag) must be subtracted as 
only 32 bit will be added on egress.

v2 is on the way...

regards
Christian
Vladimir Oltean Nov. 12, 2020, 3:38 p.m. UTC | #29
On Thu, Nov 12, 2020 at 04:28:44PM +0100, Christian Eggers wrote:
> the invalid values in the correction field were caused because much
> more seconds were subtracted on rx than added on tx.
[...]
> v2 is on the way...

Alright :)
Christian Eggers Nov. 17, 2020, 11:27 a.m. UTC | #30
On Thursday, 12 November 2020, 16:28:44 CET, Christian Eggers wrote:
> Hi Vladimir,
> 
> On Tuesday, 10 November 2020, 20:32:45 CET, Vladimir Oltean wrote:
> > But something is still wrong if you need to special-case the negative
> > correctionField, it looks like the arithmetic is not done on the correct
> > number of bits, either by the driver or by the hardware.
> 
> I got it! 
I got it not!

While keeping the (negative) correction field works perfect when using
PTP over L2 (what I did the last weeks), this causes an unwanted side effect
when using UDP:

...
> User Datagram Protocol, Src Port: 319, Dst Port: 319
> 
>     Source Port: 319
>     Destination Port: 319
>     Length: 62
>  
>  Checksum: 0x2285 incorrect, should be 0x2286 (maybe caused by "UDP checksum offload"?)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>     [Checksum Status: Bad]
>     [Stream index: 0]
>     [Timestamps]
> 
> Precision Time Protocol (IEEE1588)
> 
>     0000 .... = transportSpecific: 0x0
>     .... 0011 = messageId: Peer_Delay_Resp Message (0x3)
>     0000 .... = Reserved: 0
>     .... 0010 = versionPTP: 2
>     messageLength: 54
>     subdomainNumber: 0
>     Reserved: 0
>     flags: 0x0000
>     correction: 5579788,000000 nanoseconds
>     Reserved: 0
>     ClockIdentity: 0x849000fffe0980f7
>     SourcePortID: 1
>     sequenceId: 785
>     control: Other Message (5)
>     logMessagePeriod: 127
>     requestreceiptTimestamp (seconds): 0
>     requestreceiptTimestamp (nanoseconds): 0
>     requestingSourcePortIdentity: 0x849000fffe0980f6
>     requestingSourcePortId: 2
While correction field is ok (residential delay ~5ms, using one printk...),
the UDP checksum is off by one in all PDelay_Resp messages.

The KSZ device offers on option to set the UDP checksum to zero, but this also
didn't help and additionally wouldn't work for IPv6.

It seems that I should return to "moving T2 from the correction field to the
tail tag" on tx.
 
regards
Christian
diff mbox series

Patch

diff --git a/drivers/net/dsa/microchip/ksz9477_main.c b/drivers/net/dsa/microchip/ksz9477_main.c
index 7d623400139f..42cd17c8c25d 100644
--- a/drivers/net/dsa/microchip/ksz9477_main.c
+++ b/drivers/net/dsa/microchip/ksz9477_main.c
@@ -1388,6 +1388,7 @@  static const struct dsa_switch_ops ksz9477_switch_ops = {
 	.phy_read		= ksz9477_phy_read16,
 	.phy_write		= ksz9477_phy_write16,
 	.phylink_mac_link_down	= ksz_mac_link_down,
+	.get_ts_info		= ksz9477_ptp_get_ts_info,
 	.port_enable		= ksz_enable_port,
 	.get_strings		= ksz9477_get_strings,
 	.get_ethtool_stats	= ksz_get_ethtool_stats,
@@ -1408,6 +1409,11 @@  static const struct dsa_switch_ops ksz9477_switch_ops = {
 	.port_mdb_del           = ksz9477_port_mdb_del,
 	.port_mirror_add	= ksz9477_port_mirror_add,
 	.port_mirror_del	= ksz9477_port_mirror_del,
+	.port_hwtstamp_get      = ksz9477_ptp_port_hwtstamp_get,
+	.port_hwtstamp_set      = ksz9477_ptp_port_hwtstamp_set,
+	.port_txtstamp          = ksz9477_ptp_port_txtstamp,
+	/* never defer rx delivery, tstamping is done via tail tagging */
+	.port_rxtstamp          = NULL,
 };
 
 static u32 ksz9477_get_port_addr(int port, int offset)
@@ -1554,7 +1560,8 @@  static irqreturn_t ksz9477_switch_irq_thread(int irq, void *dev_id)
 			if (ret)
 				return result;
 
-			/* ToDo: Add specific handling of port interrupts */
+			if (data8 & PORT_PTP_INT)
+				result |= ksz9477_ptp_port_interrupt(dev, port);
 		}
 	}
 
diff --git a/drivers/net/dsa/microchip/ksz9477_ptp.c b/drivers/net/dsa/microchip/ksz9477_ptp.c
index 44d7bbdea518..d4bcfff0577e 100644
--- a/drivers/net/dsa/microchip/ksz9477_ptp.c
+++ b/drivers/net/dsa/microchip/ksz9477_ptp.c
@@ -6,7 +6,10 @@ 
  * Copyright (c) 2020 ARRI Lighting
  */
 
+#include <linux/net_tstamp.h>
+#include <linux/ptp_classify.h>
 #include <linux/ptp_clock_kernel.h>
+#include <linux/sysfs.h>
 
 #include "ksz_common.h"
 #include "ksz9477_reg.h"
@@ -71,8 +74,10 @@  static int ksz9477_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 static int ksz9477_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 {
 	struct ksz_device *dev = container_of(ptp, struct ksz_device, ptp_caps);
+	struct timespec64 delta64 = ns_to_timespec64(delta);
 	s32 sec, nsec;
 	u16 data16;
+	unsigned long flags;
 	int ret;
 
 	mutex_lock(&dev->ptp_mutex);
@@ -103,6 +108,10 @@  static int ksz9477_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 	if (ret)
 		goto error_return;
 
+	spin_lock_irqsave(&dev->ptp_clock_lock, flags);
+	dev->ptp_clock_time = timespec64_add(dev->ptp_clock_time, delta64);
+	spin_unlock_irqrestore(&dev->ptp_clock_lock, flags);
+
 error_return:
 	mutex_unlock(&dev->ptp_mutex);
 	return ret;
@@ -160,6 +169,7 @@  static int ksz9477_ptp_settime(struct ptp_clock_info *ptp, struct timespec64 con
 {
 	struct ksz_device *dev = container_of(ptp, struct ksz_device, ptp_caps);
 	u16 data16;
+	unsigned long flags;
 	int ret;
 
 	mutex_lock(&dev->ptp_mutex);
@@ -198,6 +208,10 @@  static int ksz9477_ptp_settime(struct ptp_clock_info *ptp, struct timespec64 con
 	if (ret)
 		goto error_return;
 
+	spin_lock_irqsave(&dev->ptp_clock_lock, flags);
+	dev->ptp_clock_time = *ts;
+	spin_unlock_irqrestore(&dev->ptp_clock_lock, flags);
+
 error_return:
 	mutex_unlock(&dev->ptp_mutex);
 	return ret;
@@ -208,9 +222,27 @@  static int ksz9477_ptp_enable(struct ptp_clock_info *ptp, struct ptp_clock_reque
 	return -ENOTTY;
 }
 
+static long ksz9477_ptp_do_aux_work(struct ptp_clock_info *ptp)
+{
+	struct ksz_device *dev = container_of(ptp, struct ksz_device, ptp_caps);
+	struct timespec64 ts;
+	unsigned long flags;
+
+	mutex_lock(&dev->ptp_mutex);
+	_ksz9477_ptp_gettime(dev, &ts);
+	mutex_unlock(&dev->ptp_mutex);
+
+	spin_lock_irqsave(&dev->ptp_clock_lock, flags);
+	dev->ptp_clock_time = ts;
+	spin_unlock_irqrestore(&dev->ptp_clock_lock, flags);
+
+	return HZ;  /* reschedule in 1 second */
+}
+
 static int ksz9477_ptp_start_clock(struct ksz_device *dev)
 {
 	u16 data;
+	unsigned long flags;
 	int ret;
 
 	ret = ksz_read16(dev, REG_PTP_CLK_CTRL, &data);
@@ -230,6 +262,11 @@  static int ksz9477_ptp_start_clock(struct ksz_device *dev)
 	if (ret)
 		return ret;
 
+	spin_lock_irqsave(&dev->ptp_clock_lock, flags);
+	dev->ptp_clock_time.tv_sec = 0;
+	dev->ptp_clock_time.tv_nsec = 0;
+	spin_unlock_irqrestore(&dev->ptp_clock_lock, flags);
+
 	return 0;
 }
 
@@ -251,11 +288,349 @@  static int ksz9477_ptp_stop_clock(struct ksz_device *dev)
 	return 0;
 }
 
+/* Time stamping support */
+
+static int ksz9477_ptp_enable_mode(struct ksz_device *dev)
+{
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, REG_PTP_MSG_CONF1, &data);
+	if (ret)
+		return ret;
+
+	/* Enable PTP mode */
+	data |= PTP_ENABLE;
+	ret = ksz_write16(dev, REG_PTP_MSG_CONF1, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int ksz9477_ptp_disable_mode(struct ksz_device *dev)
+{
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, REG_PTP_MSG_CONF1, &data);
+	if (ret)
+		return ret;
+
+	/* Disable PTP mode */
+	data &= ~PTP_ENABLE;
+	ret = ksz_write16(dev, REG_PTP_MSG_CONF1, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int ksz9477_ptp_enable_port_ptp_interrupts(struct ksz_device *dev, int port)
+{
+	u32 addr = PORT_CTRL_ADDR(port, REG_PORT_INT_MASK);
+	u8 data;
+	int ret;
+
+	ret = ksz_read8(dev, addr, &data);
+	if (ret)
+		return ret;
+
+	/* Enable port PTP interrupt (0 means enabled) */
+	data &= ~PORT_PTP_INT;
+	ret = ksz_write8(dev, addr, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int ksz9477_ptp_disable_port_ptp_interrupts(struct ksz_device *dev, int port)
+{
+	u32 addr = PORT_CTRL_ADDR(port, REG_PORT_INT_MASK);
+	u8 data;
+	int ret;
+
+	ret = ksz_read8(dev, addr, &data);
+	if (ret)
+		return ret;
+
+	/* Enable port PTP interrupt (1 means disabled) */
+	data |= PORT_PTP_INT;
+	ret = ksz_write8(dev, addr, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int ksz9477_ptp_enable_port_egress_interrupts(struct ksz_device *dev, int port)
+{
+	u32 addr = PORT_CTRL_ADDR(port, REG_PTP_PORT_TX_INT_ENABLE__2);
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, addr, &data);
+	if (ret)
+		return ret;
+
+	/* Enable port xdelay egress timestamp interrupt (1 means enabled) */
+	data |= PTP_PORT_XDELAY_REQ_INT;
+	ret = ksz_write16(dev, addr, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int ksz9477_ptp_disable_port_egress_interrupts(struct ksz_device *dev, int port)
+{
+	u32 addr = PORT_CTRL_ADDR(port, REG_PTP_PORT_TX_INT_ENABLE__2);
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, addr, &data);
+	if (ret)
+		return ret;
+
+	/* Disable port xdelay egress timestamp interrupts (0 means disabled) */
+	data &= PTP_PORT_XDELAY_REQ_INT;
+	ret = ksz_write16(dev, addr, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int ksz9477_ptp_port_init(struct ksz_device *dev, int port)
+{
+	struct ksz_port *prt = &dev->ports[port];
+	int ret;
+
+	/* Read rx and tx delay from port registers */
+	ret = ksz_read16(dev, PORT_CTRL_ADDR(port, REG_PTP_PORT_RX_DELAY__2),
+			 &prt->tstamp_rx_latency_ns);
+	if (ret)
+		return ret;
+
+	ret = ksz_read16(dev, PORT_CTRL_ADDR(port, REG_PTP_PORT_TX_DELAY__2),
+			 &prt->tstamp_tx_latency_ns);
+	if (ret)
+		return ret;
+
+	if (port != dev->cpu_port) {
+		ret = ksz9477_ptp_enable_port_ptp_interrupts(dev, port);
+		if (ret)
+			return ret;
+
+		ret = ksz9477_ptp_enable_port_egress_interrupts(dev, port);
+		if (ret)
+			goto error_disable_port_ptp_interrupts;
+	}
+
+	return 0;
+
+error_disable_port_ptp_interrupts:
+	if (port != dev->cpu_port)
+		ksz9477_ptp_disable_port_ptp_interrupts(dev, port);
+	return ret;
+}
+
+static void ksz9477_ptp_port_deinit(struct ksz_device *dev, int port)
+{
+	if (port != dev->cpu_port) {
+		ksz9477_ptp_disable_port_egress_interrupts(dev, port);
+		ksz9477_ptp_disable_port_ptp_interrupts(dev, port);
+	}
+}
+
+static int ksz9477_ptp_ports_init(struct ksz_device *dev)
+{
+	int port;
+	int ret;
+
+	for (port = 0; port < dev->port_cnt; port++) {
+		ret = ksz9477_ptp_port_init(dev, port);
+		if (ret)
+			goto error_deinit;
+	}
+
+	return 0;
+
+error_deinit:
+	for (--port; port >= 0; --port)
+		ksz9477_ptp_port_deinit(dev, port);
+	return ret;
+}
+
+static void ksz9477_ptp_ports_deinit(struct ksz_device *dev)
+{
+	int port;
+
+	for (port = dev->port_cnt - 1; port >= 0; --port)
+		ksz9477_ptp_port_deinit(dev, port);
+}
+
+/* device attributes */
+
+enum ksz9477_ptp_tcmode {
+	KSZ9477_PTP_TCMODE_E2E,
+	KSZ9477_PTP_TCMODE_P2P,
+};
+
+static int ksz9477_ptp_tcmode_get(struct ksz_device *dev, enum ksz9477_ptp_tcmode *tcmode)
+{
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, REG_PTP_MSG_CONF1, &data);
+	if (ret)
+		return ret;
+
+	*tcmode = (data & PTP_TC_P2P) ? KSZ9477_PTP_TCMODE_P2P : KSZ9477_PTP_TCMODE_E2E;
+
+	return 0;
+}
+
+static int ksz9477_ptp_tcmode_set(struct ksz_device *dev, enum ksz9477_ptp_tcmode tcmode)
+{
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, REG_PTP_MSG_CONF1, &data);
+	if (ret)
+		return ret;
+
+	if (tcmode == KSZ9477_PTP_TCMODE_P2P)
+		data |= PTP_TC_P2P;
+	else
+		data &= ~PTP_TC_P2P;
+
+	ret = ksz_write16(dev, REG_PTP_MSG_CONF1, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static ssize_t tcmode_show(struct device *dev, struct device_attribute *attr __always_unused,
+			   char *buf)
+{
+	struct ksz_device *ksz = dev_get_drvdata(dev);
+	enum ksz9477_ptp_tcmode tcmode;
+	int ret = ksz9477_ptp_tcmode_get(ksz, &tcmode);
+
+	if (ret)
+		return ret;
+
+	return sprintf(buf, "%s\n", tcmode == KSZ9477_PTP_TCMODE_P2P ? "P2P" : "E2E");
+}
+
+static ssize_t tcmode_store(struct device *dev, struct device_attribute *attr __always_unused,
+			    char const *buf, size_t count)
+{
+	struct ksz_device *ksz = dev_get_drvdata(dev);
+	int ret;
+
+	if (strcasecmp(buf, "E2E") == 0)
+		ret = ksz9477_ptp_tcmode_set(ksz, KSZ9477_PTP_TCMODE_E2E);
+	else if (strcasecmp(buf, "P2P") == 0)
+		ret = ksz9477_ptp_tcmode_set(ksz, KSZ9477_PTP_TCMODE_P2P);
+	else
+		return -EINVAL;
+
+	return ret ? ret : (ssize_t)count;
+}
+
+static DEVICE_ATTR_RW(tcmode);
+
+enum ksz9477_ptp_ocmode {
+	KSZ9477_PTP_OCMODE_SLAVE,
+	KSZ9477_PTP_OCMODE_MASTER,
+};
+
+static int ksz9477_ptp_ocmode_get(struct ksz_device *dev, enum ksz9477_ptp_ocmode *ocmode)
+{
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, REG_PTP_MSG_CONF1, &data);
+	if (ret)
+		return ret;
+
+	*ocmode = (data & PTP_MASTER) ? KSZ9477_PTP_OCMODE_MASTER : KSZ9477_PTP_OCMODE_SLAVE;
+
+	return 0;
+}
+
+static int ksz9477_ptp_ocmode_set(struct ksz_device *dev, enum ksz9477_ptp_ocmode ocmode)
+{
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, REG_PTP_MSG_CONF1, &data);
+	if (ret)
+		return ret;
+
+	if (ocmode == KSZ9477_PTP_OCMODE_MASTER)
+		data |= PTP_MASTER;
+	else
+		data &= ~PTP_MASTER;
+
+	ret = ksz_write16(dev, REG_PTP_MSG_CONF1, data);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static ssize_t ocmode_show(struct device *dev, struct device_attribute *attr __always_unused,
+			   char *buf)
+{
+	struct ksz_device *ksz = dev_get_drvdata(dev);
+	enum ksz9477_ptp_ocmode ocmode;
+	int ret = ksz9477_ptp_ocmode_get(ksz, &ocmode);
+
+	if (ret)
+		return ret;
+
+	return sprintf(buf, "%s\n", ocmode == KSZ9477_PTP_OCMODE_MASTER ? "master" : "slave");
+}
+
+static ssize_t ocmode_store(struct device *dev, struct device_attribute *attr __always_unused,
+			    char const *buf, size_t count)
+{
+	struct ksz_device *ksz = dev_get_drvdata(dev);
+	int ret;
+
+	if (strcasecmp(buf, "master") == 0)
+		ret = ksz9477_ptp_ocmode_set(ksz, KSZ9477_PTP_OCMODE_MASTER);
+	else if (strcasecmp(buf, "slave") == 0)
+		ret = ksz9477_ptp_ocmode_set(ksz, KSZ9477_PTP_OCMODE_SLAVE);
+	else
+		return -EINVAL;
+
+	return ret ? ret : (ssize_t)count;
+}
+
+static DEVICE_ATTR_RW(ocmode);
+
+static struct attribute *ksz9477_ptp_attrs[] = {
+	&dev_attr_tcmode.attr,
+	&dev_attr_ocmode.attr,
+	NULL,
+};
+
+static struct attribute_group ksz9477_ptp_attrgrp = {
+	.attrs = ksz9477_ptp_attrs,
+};
+
 int ksz9477_ptp_init(struct ksz_device *dev)
 {
 	int ret;
 
 	mutex_init(&dev->ptp_mutex);
+	spin_lock_init(&dev->ptp_clock_lock);
 
 	/* PTP clock properties */
 
@@ -275,6 +650,7 @@  int ksz9477_ptp_init(struct ksz_device *dev)
 	dev->ptp_caps.gettime64   = ksz9477_ptp_gettime;
 	dev->ptp_caps.settime64   = ksz9477_ptp_settime;
 	dev->ptp_caps.enable      = ksz9477_ptp_enable;
+	dev->ptp_caps.do_aux_work = ksz9477_ptp_do_aux_work;
 
 	/* Start hardware counter (will overflow after 136 years) */
 	ret = ksz9477_ptp_start_clock(dev);
@@ -287,8 +663,36 @@  int ksz9477_ptp_init(struct ksz_device *dev)
 		goto error_stop_clock;
 	}
 
+	/* Enable PTP mode (will affect tail tagging format) */
+	ret = ksz9477_ptp_enable_mode(dev);
+	if (ret)
+		goto error_unregister_clock;
+
+	/* Init switch ports */
+	ret = ksz9477_ptp_ports_init(dev);
+	if (ret)
+		goto error_disable_mode;
+
+	/* Init attributes */
+	ret = sysfs_create_group(&dev->dev->kobj, &ksz9477_ptp_attrgrp);
+	if (ret)
+		goto error_ports_deinit;
+
+	/* Schedule cyclic call of ksz_ptp_do_aux_work() */
+	ret = ptp_schedule_worker(dev->ptp_clock, 0);
+	if (ret)
+		goto error_device_remove_attrgrp;
+
 	return 0;
 
+error_device_remove_attrgrp:
+	sysfs_remove_group(&dev->dev->kobj, &ksz9477_ptp_attrgrp);
+error_ports_deinit:
+	ksz9477_ptp_ports_deinit(dev);
+error_disable_mode:
+	ksz9477_ptp_disable_mode(dev);
+error_unregister_clock:
+	ptp_clock_unregister(dev->ptp_clock);
 error_stop_clock:
 	ksz9477_ptp_stop_clock(dev);
 	return ret;
@@ -296,6 +700,231 @@  int ksz9477_ptp_init(struct ksz_device *dev)
 
 void ksz9477_ptp_deinit(struct ksz_device *dev)
 {
+	sysfs_remove_group(&dev->dev->kobj, &ksz9477_ptp_attrgrp);
+	ksz9477_ptp_ports_deinit(dev);
+	ksz9477_ptp_disable_mode(dev);
 	ptp_clock_unregister(dev->ptp_clock);
 	ksz9477_ptp_stop_clock(dev);
 }
+
+irqreturn_t ksz9477_ptp_port_interrupt(struct ksz_device *dev, int port)
+{
+	u32 addr = PORT_CTRL_ADDR(port, REG_PTP_PORT_TX_INT_STATUS__2);
+	struct ksz_port *prt = &dev->ports[port];
+	irqreturn_t result = IRQ_NONE;
+	u16 data;
+	int ret;
+
+	ret = ksz_read16(dev, addr, &data);
+	if (ret)
+		return IRQ_NONE;
+
+	if ((data & PTP_PORT_XDELAY_REQ_INT) && prt->tstamp_tx_xdelay_skb) {
+		/* Timestamp for Pdelay_Req / Delay_Req */
+		u32 tstamp_raw;
+		struct skb_shared_hwtstamps shhwtstamps;
+		struct sk_buff *tmp_skb;
+
+		/* In contrast to the KSZ9563R data sheet, the format of the
+		 * port time stamp registers is 2 bit seconds + 30 bit nano
+		 * seconds (same as in the tail tags).
+		 */
+		ret = ksz_read32(dev, PORT_CTRL_ADDR(port, REG_PTP_PORT_XDELAY_TS), &tstamp_raw);
+		if (ret)
+			return result;
+
+		memset(&shhwtstamps, 0, sizeof(shhwtstamps));
+		shhwtstamps.hwtstamp = ksz9477_tstamp_to_clock(dev, tstamp_raw,
+							       prt->tstamp_tx_latency_ns);
+
+		/* skb_complete_tx_timestamp() will free up the client to make
+		 * another timestamp-able transmit. We have to be ready for it
+		 * -- by clearing the ps->tx_skb "flag" -- beforehand.
+		 */
+
+		tmp_skb = prt->tstamp_tx_xdelay_skb;
+		prt->tstamp_tx_xdelay_skb = NULL;
+		clear_bit_unlock(KSZ_HWTSTAMP_TX_XDELAY_IN_PROGRESS, &prt->tstamp_state);
+		skb_complete_tx_timestamp(tmp_skb, &shhwtstamps);
+	}
+
+	/* Clear interrupt(s) (W1C) */
+	ret = ksz_write16(dev, addr, data);
+	if (ret)
+		return IRQ_NONE;
+
+	return IRQ_HANDLED;
+}
+
+/* DSA PTP operations */
+
+int ksz9477_ptp_get_ts_info(struct dsa_switch *ds, int port __always_unused,
+			    struct ethtool_ts_info *ts)
+{
+	struct ksz_device *dev = ds->priv;
+
+	ts->so_timestamping =
+			SOF_TIMESTAMPING_TX_HARDWARE |
+			SOF_TIMESTAMPING_RX_HARDWARE |
+			SOF_TIMESTAMPING_RAW_HARDWARE;
+
+	ts->phc_index = ptp_clock_index(dev->ptp_clock);
+
+	ts->tx_types =
+			BIT(HWTSTAMP_TX_OFF) |
+			BIT(HWTSTAMP_TX_ONESTEP_P2P);
+
+	ts->rx_filters =
+			BIT(HWTSTAMP_FILTER_NONE) |
+			BIT(HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
+			BIT(HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
+			BIT(HWTSTAMP_FILTER_PTP_V2_EVENT);
+
+	return 0;
+}
+
+static int ksz9477_set_hwtstamp_config(struct ksz_device *dev, int port,
+				       struct hwtstamp_config *config)
+{
+	struct ksz_port *prt = &dev->ports[port];
+	bool tstamp_enable = false;
+
+	/* Prevent the TX/RX paths from trying to interact with the
+	 * timestamp hardware while we reconfigure it.
+	 */
+	clear_bit_unlock(KSZ_HWTSTAMP_ENABLED, &prt->tstamp_state);
+
+	/* reserved for future extensions */
+	if (config->flags)
+		return -EINVAL;
+
+	switch (config->tx_type) {
+	case HWTSTAMP_TX_OFF:
+		tstamp_enable = false;
+		break;
+	case HWTSTAMP_TX_ONESTEP_P2P:
+		tstamp_enable = true;
+		break;
+	default:
+		return -ERANGE;
+	}
+
+	switch (config->rx_filter) {
+	case HWTSTAMP_FILTER_NONE:
+		tstamp_enable = false;
+		break;
+	case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+		config->rx_filter = HWTSTAMP_FILTER_PTP_V2_L4_EVENT;  /* UDPv4/UDPv6 */
+		break;
+	case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+		config->rx_filter = HWTSTAMP_FILTER_PTP_V2_L2_EVENT;  /* 802.3 ether */
+		break;
+	case HWTSTAMP_FILTER_PTP_V2_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+		config->rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;     /* UDP / 802.3 */
+		break;
+	case HWTSTAMP_FILTER_ALL:
+	default:
+		config->rx_filter = HWTSTAMP_FILTER_NONE;
+		return -ERANGE;
+	}
+
+	/* Once hardware has been configured, enable timestamp checks
+	 * in the RX/TX paths.
+	 */
+	if (tstamp_enable)
+		set_bit(KSZ_HWTSTAMP_ENABLED, &prt->tstamp_state);
+
+	return 0;
+}
+
+int ksz9477_ptp_port_hwtstamp_get(struct dsa_switch *ds, int port, struct ifreq *ifr)
+{
+	struct ksz_device *dev = ds->priv;
+	struct hwtstamp_config *port_tstamp_config = &dev->ports[port].tstamp_config;
+
+	return copy_to_user(ifr->ifr_data,
+			    port_tstamp_config, sizeof(*port_tstamp_config)) ? -EFAULT : 0;
+}
+
+int ksz9477_ptp_port_hwtstamp_set(struct dsa_switch *ds, int port, struct ifreq *ifr)
+{
+	struct ksz_device *dev = ds->priv;
+	struct hwtstamp_config *port_tstamp_config = &dev->ports[port].tstamp_config;
+	struct hwtstamp_config config;
+	int err;
+
+	if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+		return -EFAULT;
+
+	err = ksz9477_set_hwtstamp_config(dev, port, &config);
+	if (err)
+		return err;
+
+	/* Save the chosen configuration to be returned later. */
+	memcpy(port_tstamp_config, &config, sizeof(config));
+
+	return copy_to_user(ifr->ifr_data, &config, sizeof(config)) ? -EFAULT : 0;
+}
+
+/* Returns a pointer to the PTP header if the caller should time stamp,
+ * or NULL if the caller should not.
+ */
+static struct ptp_header *ksz9477_ptp_should_tstamp(struct ksz_port *port, struct sk_buff *skb,
+						    unsigned int type)
+{
+	if (!test_bit(KSZ_HWTSTAMP_ENABLED, &port->tstamp_state))
+		return NULL;
+
+	return ptp_parse_header(skb, type);
+}
+
+bool ksz9477_ptp_port_txtstamp(struct dsa_switch *ds, int port, struct sk_buff *clone,
+			       unsigned int type)
+{
+	struct ksz_device *dev = ds->priv;
+	struct ksz_port *prt = &dev->ports[port];
+	enum ksz9477_ptp_event_messages msg_type;
+	struct ptp_header *hdr;
+
+	/* KSZ9563 supports PTPv2 only */
+	if (!(type & PTP_CLASS_V2))
+		return false;
+
+	/* Should already been tested in dsa_skb_tx_timestamp()? */
+	if (!(skb_shinfo(clone)->tx_flags & SKBTX_HW_TSTAMP))
+		return false;
+
+	hdr = ksz9477_ptp_should_tstamp(prt, clone, type);
+	if (!hdr)
+		return false;
+
+	msg_type = ptp_get_msgtype(hdr, type);
+
+	switch (msg_type) {
+	/* As the KSZ9563 always performs one step time stamping, only the time
+	 * stamp for Delay_Req and Pdelay_Req are reported to the application
+	 * via socket error queue. Time stamps for Sync and Pdelay_resp will be
+	 * applied directly to the outgoing message (e.g. correction field), but
+	 * will NOT be reported to the socket.
+	 */
+	case PTP_Event_Message_Delay_Req:
+	case PTP_Event_Message_Pdelay_Req:
+		if (test_and_set_bit_lock(KSZ_HWTSTAMP_TX_XDELAY_IN_PROGRESS,
+					  &prt->tstamp_state))
+			return false;  /* free cloned skb */
+
+		prt->tstamp_tx_xdelay_skb = clone;
+		break;
+
+	default:
+		return false;  /* free cloned skb */
+	}
+
+	return true;  /* keep cloned skb */
+}
diff --git a/drivers/net/dsa/microchip/ksz9477_ptp.h b/drivers/net/dsa/microchip/ksz9477_ptp.h
index 0076538419fa..e8d50a086311 100644
--- a/drivers/net/dsa/microchip/ksz9477_ptp.h
+++ b/drivers/net/dsa/microchip/ksz9477_ptp.h
@@ -10,6 +10,9 @@ 
 #ifndef DRIVERS_NET_DSA_MICROCHIP_KSZ9477_PTP_H_
 #define DRIVERS_NET_DSA_MICROCHIP_KSZ9477_PTP_H_
 
+#include <linux/irqreturn.h>
+#include <linux/types.h>
+
 #include "ksz_common.h"
 
 #if IS_ENABLED(CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP)
@@ -17,11 +20,35 @@ 
 int ksz9477_ptp_init(struct ksz_device *dev);
 void ksz9477_ptp_deinit(struct ksz_device *dev);
 
+irqreturn_t ksz9477_ptp_port_interrupt(struct ksz_device *dev, int port);
+
+int ksz9477_ptp_get_ts_info(struct dsa_switch *ds, int port, struct ethtool_ts_info *ts);
+int ksz9477_ptp_port_hwtstamp_get(struct dsa_switch *ds, int port, struct ifreq *ifr);
+int ksz9477_ptp_port_hwtstamp_set(struct dsa_switch *ds, int port, struct ifreq *ifr);
+bool ksz9477_ptp_port_txtstamp(struct dsa_switch *ds, int port, struct sk_buff *clone,
+			       unsigned int type);
+
 #else
 
 static inline int ksz9477_ptp_init(struct ksz_device *dev) { return 0; }
 static inline void ksz9477_ptp_deinit(struct ksz_device *dev) {}
 
+static inline irqreturn_t ksz9477_ptp_port_interrupt(struct ksz_device *dev, int port)
+	{ return IRQ_NONE; }
+
+static inline int ksz9477_ptp_get_ts_info(struct dsa_switch *ds, int port,
+					  struct ethtool_ts_info *ts) { return -EOPNOTSUPP; }
+
+static inline int ksz9477_ptp_port_hwtstamp_get(struct dsa_switch *ds, int port,
+						struct ifreq *ifr) { return -EOPNOTSUPP; }
+
+static inline int ksz9477_ptp_port_hwtstamp_set(struct dsa_switch *ds, int port, struct ifreq *ifr)
+	{ return -EOPNOTSUPP; }
+
+static inline bool ksz9477_ptp_port_txtstamp(struct dsa_switch *ds, int port,
+					     struct sk_buff *clone, unsigned int type)
+	{ return false; }
+
 #endif
 
 #endif /* DRIVERS_NET_DSA_MICROCHIP_KSZ9477_PTP_H_ */
diff --git a/include/linux/dsa/ksz_common.h b/include/linux/dsa/ksz_common.h
index 4d5b6cc9429a..10b42154ba5a 100644
--- a/include/linux/dsa/ksz_common.h
+++ b/include/linux/dsa/ksz_common.h
@@ -8,8 +8,10 @@ 
 #include <linux/gpio/consumer.h>
 #include <linux/kernel.h>
 #include <linux/mutex.h>
+#include <linux/net_tstamp.h>
 #include <linux/phy.h>
 #include <linux/ptp_clock_kernel.h>
+#include <linux/spinlock.h>
 #include <linux/regmap.h>
 #include <linux/timer.h>
 #include <linux/workqueue.h>
@@ -25,6 +27,12 @@  struct ksz_port_mib {
 	u64 *counters;
 };
 
+/* state flags for ksz_port::tstamp_state */
+enum {
+	KSZ_HWTSTAMP_ENABLED,
+	KSZ_HWTSTAMP_TX_XDELAY_IN_PROGRESS,
+};
+
 struct ksz_port {
 	u16 member;
 	u16 vid_member;
@@ -41,6 +49,13 @@  struct ksz_port {
 
 	struct ksz_port_mib mib;
 	phy_interface_t interface;
+#if IS_ENABLED(CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP)
+	u16 tstamp_rx_latency_ns;   /* rx delay from wire to tstamp unit */
+	u16 tstamp_tx_latency_ns;   /* tx delay from tstamp unit to wire */
+	struct hwtstamp_config tstamp_config;
+	struct sk_buff *tstamp_tx_xdelay_skb;
+	unsigned long tstamp_state;
+#endif
 };
 
 struct ksz_device {
@@ -98,7 +113,22 @@  struct ksz_device {
 	struct ptp_clock *ptp_clock;
 	struct ptp_clock_info ptp_caps;
 	struct mutex ptp_mutex;
+	spinlock_t ptp_clock_lock; /* for ptp_clock_time */
+	/* approximated current time, read once per second from hardware */
+	struct timespec64 ptp_clock_time;
 #endif
 };
 
+/* KSZ9563 will only timestamp event messages */
+enum ksz9477_ptp_event_messages {
+	PTP_Event_Message_Sync        = 0x0,
+	PTP_Event_Message_Delay_Req   = 0x1,
+	PTP_Event_Message_Pdelay_Req  = 0x2,
+	PTP_Event_Message_Pdelay_Resp = 0x3,
+};
+
+/* net/dsa/tag_ksz.c */
+ktime_t ksz9477_tstamp_to_clock(struct ksz_device *ksz, u32 tstamp,
+		int offset_ns);
+
 #endif /* _NET_DSA_KSZ_COMMON_H_ */
diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
index 4820dbcedfa2..9dbe2fde3db0 100644
--- a/net/dsa/tag_ksz.c
+++ b/net/dsa/tag_ksz.c
@@ -4,10 +4,14 @@ 
  * Copyright (c) 2017 Microchip Technology
  */
 
+#include <asm/unaligned.h>
+#include <linux/dsa/ksz_common.h>
 #include <linux/etherdevice.h>
 #include <linux/list.h>
+#include <linux/ptp_classify.h>
 #include <linux/slab.h>
 #include <net/dsa.h>
+#include <net/checksum.h>
 #include "dsa_priv.h"
 
 /* Typically only one byte is used for tail tag. */
@@ -87,26 +91,210 @@  MODULE_ALIAS_DSA_TAG_DRIVER(DSA_TAG_PROTO_KSZ8795);
 /*
  * For Ingress (Host -> KSZ9477), 2 bytes are added before FCS.
  * ---------------------------------------------------------------------------
- * DA(6bytes)|SA(6bytes)|....|Data(nbytes)|tag0(1byte)|tag1(1byte)|FCS(4bytes)
+ * DA(6bytes)|SA(6bytes)|....|Data(nbytes)|ts(4bytes)|tag0(1byte)|tag1(1byte)|FCS(4bytes)
  * ---------------------------------------------------------------------------
+ * ts   : time stamp (only present if PTP is enabled on the hardware).
  * tag0 : Prioritization (not used now)
  * tag1 : each bit represents port (eg, 0x01=port1, 0x02=port2, 0x10=port5)
  *
- * For Egress (KSZ9477 -> Host), 1 byte is added before FCS.
+ * For Egress (KSZ9477 -> Host), 1/4 bytes are added before FCS.
  * ---------------------------------------------------------------------------
- * DA(6bytes)|SA(6bytes)|....|Data(nbytes)|tag0(1byte)|FCS(4bytes)
+ * DA(6bytes)|SA(6bytes)|....|Data(nbytes)|ts(4bytes)|tag0(1byte)|FCS(4bytes)
  * ---------------------------------------------------------------------------
+ * ts   : time stamp (only present of bit 7 in tag0 is set
  * tag0 : zero-based value represents port
  *	  (eg, 0x00=port1, 0x02=port3, 0x06=port7)
  */
 
 #define KSZ9477_INGRESS_TAG_LEN		2
 #define KSZ9477_PTP_TAG_LEN		4
-#define KSZ9477_PTP_TAG_INDICATION	0x80
+#define KSZ9477_PTP_TAG_INDICATION	BIT(7)
 
 #define KSZ9477_TAIL_TAG_OVERRIDE	BIT(9)
 #define KSZ9477_TAIL_TAG_LOOKUP		BIT(10)
 
+#if IS_ENABLED(CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP)
+static inline __wsum ksz9477_check_diff8(__be64 old, __be64 new, __wsum oldsum)
+{
+	__be64 diff[2] = { ~old, new };
+
+	return csum_partial(diff, sizeof(diff), oldsum);
+}
+
+/* Replace content of PTP header's correction field and update UDP checksum */
+static void ksz9477_update_ptp_correction_field(struct sk_buff *skb, unsigned int type,
+						struct ptp_header *ptp_header, u64 value)
+{
+	u8 *ptr = skb_mac_header(skb);
+	struct udphdr *uhdr = NULL;
+	__be64 correction_old;
+
+	/* previous correction value is required for checksum update. */
+	memcpy(&correction_old,  &ptp_header->correction, sizeof(correction_old));
+
+	/* write new correction value */
+	put_unaligned_be64(value, &ptp_header->correction);
+
+	/* locate udp header */
+	if (type & PTP_CLASS_VLAN)
+		ptr += VLAN_HLEN;
+
+	ptr += ETH_HLEN;
+
+	switch (type & PTP_CLASS_PMASK) {
+	case PTP_CLASS_IPV4:
+		ptr += ((struct iphdr *)ptr)->ihl << 2;
+		uhdr = (struct udphdr *)ptr;
+		break;
+	case PTP_CLASS_IPV6:
+		ptr += IP6_HLEN;
+		uhdr = (struct udphdr *)ptr;
+		break;
+	}
+
+	if (uhdr) {
+		/* update checksum */
+		uhdr->check = csum_fold(ksz9477_check_diff8(correction_old, ptp_header->correction,
+							    ~csum_unfold(uhdr->check)));
+		if (!uhdr->check)
+			uhdr->check = CSUM_MANGLED_0;
+	}
+}
+
+/* Time stamp tag is only inserted if PTP is enabled in hardware. */
+static void ksz9477_xmit_timestamp(struct sk_buff *skb)
+{
+	enum ksz9477_ptp_event_messages msg_type;
+	struct ptp_header *ptp_hdr;
+	unsigned int ptp_type;
+	u32 tstamp_raw = 0;
+	u64 correction;
+
+	/* For PDelay_Resp messages we will likely have a negative value in the
+	 * correction field (see ksz9477_rcv()). The switch hardware cannot
+	 * correctly update such values, so it must be moved to the time stamp
+	 * field in the tail tag.
+	 */
+
+	/* Check PTP message type */
+	ptp_type = ptp_classify_raw(skb);
+
+	if (ptp_type == PTP_CLASS_NONE)
+		goto out_put_tag;
+
+	ptp_hdr = ptp_parse_header(skb, ptp_type);
+	if (!ptp_hdr)
+		goto out_put_tag;
+
+	msg_type = ptp_get_msgtype(ptp_hdr, ptp_type);
+	if (msg_type != PTP_Event_Message_Pdelay_Resp)
+		goto out_put_tag;
+
+	correction = get_unaligned_be64(&ptp_hdr->correction);
+	if ((s64)correction < 0) {
+		struct timespec64 ts;
+
+		/* Move ingress time stamp from PTP header's correction field to tail tag. */
+		ts = ns_to_timespec64(-((s64)correction));
+		tstamp_raw = ((ts.tv_sec & 3) << 30) | ts.tv_nsec;
+		correction = 0;
+		ksz9477_update_ptp_correction_field(skb, ptp_type, ptp_hdr, correction);
+	}
+
+out_put_tag:
+	put_unaligned_be32(tstamp_raw, skb_put(skb, KSZ9477_PTP_TAG_LEN));
+}
+
+ktime_t ksz9477_tstamp_to_clock(struct ksz_device *ksz, u32 tstamp, int offset_ns)
+{
+	struct timespec64 ptp_clock_time;
+	/* Split up time stamp, 2 bit seconds, 30 bit nano seconds */
+	struct timespec64 ts = {
+		.tv_sec  = tstamp >> 30,
+		.tv_nsec = tstamp & (BIT_MASK(30) - 1),
+	};
+	struct timespec64 diff;
+	unsigned long flags;
+	s64 ns;
+
+	spin_lock_irqsave(&ksz->ptp_clock_lock, flags);
+	ptp_clock_time = ksz->ptp_clock_time;
+	spin_unlock_irqrestore(&ksz->ptp_clock_lock, flags);
+
+	/* calculate full time from partial time stamp */
+	ts.tv_sec = (ptp_clock_time.tv_sec & ~3) | ts.tv_sec;
+
+	/* find nearest possible point in time */
+	diff = timespec64_sub(ts, ptp_clock_time);
+	if (diff.tv_sec > 2)
+		ts.tv_sec -= 4;
+	else if (diff.tv_sec < -2)
+		ts.tv_sec += 4;
+
+	/* Add/remove excess delay between wire and time stamp unit */
+	ns = timespec64_to_ns(&ts) + offset_ns;
+
+	return ns_to_ktime(ns);
+}
+EXPORT_SYMBOL(ksz9477_tstamp_to_clock);
+
+static void ksz9477_rcv_timestamp(struct sk_buff *skb, u8 *tag,
+				  struct net_device *dev, unsigned int port)
+{
+	struct skb_shared_hwtstamps *hwtstamps = skb_hwtstamps(skb);
+	enum ksz9477_ptp_event_messages msg_type;
+	struct dsa_switch *ds = dev->dsa_ptr->ds;
+	struct ksz_device *ksz = ds->priv;
+	struct ksz_port *prt = &ksz->ports[port];
+	u8 *tstamp = tag - KSZ9477_PTP_TAG_LEN;
+	struct ptp_header *ptp_hdr;
+	unsigned int ptp_type;
+	u64 correction;
+
+	/* convert time stamp and write to skb */
+	memset(hwtstamps, 0, sizeof(*hwtstamps));
+	hwtstamps->hwtstamp = ksz9477_tstamp_to_clock(ksz, get_unaligned_be32(tstamp),
+						      -prt->tstamp_rx_latency_ns);
+
+	/* For PDelay_Req messages, user space (ptp4l) expects that the hardware
+	 * subtracts the egress time stamp from the correction field. The
+	 * separate hw time stamp from the sk_buff struct will not be used in
+	 * this case.
+	 */
+	if (skb_headroom(skb) < ETH_HLEN)
+		return;
+
+	__skb_push(skb, ETH_HLEN);
+	ptp_type = ptp_classify_raw(skb);
+	__skb_pull(skb, ETH_HLEN);
+
+	if (ptp_type == PTP_CLASS_NONE)
+		return;
+
+	ptp_hdr = ptp_parse_header(skb, ptp_type);
+	if (!ptp_hdr)
+		return;
+
+	msg_type = ptp_get_msgtype(ptp_hdr, ptp_type);
+	if (msg_type != PTP_Event_Message_Pdelay_Req)
+		return;
+
+	correction = get_unaligned_be64(&ptp_hdr->correction);
+	correction -= hwtstamps->hwtstamp;
+	ksz9477_update_ptp_correction_field(skb, ptp_type, ptp_hdr, correction);
+}
+#else   /* IS_ENABLED(CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP) */
+static void ksz9477_xmit_timestamp(struct sk_buff *skb __maybe_unused)
+{
+}
+
+static void ksz9477_rcv_timestamp(struct sk_buff *skb __maybe_unused, u8 *tag __maybe_unused,
+				  struct net_device *dev __maybe_unused,
+				  unsigned int port __maybe_unused)
+{
+}
+#endif  /* IS_ENABLED(CONFIG_NET_DSA_MICROCHIP_KSZ9477_PTP) */
+
 static struct sk_buff *ksz9477_xmit(struct sk_buff *skb,
 				    struct net_device *dev)
 {
@@ -116,6 +304,7 @@  static struct sk_buff *ksz9477_xmit(struct sk_buff *skb,
 	u16 val;
 
 	/* Tag encoding */
+	ksz9477_xmit_timestamp(skb);
 	tag = skb_put(skb, KSZ9477_INGRESS_TAG_LEN);
 	addr = skb_mac_header(skb);
 
@@ -138,8 +327,10 @@  static struct sk_buff *ksz9477_rcv(struct sk_buff *skb, struct net_device *dev,
 	unsigned int len = KSZ_EGRESS_TAG_LEN;
 
 	/* Extra 4-bytes PTP timestamp */
-	if (tag[0] & KSZ9477_PTP_TAG_INDICATION)
+	if (tag[0] & KSZ9477_PTP_TAG_INDICATION) {
+		ksz9477_rcv_timestamp(skb, tag, dev, port);
 		len += KSZ9477_PTP_TAG_LEN;
+	}
 
 	return ksz_common_rcv(skb, dev, port, len);
 }
@@ -149,7 +340,7 @@  static const struct dsa_device_ops ksz9477_netdev_ops = {
 	.proto	= DSA_TAG_PROTO_KSZ9477,
 	.xmit	= ksz9477_xmit,
 	.rcv	= ksz9477_rcv,
-	.overhead = KSZ9477_INGRESS_TAG_LEN,
+	.overhead = KSZ9477_INGRESS_TAG_LEN + KSZ9477_PTP_TAG_LEN,
 	.tail_tag = true,
 };
 
@@ -167,6 +358,7 @@  static struct sk_buff *ksz9893_xmit(struct sk_buff *skb,
 	u8 *tag;
 
 	/* Tag encoding */
+	ksz9477_xmit_timestamp(skb);
 	tag = skb_put(skb, KSZ_INGRESS_TAG_LEN);
 	addr = skb_mac_header(skb);
 
@@ -183,7 +375,7 @@  static const struct dsa_device_ops ksz9893_netdev_ops = {
 	.proto	= DSA_TAG_PROTO_KSZ9893,
 	.xmit	= ksz9893_xmit,
 	.rcv	= ksz9477_rcv,
-	.overhead = KSZ_INGRESS_TAG_LEN,
+	.overhead = KSZ_INGRESS_TAG_LEN + KSZ9477_PTP_TAG_LEN,
 	.tail_tag = true,
 };