diff mbox series

[v2,net-next] selftests: forwarding: add Per-Stream Filtering and Policing test for Ocelot

Message ID 20220501112953.3298973-1-vladimir.oltean@nxp.com (mailing list archive)
State Accepted
Commit 954f46d2f0b4a76405a5e0788e3f80a24c657fd4
Headers show
Series [v2,net-next] selftests: forwarding: add Per-Stream Filtering and Policing test for Ocelot | expand

Commit Message

Vladimir Oltean May 1, 2022, 11:29 a.m. UTC
The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action
which enforced time-based access control per stream. A stream as seen by
this switch is identified by {MAC DA, VID}.

We use the standard forwarding selftest topology with 2 host interfaces
and 2 switch interfaces. The host ports must require timestamping non-IP
packets and supporting tc-etf offload, for isochron to work. The
isochron program monitors network sync status (ptp4l, phc2sys) and
deterministically transmits packets to the switch such that the tc-gate
action either (a) always accepts them based on its schedule, or
(b) always drops them.

I tried to keep as much of the logic that isn't specific to the NXP
LS1028A in a new tsn_lib.sh, for future reuse. This covers
synchronization using ptp4l and phc2sys, and isochron.

The cycle-time chosen for this selftest isn't particularly impressive
(and the focus is the functionality of the switch), but I didn't really
know what to do better, considering that it will mostly be run during
debugging sessions, various kernel bloatware would be enabled, like
lockdep, KASAN, etc, and we certainly can't run any races with those on.

I tried to look through the kselftest framework for other real time
applications and didn't really find any, so I'm not sure how better to
prepare the environment in case we want to go for a lower cycle time.
At the moment, the only thing the selftest is ensuring is that dynamic
frequency scaling is disabled on the CPU that isochron runs on. It would
probably be useful to have a blacklist of kernel config options (checked
through zcat /proc/config.gz) and some cyclictest scripts to run
beforehand, but I saw none of those.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
v1->v2:
- fix an off-by-one bug introduced at the last minute regarding which
  tc-mqprio queue was used for tc-etf and SO_TXTIME
- introduce debugging for packets incorrectly received / incorrectly
  dropped based on "isochron report"
- make the tsn_lib.sh dependency on isochron and linuxptp optional via
  REQUIRE_ISOCHRON and REQUIRE_LINUXPTP
- avoid errors when CONFIG_CPU_FREQ is disabled
- consistently use SCHED_FIFO instead of SCHED_RR for the isochron
  receiver

 .../selftests/drivers/net/ocelot/psfp.sh      | 327 ++++++++++++++++++
 .../selftests/net/forwarding/tsn_lib.sh       | 235 +++++++++++++
 2 files changed, 562 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/ocelot/psfp.sh
 create mode 100644 tools/testing/selftests/net/forwarding/tsn_lib.sh

Comments

Kurt Kanzenbach May 1, 2022, 11:56 a.m. UTC | #1
On Sun May 01 2022, Vladimir Oltean wrote:
> The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action
> which enforced time-based access control per stream. A stream as seen by
> this switch is identified by {MAC DA, VID}.
>
> We use the standard forwarding selftest topology with 2 host interfaces
> and 2 switch interfaces. The host ports must require timestamping non-IP
> packets and supporting tc-etf offload, for isochron to work. The
> isochron program monitors network sync status (ptp4l, phc2sys) and
> deterministically transmits packets to the switch such that the tc-gate
> action either (a) always accepts them based on its schedule, or
> (b) always drops them.
>
> I tried to keep as much of the logic that isn't specific to the NXP
> LS1028A in a new tsn_lib.sh, for future reuse. This covers
> synchronization using ptp4l and phc2sys, and isochron.
>
> The cycle-time chosen for this selftest isn't particularly impressive
> (and the focus is the functionality of the switch), but I didn't really
> know what to do better, considering that it will mostly be run during
> debugging sessions, various kernel bloatware would be enabled, like
> lockdep, KASAN, etc, and we certainly can't run any races with those on.
>
> I tried to look through the kselftest framework for other real time
> applications and didn't really find any, so I'm not sure how better to
> prepare the environment in case we want to go for a lower cycle time.
> At the moment, the only thing the selftest is ensuring is that dynamic
> frequency scaling is disabled on the CPU that isochron runs on. It would
> probably be useful to have a blacklist of kernel config options (checked
> through zcat /proc/config.gz) and some cyclictest scripts to run
> beforehand, but I saw none of those.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
patchwork-bot+netdevbpf@kernel.org May 2, 2022, 11:10 p.m. UTC | #2
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Sun,  1 May 2022 14:29:53 +0300 you wrote:
> The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action
> which enforced time-based access control per stream. A stream as seen by
> this switch is identified by {MAC DA, VID}.
> 
> We use the standard forwarding selftest topology with 2 host interfaces
> and 2 switch interfaces. The host ports must require timestamping non-IP
> packets and supporting tc-etf offload, for isochron to work. The
> isochron program monitors network sync status (ptp4l, phc2sys) and
> deterministically transmits packets to the switch such that the tc-gate
> action either (a) always accepts them based on its schedule, or
> (b) always drops them.
> 
> [...]

Here is the summary with links:
  - [v2,net-next] selftests: forwarding: add Per-Stream Filtering and Policing test for Ocelot
    https://git.kernel.org/netdev/net-next/c/954f46d2f0b4

You are awesome, thank you!
Ferenc Fejes May 6, 2022, 7:49 a.m. UTC | #3
Hi!

On 2022. 05. 01. 13:29, Vladimir Oltean wrote:
> The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action
> which enforced time-based access control per stream. A stream as seen by
> this switch is identified by {MAC DA, VID}.
>
> We use the standard forwarding selftest topology with 2 host interfaces
> and 2 switch interfaces. The host ports must require timestamping non-IP
> packets and supporting tc-etf offload, for isochron to work. The
> isochron program monitors network sync status (ptp4l, phc2sys) and
> deterministically transmits packets to the switch such that the tc-gate
> action either (a) always accepts them based on its schedule, or
> (b) always drops them.
>
> I tried to keep as much of the logic that isn't specific to the NXP
> LS1028A in a new tsn_lib.sh, for future reuse. This covers
> synchronization using ptp4l and phc2sys, and isochron.
>
> The cycle-time chosen for this selftest isn't particularly impressive
> (and the focus is the functionality of the switch), but I didn't really
> know what to do better, considering that it will mostly be run during
> debugging sessions, various kernel bloatware would be enabled, like
> lockdep, KASAN, etc, and we certainly can't run any races with those on.
>
> I tried to look through the kselftest framework for other real time
> applications and didn't really find any, so I'm not sure how better to
> prepare the environment in case we want to go for a lower cycle time.
> At the moment, the only thing the selftest is ensuring is that dynamic
> frequency scaling is disabled on the CPU that isochron runs on. It would
> probably be useful to have a blacklist of kernel config options (checked
> through zcat /proc/config.gz) and some cyclictest scripts to run
> beforehand, but I saw none of those.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>
> +switch_create()
> +{
> +	local h2_mac_addr=$(mac_get $h2)
> +
> +	ip link set ${swp1} up
> +	ip link set ${swp2} up
> +
> +	ip link add br0 type bridge vlan_filtering 1
> +	ip link set ${swp1} master br0
> +	ip link set ${swp2} master br0
> +	ip link set br0 up
> +
> +	bridge vlan add dev ${swp2} vid ${STREAM_VID}
> +	bridge vlan add dev ${swp1} vid ${STREAM_VID}
> +	# PSFP on Ocelot requires the filter to also be added to the bridge
> +	# FDB, and not be removed
> +	bridge fdb add dev ${swp2} \
> +		${h2_mac_addr} vlan ${STREAM_VID} static master
> +
> +	psfp_chain_create ${swp1}
> +
> +	tc filter add dev ${swp1} ingress chain $(PSFP) pref 1 \
> +		protocol 802.1Q flower skip_sw \
> +		dst_mac ${h2_mac_addr} vlan_id ${STREAM_VID} \
> +		action gate base-time 0.000000000 \
> +		sched-entry OPEN  ${GATE_DURATION_NS} -1 -1 \
> +		sched-entry CLOSE ${GATE_DURATION_NS} -1 -1

I know that might be little bit off-topic here, but the current 
implementation of the act_gate does nothing with the IPV value [0] even 
if the user set it to non -1.
IMO this IPV value should be carried through in the tcf_gate struct [1] 
as something like a "current_ipv" member or so. Then this value can be 
applied in the tcf_gate_act function to the skb->priority.

Background story: I tried to combine gate and taprio (802.1Qci and Qbv) 
to achieve 802.1Qch operation (which is really just a coordinated config 
of those two) but without the IPV (should by set by the ingress port) we 
have no way to carry the gating info to the taprio, and as a result its 
just sending every packet with the default priority, no matter how we 
open/close the gate at the ingress.

[0] 
https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate.h#L21
[1] 
https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate.h#L40
[2] 
https://elixir.bootlin.com/linux/v5.18-rc5/source/net/sched/act_gate.c#L117

> +}
Ferenc
Vladimir Oltean May 6, 2022, 12:01 p.m. UTC | #4
Hi Ferenc,

On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> Hi!
> I know that might be little bit off-topic here, but the current
> implementation of the act_gate does nothing with the IPV value [0] even
> if the user set it to non -1.
> IMO this IPV value should be carried through in the tcf_gate struct [1]
> as something like a "current_ipv" member or so. Then this value can be
> applied in the tcf_gate_act function to the skb->priority.
>
> Background story: I tried to combine gate and taprio (802.1Qci and Qbv)
> to achieve 802.1Qch operation (which is really just a coordinated config
> of those two) but without the IPV (should by set by the ingress port) we
> have no way to carry the gating info to the taprio, and as a result its
> just sending every packet with the default priority, no matter how we
> open/close the gate at the ingress.
>
> [0]
> https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate.h#L21
> [1]
> https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate.h#L40
> [2]
> https://elixir.bootlin.com/linux/v5.18-rc5/source/net/sched/act_gate.c#L117

This is correct. I have been testing only with the offloaded tc-gate
action so I did not notice that the software does not act upon the ipv.
Your proposal sounds straightforward enough. Care to send a bug fix patch?
Ferenc Fejes May 6, 2022, 12:24 p.m. UTC | #5
Hi Vladimir!

On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
> Hi Ferenc,
>
> On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> This is correct. I have been testing only with the offloaded tc-gate
> action so I did not notice that the software does not act upon the ipv.
> Your proposal sounds straightforward enough. Care to send a bug fix patch?

Unfortunately I cant, our company policy does not allow direct 
open-source contributions :-(

However I would be more than happy if you can fix it.

Thanks,
Ferenc
Vladimir Oltean May 26, 2022, 12:50 a.m. UTC | #6
On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
> Hi Vladimir!
> 
> On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
> > Hi Ferenc,
> >
> > On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> > This is correct. I have been testing only with the offloaded tc-gate
> > action so I did not notice that the software does not act upon the ipv.
> > Your proposal sounds straightforward enough. Care to send a bug fix patch?
> 
> Unfortunately I cant, our company policy does not allow direct 
> open-source contributions :-(
> 
> However I would be more than happy if you can fix it.

That's too bad.

I have a patch which I am still testing, but you've managed to captivate
my attention for saying that you are testing 802.1Qch with a software
implementation of tc-gate.

Do you have a use case for this? What cycle times are you targeting?
How are you ensuring that you are deterministically meeting the deadlines?
Do you also have a software tc-taprio on the egress device or is that
at least offloaded?

I'm asking these questions because the peristaltic shaper is primarily
intended to be used on hardware switches. The patch I'm preparing
includes changes to struct sk_buff. I just want to know how well I'll be
able to sell these changes to maintainers strongly opposing the growth
of this structure for an exceedingly small niche :)
Ferenc Fejes May 26, 2022, 6:40 a.m. UTC | #7
Hi Vladimir!

On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
> On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
> > Hi Vladimir!
> > 
> > On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
> > > Hi Ferenc,
> > > 
> > > On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> > > This is correct. I have been testing only with the offloaded tc-
> > > gate
> > > action so I did not notice that the software does not act upon
> > > the ipv.
> > > Your proposal sounds straightforward enough. Care to send a bug
> > > fix patch?
> > 
> > Unfortunately I cant, our company policy does not allow direct 
> > open-source contributions :-(
> > 
> > However I would be more than happy if you can fix it.
> 
> That's too bad.
> 
> I have a patch which I am still testing, but you've managed to
> captivate
> my attention for saying that you are testing 802.1Qch with a software
> implementation of tc-gate.
> 
> Do you have a use case for this? What cycle times are you targeting?
> How are you ensuring that you are deterministically meeting the
> deadlines?

The cycle times I targeted were nowhere near to a realistic TSN
scenario:
I "generated" ping packets in every 100 msecs and on the ingress port
and I marked them with prio 1 for 500ms (gate 1) and prio 2 for another
500ms (gate 2). On the egress port I applied taprio with the same base-
time and same 500-500ms cycles but reverse ordered gates (that's the
"definition" of the Qch), so while act_gate on the ingress is in gate 1
cycle, the taprio kept open gate 2 and gate 1 closed, etc.
For "verification" I simply run a tcpdump on the listener machine what
I pinged from the talker and eyeballed wether the 5-5 packets bursts
shows up arrival timestamps.

> Do you also have a software tc-taprio on the egress device or is that
> at least offloaded?

No, I experimented with the software version, but that worked with my
netns tests and on physical machines too (after the IPV patch).

> 
> I'm asking these questions because the peristaltic shaper is
> primarily
> intended to be used on hardware switches. The patch I'm preparing
> includes changes to struct sk_buff. I just want to know how well I'll
> be
> able to sell these changes to maintainers strongly opposing the
> growth
> of this structure for an exceedingly small niche :)

Can you tell me about the intention behind the sk_buff changes? Does
that required because of some offloading scenario? In my case putting
the IPV into the skb->priority was good enough because the taprio using
that field by default to select the traffic class for the packet.

Thanks,
Ferenc
Vladimir Oltean May 26, 2022, 9:30 a.m. UTC | #8
On Thu, May 26, 2022 at 06:40:22AM +0000, Ferenc Fejes wrote:
> Hi Vladimir!
> 
> On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
> > On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
> > > Hi Vladimir!
> > > 
> > > On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
> > > > Hi Ferenc,
> > > > 
> > > > On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> > > > This is correct. I have been testing only with the offloaded tc-
> > > > gate
> > > > action so I did not notice that the software does not act upon
> > > > the ipv.
> > > > Your proposal sounds straightforward enough. Care to send a bug
> > > > fix patch?
> > > 
> > > Unfortunately I cant, our company policy does not allow direct 
> > > open-source contributions :-(
> > > 
> > > However I would be more than happy if you can fix it.
> > 
> > That's too bad.
> > 
> > I have a patch which I am still testing, but you've managed to
> > captivate
> > my attention for saying that you are testing 802.1Qch with a software
> > implementation of tc-gate.
> > 
> > Do you have a use case for this? What cycle times are you targeting?
> > How are you ensuring that you are deterministically meeting the
> > deadlines?
> 
> The cycle times I targeted were nowhere near to a realistic TSN
> scenario:
> I "generated" ping packets in every 100 msecs and on the ingress port
> and I marked them with prio 1 for 500ms (gate 1) and prio 2 for another
> 500ms (gate 2). On the egress port I applied taprio with the same base-
> time and same 500-500ms cycles but reverse ordered gates (that's the
> "definition" of the Qch), so while act_gate on the ingress is in gate 1
> cycle, the taprio kept open gate 2 and gate 1 closed, etc.
> For "verification" I simply run a tcpdump on the listener machine what
> I pinged from the talker and eyeballed wether the 5-5 packets bursts
> shows up arrival timestamps.
> 
> > Do you also have a software tc-taprio on the egress device or is that
> > at least offloaded?
> 
> No, I experimented with the software version, but that worked with my
> netns tests and on physical machines too (after the IPV patch).
> 
> > 
> > I'm asking these questions because the peristaltic shaper is
> > primarily
> > intended to be used on hardware switches. The patch I'm preparing
> > includes changes to struct sk_buff. I just want to know how well I'll
> > be
> > able to sell these changes to maintainers strongly opposing the
> > growth
> > of this structure for an exceedingly small niche :)
> 
> Can you tell me about the intention behind the sk_buff changes? Does
> that required because of some offloading scenario? In my case putting
> the IPV into the skb->priority was good enough because the taprio using
> that field by default to select the traffic class for the packet.
> 
> Thanks,
> Ferenc
>

Hopefully this patch explains it:

-----------------------------[ cut here ]-----------------------------
From c8d33e0d65a4a63884a626dc04c7190a2bbe122b Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Wed, 25 May 2022 16:27:52 +0300
Subject: [PATCH] net/sched: act_gate: act on Internal Priority Value

IEEE 802.1Q specifies in Annex T ("Cyclic queuing and forwarding") the
use case for an Internal Priority Value set by the PSFP function.
Essentially, the intended use is to be able to select a packet's egress
queue squarely based on its arrival time. Essentially, this means that a
packet with the same VLAN PCP 5 can be delivered to TC 7 or 6 depending
on whether it was received in an "odd" or "even" cycle. To quote the
spec, "The use of the IPV allows this direction of frames to outbound
queues to be independent of the received priority, and also does not
affect the priority associated with the frame on transmission."

The problem is that the software implementation of act_gate takes the
IPV as entry from user space, but does not act on it in its software
implementation - only offloading drivers do.

To fix this, we need to keep a current_ipv variable according to the
gate entry that's currently executed by act_gate, and use this IPV to
overwrite the skb->priority.

In fact, a complication arises due to the following clause from 802.1Q:

| 8.6.6.1 PSFP queuing
| If PSFP is supported (8.6.5.1), and the IPV associated with the stream
| filter that passed the frame is anything other than the null value, then
| that IPV is used to determine the traffic class of the frame, in place
| of the frame's priority, via the Traffic Class Table specified in 8.6.6.
| In all other respects, the queuing actions specified in 8.6.6 are
| unchanged. The IPV is used only to determine the traffic class
| associated with a frame, and hence select an outbound queue; for all
| other purposes, the received priority is used.

An example of layer that we don't want to see the IPV is the egress-qos-map
of a potential VLAN output device. Instead, the VLAN PCP should still be
populated based on the original skb->priority.

I'm aware of the existence of a skb_update_prio() function in
__dev_queue_xmit(), right before passing the skb to the qdisc layer,
for the use case where net_prio cgroups are used to assign processes to
network priorities on a per-interface basis. But rewriting the
skb->priority with the skb->ipv there simply wouldn't work, exactly
because there might be a VLAN in the output path of the skb.

One might suggest: just delay the IPV rewriting in the presence of
stacked devices until it is actually needed and likely to be used,
like when dev->num_tc != 0 (which VLAN doesn't have). But there are
still other uses of skb->priority in the qdisc layer, like skbprio,
htb etc. From the spec's wording it isn't clear that we want these
functions to look at the proper packet user priority or at the PSFP IPV.
Probably the former.

So the only implementation that conforms to these requirements
seems to be the invasive one, where we introduce a dedicated
helper for the pattern where drivers and the core ask for
netdev_get_prio_tc_map(dev, skb->priority). We replace all such
instances with a helper that selects the TC based on IPV, if the skb
was filtered by PSFP time gates on ingress, and falls back to
skb->priority otherwise.

Fixes: a51c328df310 ("net: qos: introduce a gate control flow action")
Reported-by: Ferenc Fejes <ferenc.fejes@ericsson.com>
Link: https://lore.kernel.org/netdev/87o80h1n65.fsf@kurt/T/#meaec9c24b3add9cb11edd411278a3a6ecf01573e
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
 include/linux/netdevice.h                     | 13 +++++++++++++
 include/linux/skbuff.h                        | 11 +++++++++++
 include/net/tc_act/tc_gate.h                  |  1 +
 net/core/dev.c                                |  4 ++--
 net/dsa/tag_ocelot.c                          |  2 +-
 net/sched/act_gate.c                          |  6 ++++++
 net/sched/sch_taprio.c                        | 12 +++---------
 8 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 77c2e70b0860..719259af9aaa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8531,7 +8531,7 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 	int txq;
 
 	if (sb_dev) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		u8 tc = skb_get_prio_tc_map(skb, dev);
 		struct net_device *vdev = sb_dev;
 
 		txq = vdev->tc_to_txq[tc].offset;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 47b59f99b037..8b87d017288f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2366,6 +2366,19 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
 	return 0;
 }
 
+static inline int skb_get_prio_tc_map(const struct sk_buff *skb,
+				      const struct net_device *dev)
+{
+	__u32 prio = skb->priority;
+
+#if IS_ENABLED(CONFIG_NET_ACT_GATE)
+	if (skb->use_ipv)
+		prio = skb->ipv;
+#endif
+
+	return netdev_get_prio_tc_map(dev, prio);
+}
+
 int netdev_txq_to_tc(struct net_device *dev, unsigned int txq);
 void netdev_reset_tc(struct net_device *dev);
 int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index da96f0d3e753..b0a463c0bc65 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -913,6 +913,9 @@ typedef unsigned char *sk_buff_data_t;
  *	@csum_start: Offset from skb->head where checksumming should start
  *	@csum_offset: Offset from csum_start where checksum should be stored
  *	@priority: Packet queueing priority
+ *	@use_ipv: Packet internal priority was altered by PSFP
+ *	@ipv: Internal Priority Value, overrides priority for traffic class
+ *		selection
  *	@ignore_df: allow local fragmentation
  *	@cloned: Head may be cloned (check refcnt to be sure)
  *	@ip_summed: Driver fed us an IP checksum
@@ -1145,6 +1148,10 @@ struct sk_buff {
 	__u8			slow_gro:1;
 	__u8			csum_not_inet:1;
 
+#ifdef CONFIG_NET_ACT_GATE
+	__u8			use_ipv:1;
+#endif
+
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
 #endif
@@ -1209,6 +1216,10 @@ struct sk_buff {
 	/* only useable after checking ->active_extensions != 0 */
 	struct skb_ext		*extensions;
 #endif
+
+#ifdef CONFIG_NET_ACT_GATE
+	__u32			ipv;
+#endif
 };
 
 /* if you move pkt_type around you also must adapt those constants */
diff --git a/include/net/tc_act/tc_gate.h b/include/net/tc_act/tc_gate.h
index c8fa11ebb397..b05c2c7d78e4 100644
--- a/include/net/tc_act/tc_gate.h
+++ b/include/net/tc_act/tc_gate.h
@@ -44,6 +44,7 @@ struct tcf_gate {
 	ktime_t			current_close_time;
 	u32			current_entry_octets;
 	s32			current_max_octets;
+	s32			current_ipv;
 	struct tcfg_gate_entry	*next_entry;
 	struct hrtimer		hitimer;
 	enum tk_offsets		tk_offset;
diff --git a/net/core/dev.c b/net/core/dev.c
index 721ba9c26554..956aa227c260 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3194,7 +3194,7 @@ static u16 skb_tx_hash(const struct net_device *dev,
 	u16 qcount = dev->real_num_tx_queues;
 
 	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		u8 tc = skb_get_prio_tc_map(skb, dev);
 
 		qoffset = sb_dev->tc_to_txq[tc].offset;
 		qcount = sb_dev->tc_to_txq[tc].count;
@@ -4002,7 +4002,7 @@ EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
 static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
 			       struct xps_dev_maps *dev_maps, unsigned int tci)
 {
-	int tc = netdev_get_prio_tc_map(dev, skb->priority);
+	int tc = skb_get_prio_tc_map(skb, dev);
 	struct xps_map *map;
 	int queue_index = -1;
 
diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c
index 0d81f172b7a6..036e746f18eb 100644
--- a/net/dsa/tag_ocelot.c
+++ b/net/dsa/tag_ocelot.c
@@ -52,7 +52,7 @@ static void ocelot_xmit_common(struct sk_buff *skb, struct net_device *netdev,
 	ocelot_xmit_get_vlan_info(skb, dp, &vlan_tci, &tag_type);
 
 	qos_class = netdev_get_num_tc(netdev) ?
-		    netdev_get_prio_tc_map(netdev, skb->priority) : skb->priority;
+		    skb_get_prio_tc_map(skb, netdev) : skb->priority;
 
 	injection = skb_push(skb, OCELOT_TAG_LEN);
 	prefix = skb_push(skb, OCELOT_SHORT_PREFIX_LEN);
diff --git a/net/sched/act_gate.c b/net/sched/act_gate.c
index fd5155274733..9fb248b104f8 100644
--- a/net/sched/act_gate.c
+++ b/net/sched/act_gate.c
@@ -81,6 +81,7 @@ static enum hrtimer_restart gate_timer_func(struct hrtimer *timer)
 	gact->current_gate_status = next->gate_state ? GATE_ACT_GATE_OPEN : 0;
 	gact->current_entry_octets = 0;
 	gact->current_max_octets = next->maxoctets;
+	gact->current_ipv = next->ipv;
 
 	gact->current_close_time = ktime_add_ns(gact->current_close_time,
 						next->interval);
@@ -140,6 +141,11 @@ static int tcf_gate_act(struct sk_buff *skb, const struct tc_action *a,
 		}
 	}
 
+	if (gact->current_ipv >= 0) {
+		skb->use_ipv = true;
+		skb->ipv = gact->current_ipv;
+	}
+
 	spin_unlock(&gact->tcf_lock);
 
 	return gact->tcf_action;
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index b9c71a304d39..fb8bc17e38bb 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -201,7 +201,7 @@ static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb,
 	s32 cycle_elapsed;
 	int tc, n;
 
-	tc = netdev_get_prio_tc_map(dev, skb->priority);
+	tc = skb_get_prio_tc_map(skb, dev);
 	packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
 
 	*interval_start = 0;
@@ -509,7 +509,6 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
 
 	for (i = 0; i < dev->num_tx_queues; i++) {
 		struct Qdisc *child = q->qdiscs[i];
-		int prio;
 		u8 tc;
 
 		if (unlikely(!child))
@@ -522,9 +521,7 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
 		if (TXTIME_ASSIST_IS_ENABLED(q->flags))
 			return skb;
 
-		prio = skb->priority;
-		tc = netdev_get_prio_tc_map(dev, prio);
-
+		tc = skb_get_prio_tc_map(skb, dev);
 		if (!(gate_mask & BIT(tc)))
 			continue;
 
@@ -579,7 +576,6 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
 	for (i = 0; i < dev->num_tx_queues; i++) {
 		struct Qdisc *child = q->qdiscs[i];
 		ktime_t guard;
-		int prio;
 		int len;
 		u8 tc;
 
@@ -597,9 +593,7 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
 		if (!skb)
 			continue;
 
-		prio = skb->priority;
-		tc = netdev_get_prio_tc_map(dev, prio);
-
+		tc = skb_get_prio_tc_map(skb, dev);
 		if (!(gate_mask & BIT(tc))) {
 			skb = NULL;
 			continue;
-----------------------------[ cut here ]-----------------------------
Ferenc Fejes Feb. 16, 2023, 1:47 p.m. UTC | #9
Hi!

Sorry for the delay.

On Thu, 2022-05-26 at 09:30 +0000, Vladimir Oltean wrote:
> On Thu, May 26, 2022 at 06:40:22AM +0000, Ferenc Fejes wrote:
> > Hi Vladimir!
> > 
> > On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
> > > On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
> > > > Hi Vladimir!
> > > > 
> > > > On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
> > > > > Hi Ferenc,
> > > > > 
> > > > > On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> > > > > This is correct. I have been testing only with the offloaded
> > > > > tc-
> > > > > gate
> > > > > action so I did not notice that the software does not act
> > > > > upon
> > > > > the ipv.
> > > > > Your proposal sounds straightforward enough. Care to send a
> > > > > bug
> > > > > fix patch?
> > > > 
> > > > Unfortunately I cant, our company policy does not allow direct 
> > > > open-source contributions :-(
> > > > 
> > > > However I would be more than happy if you can fix it.
> > > 
> > > That's too bad.
> > > 
> > > I have a patch which I am still testing, but you've managed to
> > > captivate
> > > my attention for saying that you are testing 802.1Qch with a
> > > software
> > > implementation of tc-gate.
> > > 
> > > Do you have a use case for this? What cycle times are you
> > > targeting?
> > > How are you ensuring that you are deterministically meeting the
> > > deadlines?
> > 
> > The cycle times I targeted were nowhere near to a realistic TSN
> > scenario:
> > I "generated" ping packets in every 100 msecs and on the ingress
> > port
> > and I marked them with prio 1 for 500ms (gate 1) and prio 2 for
> > another
> > 500ms (gate 2). On the egress port I applied taprio with the same
> > base-
> > time and same 500-500ms cycles but reverse ordered gates (that's
> > the
> > "definition" of the Qch), so while act_gate on the ingress is in
> > gate 1
> > cycle, the taprio kept open gate 2 and gate 1 closed, etc.
> > For "verification" I simply run a tcpdump on the listener machine
> > what
> > I pinged from the talker and eyeballed wether the 5-5 packets
> > bursts
> > shows up arrival timestamps.
> > 
> > > Do you also have a software tc-taprio on the egress device or is
> > > that
> > > at least offloaded?
> > 
> > No, I experimented with the software version, but that worked with
> > my
> > netns tests and on physical machines too (after the IPV patch).
> > 
> > > 
> > > I'm asking these questions because the peristaltic shaper is
> > > primarily
> > > intended to be used on hardware switches. The patch I'm preparing
> > > includes changes to struct sk_buff. I just want to know how well
> > > I'll
> > > be
> > > able to sell these changes to maintainers strongly opposing the
> > > growth
> > > of this structure for an exceedingly small niche :)
> > 
> > Can you tell me about the intention behind the sk_buff changes?
> > Does
> > that required because of some offloading scenario? In my case
> > putting
> > the IPV into the skb->priority was good enough because the taprio
> > using
> > that field by default to select the traffic class for the packet.
> > 
> > Thanks,
> > Ferenc
> > 
> 
> Hopefully this patch explains it:
> 
> -----------------------------[ cut here ]----------------------------
> -
> From c8d33e0d65a4a63884a626dc04c7190a2bbe122b Mon Sep 17 00:00:00
> 2001
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> Date: Wed, 25 May 2022 16:27:52 +0300
> Subject: [PATCH] net/sched: act_gate: act on Internal Priority Value
> 
> IEEE 802.1Q specifies in Annex T ("Cyclic queuing and forwarding")
> the
> use case for an Internal Priority Value set by the PSFP function.
> Essentially, the intended use is to be able to select a packet's
> egress
> queue squarely based on its arrival time. Essentially, this means
> that a
> packet with the same VLAN PCP 5 can be delivered to TC 7 or 6
> depending
> on whether it was received in an "odd" or "even" cycle. To quote the
> spec, "The use of the IPV allows this direction of frames to outbound
> queues to be independent of the received priority, and also does not
> affect the priority associated with the frame on transmission."
> 
> The problem is that the software implementation of act_gate takes the
> IPV as entry from user space, but does not act on it in its software
> implementation - only offloading drivers do.
> 
> To fix this, we need to keep a current_ipv variable according to the
> gate entry that's currently executed by act_gate, and use this IPV to
> overwrite the skb->priority.
> 
> In fact, a complication arises due to the following clause from
> 802.1Q:
> 
> > 8.6.6.1 PSFP queuing
> > If PSFP is supported (8.6.5.1), and the IPV associated with the
> > stream
> > filter that passed the frame is anything other than the null value,
> > then
> > that IPV is used to determine the traffic class of the frame, in
> > place
> > of the frame's priority, via the Traffic Class Table specified in
> > 8.6.6.
> > In all other respects, the queuing actions specified in 8.6.6 are
> > unchanged. The IPV is used only to determine the traffic class
> > associated with a frame, and hence select an outbound queue; for
> > all
> > other purposes, the received priority is used.

Interesting. In my understanding this indeed something like "modify the
queueing decision while preserve the original PCP value".

> 
> An example of layer that we don't want to see the IPV is the egress-
> qos-map
> of a potential VLAN output device. Instead, the VLAN PCP should still
> be
> populated based on the original skb->priority.
> 
> I'm aware of the existence of a skb_update_prio() function in
> __dev_queue_xmit(), right before passing the skb to the qdisc layer,
> for the use case where net_prio cgroups are used to assign processes
> to
> network priorities on a per-interface basis. But rewriting the
> skb->priority with the skb->ipv there simply wouldn't work, exactly
> because there might be a VLAN in the output path of the skb.

So the problem is currently we use skb->priority with two semantics:
1. for egress-qos-mapping which set the PCP of the VLAN header based on
skb->priority
2. for selecting the traffic class (and the tx queues with that as
well).

> 
> One might suggest: just delay the IPV rewriting in the presence of
> stacked devices until it is actually needed and likely to be used,
> like when dev->num_tc != 0 (which VLAN doesn't have). But there are
> still other uses of skb->priority in the qdisc layer, like skbprio,
> htb etc. From the spec's wording it isn't clear that we want these
> functions to look at the proper packet user priority or at the PSFP
> IPV.
> Probably the former.
> 
> So the only implementation that conforms to these requirements
> seems to be the invasive one, where we introduce a dedicated
> helper for the pattern where drivers and the core ask for
> netdev_get_prio_tc_map(dev, skb->priority). We replace all such
> instances with a helper that selects the TC based on IPV, if the skb
> was filtered by PSFP time gates on ingress, and falls back to
> skb->priority otherwise.
> 
> Fixes: a51c328df310 ("net: qos: introduce a gate control flow
> action")
> Reported-by: Ferenc Fejes <ferenc.fejes@ericsson.com>
> Link:
> https://lore.kernel.org/netdev/87o80h1n65.fsf@kurt/T/#meaec9c24b3add9cb11edd411278a3a6ecf01573e
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
>  include/linux/netdevice.h                     | 13 +++++++++++++
>  include/linux/skbuff.h                        | 11 +++++++++++
>  include/net/tc_act/tc_gate.h                  |  1 +
>  net/core/dev.c                                |  4 ++--
>  net/dsa/tag_ocelot.c                          |  2 +-
>  net/sched/act_gate.c                          |  6 ++++++
>  net/sched/sch_taprio.c                        | 12 +++---------
>  8 files changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 77c2e70b0860..719259af9aaa 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -8531,7 +8531,7 @@ static u16 ixgbe_select_queue(struct net_device
> *dev, struct sk_buff *skb,
>         int txq;
>  
>         if (sb_dev) {
> -               u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
> +               u8 tc = skb_get_prio_tc_map(skb, dev);
>                 struct net_device *vdev = sb_dev;
>  
>                 txq = vdev->tc_to_txq[tc].offset;
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 47b59f99b037..8b87d017288f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2366,6 +2366,19 @@ int netdev_set_prio_tc_map(struct net_device
> *dev, u8 prio, u8 tc)
>         return 0;
>  }
>  
> +static inline int skb_get_prio_tc_map(const struct sk_buff *skb,
> +                                     const struct net_device *dev)
> +{
> +       __u32 prio = skb->priority;
> +
> +#if IS_ENABLED(CONFIG_NET_ACT_GATE)
> +       if (skb->use_ipv)
> +               prio = skb->ipv;
> +#endif
> +
> +       return netdev_get_prio_tc_map(dev, prio);
> +}
> +
>  int netdev_txq_to_tc(struct net_device *dev, unsigned int txq);
>  void netdev_reset_tc(struct net_device *dev);
>  int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count,
> u16 offset);
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index da96f0d3e753..b0a463c0bc65 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -913,6 +913,9 @@ typedef unsigned char *sk_buff_data_t;
>   *     @csum_start: Offset from skb->head where checksumming should
> start
>   *     @csum_offset: Offset from csum_start where checksum should be
> stored
>   *     @priority: Packet queueing priority
> + *     @use_ipv: Packet internal priority was altered by PSFP
> + *     @ipv: Internal Priority Value, overrides priority for traffic
> class
> + *             selection
>   *     @ignore_df: allow local fragmentation
>   *     @cloned: Head may be cloned (check refcnt to be sure)
>   *     @ip_summed: Driver fed us an IP checksum
> @@ -1145,6 +1148,10 @@ struct sk_buff {
>         __u8                    slow_gro:1;
>         __u8                    csum_not_inet:1;
>  
> +#ifdef CONFIG_NET_ACT_GATE
> +       __u8                    use_ipv:1;
> +#endif
> +
>  #ifdef CONFIG_NET_SCHED
>         __u16                   tc_index;       /* traffic control
> index */
>  #endif
> @@ -1209,6 +1216,10 @@ struct sk_buff {
>         /* only useable after checking ->active_extensions != 0 */
>         struct skb_ext          *extensions;
>  #endif
> +
> +#ifdef CONFIG_NET_ACT_GATE
> +       __u32                   ipv;
> +#endif
>  };
>  
>  /* if you move pkt_type around you also must adapt those constants
> */
> diff --git a/include/net/tc_act/tc_gate.h
> b/include/net/tc_act/tc_gate.h
> index c8fa11ebb397..b05c2c7d78e4 100644
> --- a/include/net/tc_act/tc_gate.h
> +++ b/include/net/tc_act/tc_gate.h
> @@ -44,6 +44,7 @@ struct tcf_gate {
>         ktime_t                 current_close_time;
>         u32                     current_entry_octets;
>         s32                     current_max_octets;
> +       s32                     current_ipv;
>         struct tcfg_gate_entry  *next_entry;
>         struct hrtimer          hitimer;
>         enum tk_offsets         tk_offset;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 721ba9c26554..956aa227c260 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3194,7 +3194,7 @@ static u16 skb_tx_hash(const struct net_device
> *dev,
>         u16 qcount = dev->real_num_tx_queues;
>  
>         if (dev->num_tc) {
> -               u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
> +               u8 tc = skb_get_prio_tc_map(skb, dev);
>  
>                 qoffset = sb_dev->tc_to_txq[tc].offset;
>                 qcount = sb_dev->tc_to_txq[tc].count;
> @@ -4002,7 +4002,7 @@ EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
>  static int __get_xps_queue_idx(struct net_device *dev, struct
> sk_buff *skb,
>                                struct xps_dev_maps *dev_maps,
> unsigned int tci)
>  {
> -       int tc = netdev_get_prio_tc_map(dev, skb->priority);
> +       int tc = skb_get_prio_tc_map(skb, dev);
>         struct xps_map *map;
>         int queue_index = -1;
>  
> diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c
> index 0d81f172b7a6..036e746f18eb 100644
> --- a/net/dsa/tag_ocelot.c
> +++ b/net/dsa/tag_ocelot.c
> @@ -52,7 +52,7 @@ static void ocelot_xmit_common(struct sk_buff *skb,
> struct net_device *netdev,
>         ocelot_xmit_get_vlan_info(skb, dp, &vlan_tci, &tag_type);
>  
>         qos_class = netdev_get_num_tc(netdev) ?
> -                   netdev_get_prio_tc_map(netdev, skb->priority) :
> skb->priority;
> +                   skb_get_prio_tc_map(skb, netdev) : skb->priority;
>  
>         injection = skb_push(skb, OCELOT_TAG_LEN);
>         prefix = skb_push(skb, OCELOT_SHORT_PREFIX_LEN);
> diff --git a/net/sched/act_gate.c b/net/sched/act_gate.c
> index fd5155274733..9fb248b104f8 100644
> --- a/net/sched/act_gate.c
> +++ b/net/sched/act_gate.c
> @@ -81,6 +81,7 @@ static enum hrtimer_restart gate_timer_func(struct
> hrtimer *timer)
>         gact->current_gate_status = next->gate_state ?
> GATE_ACT_GATE_OPEN : 0;
>         gact->current_entry_octets = 0;
>         gact->current_max_octets = next->maxoctets;
> +       gact->current_ipv = next->ipv;
>  
>         gact->current_close_time = ktime_add_ns(gact-
> >current_close_time,
>                                                 next->interval);
> @@ -140,6 +141,11 @@ static int tcf_gate_act(struct sk_buff *skb,
> const struct tc_action *a,
>                 }
>         }
>  
> +       if (gact->current_ipv >= 0) {
> +               skb->use_ipv = true;
> +               skb->ipv = gact->current_ipv;
> +       }
> +
>         spin_unlock(&gact->tcf_lock);
>  
>         return gact->tcf_action;
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index b9c71a304d39..fb8bc17e38bb 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -201,7 +201,7 @@ static struct sched_entry
> *find_entry_to_transmit(struct sk_buff *skb,
>         s32 cycle_elapsed;
>         int tc, n;
>  
> -       tc = netdev_get_prio_tc_map(dev, skb->priority);
> +       tc = skb_get_prio_tc_map(skb, dev);
>         packet_transmit_time = length_to_duration(q,
> qdisc_pkt_len(skb));
>  
>         *interval_start = 0;
> @@ -509,7 +509,6 @@ static struct sk_buff *taprio_peek_soft(struct
> Qdisc *sch)
>  
>         for (i = 0; i < dev->num_tx_queues; i++) {
>                 struct Qdisc *child = q->qdiscs[i];
> -               int prio;
>                 u8 tc;
>  
>                 if (unlikely(!child))
> @@ -522,9 +521,7 @@ static struct sk_buff *taprio_peek_soft(struct
> Qdisc *sch)
>                 if (TXTIME_ASSIST_IS_ENABLED(q->flags))
>                         return skb;
>  
> -               prio = skb->priority;
> -               tc = netdev_get_prio_tc_map(dev, prio);
> -
> +               tc = skb_get_prio_tc_map(skb, dev);
>                 if (!(gate_mask & BIT(tc)))
>                         continue;
>  
> @@ -579,7 +576,6 @@ static struct sk_buff *taprio_dequeue_soft(struct
> Qdisc *sch)
>         for (i = 0; i < dev->num_tx_queues; i++) {
>                 struct Qdisc *child = q->qdiscs[i];
>                 ktime_t guard;
> -               int prio;
>                 int len;
>                 u8 tc;
>  
> @@ -597,9 +593,7 @@ static struct sk_buff *taprio_dequeue_soft(struct
> Qdisc *sch)
>                 if (!skb)
>                         continue;
>  
> -               prio = skb->priority;
> -               tc = netdev_get_prio_tc_map(dev, prio);
> -
> +               tc = skb_get_prio_tc_map(skb, dev);
>                 if (!(gate_mask & BIT(tc))) {
>                         skb = NULL;
>                         continue;
> -----------------------------[ cut here ]----------------------------
> -


Your solution certainly correct and do the differentiation between the
cases where we have PSFP at ingress or not. However in my understanding
the only purpose of the IPV is the traffic class selection. 

Setting skb->queue_mapping to IPV probably wont work, because of two
reasons:
1. it brings inconsistency with the mqprio/taprio queues and the actual
hw rings the traffic sent
2. some drivers dont check if skb->queue_mapping is bounded, they
expect its smaller than the num_tx_queues.

The 2. might be solvable, but the 1. is more problematic. However with
a helper, we might check if skb->queue_mapping is already set and use
that as a traffic class. Is that possible? I dont really see any other
codepath where that value can be other than zero before the qdisc
layer. That way one flag (use_ipv) might be enough which tells that we
should use the skb->queue_mapping as is (set by act_gate) and preserve
skb->priority.

What do you think? Again, sorry for being late here, but I'm following
the list and see that recently you did major mqprio/taprio fixes and
refactors, so I hope your cache line is hot.

Best,
Ferenc
Vladimir Oltean Feb. 16, 2023, 3:58 p.m. UTC | #10
On Thu, Feb 16, 2023 at 02:47:11PM +0100, Ferenc Fejes wrote:
> > To fix this, we need to keep a current_ipv variable according to the
> > gate entry that's currently executed by act_gate, and use this IPV to
> > overwrite the skb->priority.
> > 
> > In fact, a complication arises due to the following clause from
> > 802.1Q:
> > 
> > > 8.6.6.1 PSFP queuing
> > > If PSFP is supported (8.6.5.1), and the IPV associated with the stream
> > > filter that passed the frame is anything other than the null value, then
> > > that IPV is used to determine the traffic class of the frame, in place
> > > of the frame's priority, via the Traffic Class Table specified in 8.6.6.
> > > In all other respects, the queuing actions specified in 8.6.6 are
> > > unchanged. The IPV is used only to determine the traffic class
> > > associated with a frame, and hence select an outbound queue; for all
> > > other purposes, the received priority is used.
> 
> Interesting. In my understanding this indeed something like "modify the
> queueing decision while preserve the original PCP value".

I actually wonder what the interpretation in the spirit of this clause is.
I'm known to over-interpret the text of these IEEE standards and finding
inconsistencies in the process (being able to prove both A and !A).

For example, if we consider the framePreemptionAdminStatus (express or
preemptible). It is expressed per priority. So this means that the
Internal Priority Value rewritten by PSFP will not affect the express/
preemptible nature of a frame, but it *will* affect the traffic class
selection?

This is insane, because it means that to enforce the requirements of
clause 12.30.1.1.1:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

it is insufficient to check that the prio_tc_map does not make some
traffic classes both express and preemptible. One has to also check the
tc-gate configuration, because the literal interpretation of the standard
would suggest that a packet is preemptible or express based on skb->priority,
but is classified to a traffic class based on skb->ipv. So I guess IPV
rewriting can only take place between one preemptible priority and another,
or only between one express priority and another, and that we should
somehow test this. But PSFP is at ingress, and FP is at egress, so we'd
somehow have to test the FP adminStatus of all the other net devices
that we can forward to!!!

I'll try to ask around and see if anyone knows more details about what
is the expected behavior.

> Your solution certainly correct and do the differentiation between the
> cases where we have PSFP at ingress or not. However in my understanding
> the only purpose of the IPV is the traffic class selection. 
> 
> Setting skb->queue_mapping to IPV probably wont work, because of two
> reasons:
> 1. it brings inconsistency with the mqprio/taprio queues and the actual
> hw rings the traffic sent
> 2. some drivers dont check if skb->queue_mapping is bounded, they
> expect its smaller than the num_tx_queues.
> 
> The 2. might be solvable, but the 1. is more problematic. However with
> a helper, we might check if skb->queue_mapping is already set and use
> that as a traffic class. Is that possible? I dont really see any other
> codepath where that value can be other than zero before the qdisc
> layer. That way one flag (use_ipv) might be enough which tells that we
> should use the skb->queue_mapping as is (set by act_gate) and preserve
> skb->priority.
> 
> What do you think? Again, sorry for being late here, but I'm following
> the list and see that recently you did major mqprio/taprio fixes and
> refactors, so I hope your cache line is hot.

I'm afraid it's not so easy to reuse this field for the IPV, because
skb->queue_mapping is already used between the RX and the TX path for
"recording the RX queue", see skb_rx_queue_recorded(), skb_record_rx_queue(),
skb_get_rx_queue().
Ferenc Fejes Feb. 17, 2023, 8:03 a.m. UTC | #11
Hi Vladimir!

On Thu, 2023-02-16 at 17:58 +0200, Vladimir Oltean wrote:
> On Thu, Feb 16, 2023 at 02:47:11PM +0100, Ferenc Fejes wrote:
> > > To fix this, we need to keep a current_ipv variable according to
> > > the
> > > gate entry that's currently executed by act_gate, and use this
> > > IPV to
> > > overwrite the skb->priority.
> > > 
> > > In fact, a complication arises due to the following clause from
> > > 802.1Q:
> > > 
> > > > 8.6.6.1 PSFP queuing
> > > > If PSFP is supported (8.6.5.1), and the IPV associated with the
> > > > stream
> > > > filter that passed the frame is anything other than the null
> > > > value, then
> > > > that IPV is used to determine the traffic class of the frame,
> > > > in place
> > > > of the frame's priority, via the Traffic Class Table specified
> > > > in 8.6.6.
> > > > In all other respects, the queuing actions specified in 8.6.6
> > > > are
> > > > unchanged. The IPV is used only to determine the traffic class
> > > > associated with a frame, and hence select an outbound queue;
> > > > for all
> > > > other purposes, the received priority is used.
> > 
> > Interesting. In my understanding this indeed something like "modify
> > the
> > queueing decision while preserve the original PCP value".
> 
> I actually wonder what the interpretation in the spirit of this
> clause is.
> I'm known to over-interpret the text of these IEEE standards and
> finding
> inconsistencies in the process (being able to prove both A and !A).

I agree, it takes time to guess what the intention behind the wording
of the standard in some cases. I have the standard in front of me right
now and its 2163 pages... Even if I grep to IPV, the context is
overwhelmingly dense.

> 
> For example, if we consider the framePreemptionAdminStatus (express
> or
> preemptible). It is expressed per priority. So this means that the
> Internal Priority Value rewritten by PSFP will not affect the
> express/
> preemptible nature of a frame, but it *will* affect the traffic class
> selection?
> 
> This is insane, because it means that to enforce the requirements of
> clause 12.30.1.1.1:
> 
> > Priorities that all map to the same traffic class should be
> > constrained to use the same value of preemption status.
> 
> it is insufficient to check that the prio_tc_map does not make some
> traffic classes both express and preemptible. One has to also check
> the
> tc-gate configuration, because the literal interpretation of the
> standard
> would suggest that a packet is preemptible or express based on skb-
> >priority,
> but is classified to a traffic class based on skb->ipv. So I guess
> IPV
> rewriting can only take place between one preemptible priority and
> another,
> or only between one express priority and another, and that we should
> somehow test this. But PSFP is at ingress, and FP is at egress, so
> we'd
> somehow have to test the FP adminStatus of all the other net devices
> that we can forward to!!!
> 
> I'll try to ask around and see if anyone knows more details about
> what
> is the expected behavior.

I'll try to ask around too, thanks for pointing this out. My best
understanding from the IPV that the standard treat it as skb->priority.
It defines IPV as a 32bit signed value, which clearly imply similar
semantics as skb->priority, which can be much larger than the number of
the queues or traffic classes.

> 
> > Your solution certainly correct and do the differentiation between
> > the
> > cases where we have PSFP at ingress or not. However in my
> > understanding
> > the only purpose of the IPV is the traffic class selection. 
> > 
> > Setting skb->queue_mapping to IPV probably wont work, because of
> > two
> > reasons:
> > 1. it brings inconsistency with the mqprio/taprio queues and the
> > actual
> > hw rings the traffic sent
> > 2. some drivers dont check if skb->queue_mapping is bounded, they
> > expect its smaller than the num_tx_queues.
> > 
> > The 2. might be solvable, but the 1. is more problematic. However
> > with
> > a helper, we might check if skb->queue_mapping is already set and
> > use
> > that as a traffic class. Is that possible? I dont really see any
> > other
> > codepath where that value can be other than zero before the qdisc
> > layer. That way one flag (use_ipv) might be enough which tells that
> > we
> > should use the skb->queue_mapping as is (set by act_gate) and
> > preserve
> > skb->priority.
> > 
> > What do you think? Again, sorry for being late here, but I'm
> > following
> > the list and see that recently you did major mqprio/taprio fixes
> > and
> > refactors, so I hope your cache line is hot.
> 
> I'm afraid it's not so easy to reuse this field for the IPV, because
> skb->queue_mapping is already used between the RX and the TX path for
> "recording the RX queue", see skb_rx_queue_recorded(),
> skb_record_rx_queue(),
> skb_get_rx_queue().

Oh, alright. I continue to think about alternatives over introducing
new members into sk_buff. It would be very nice to have proper act_gate
IPV handling without hardware offload. Its great to see the support of
frame preemption and PSFP support in more and more hardware but on the
other hand it makes the lack of the proper software mode operation more
and more awkward.

Best,
Ferenc
Vladimir Oltean Feb. 17, 2023, 1:07 p.m. UTC | #12
Hi Ferenc,

On Fri, Feb 17, 2023 at 09:03:30AM +0100, Ferenc Fejes wrote:
> I agree, it takes time to guess what the intention behind the wording
> of the standard in some cases. I have the standard in front of me right
> now and its 2163 pages... Even if I grep to IPV, the context is
> overwhelmingly dense.
> 
(...)
> I'll try to ask around too, thanks for pointing this out. My best
> understanding from the IPV that the standard treat it as skb->priority.
> It defines IPV as a 32bit signed value, which clearly imply similar
> semantics as skb->priority, which can be much larger than the number of
> the queues or traffic classes.

What would you say if we made the software act_gate implementation
simply alter skb->priority, which would potentially affect more stuff
including the egress-qos-map of a VLAN device in the output path of the
skb? It would definitely put less pressure on the networking data
structures, at the price of leaving an exceedingly unlikely case
uncovered.

> Oh, alright. I continue to think about alternatives over introducing
> new members into sk_buff. It would be very nice to have proper act_gate
> IPV handling without hardware offload. Its great to see the support of
> frame preemption and PSFP support in more and more hardware but on the
> other hand it makes the lack of the proper software mode operation more
> and more awkward.

I'm not sure that cyclic queuing and forwarding done with software
forwarding is going to be that practical anyway?
Ferenc Fejes Feb. 17, 2023, 2:35 p.m. UTC | #13
Hi Vladimir!

On Fri, 2023-02-17 at 15:07 +0200, Vladimir Oltean wrote:
> Hi Ferenc,
> 
> On Fri, Feb 17, 2023 at 09:03:30AM +0100, Ferenc Fejes wrote:
> > I agree, it takes time to guess what the intention behind the
> > wording
> > of the standard in some cases. I have the standard in front of me
> > right
> > now and its 2163 pages... Even if I grep to IPV, the context is
> > overwhelmingly dense.
> > 
> (...)
> > I'll try to ask around too, thanks for pointing this out. My best
> > understanding from the IPV that the standard treat it as skb-
> > >priority.
> > It defines IPV as a 32bit signed value, which clearly imply similar
> > semantics as skb->priority, which can be much larger than the
> > number of
> > the queues or traffic classes.
> 
> What would you say if we made the software act_gate implementation
> simply alter skb->priority, which would potentially affect more stuff
> including the egress-qos-map of a VLAN device in the output path of
> the
> skb? It would definitely put less pressure on the networking data
> structures, at the price of leaving an exceedingly unlikely case
> uncovered.

This is exactly what I just started to write in my reply! Yes, this
would be the right choice. The key here is the "exceedingly unlikely
case" what you just mentioned.

If you are lucky enough to have the luxury to think about cases where
the IPV mapping should only affect the queueing and not the egress-qos-
map, it would be nice to have options. Sadly mqprio and taprio
classless, so you can't do the following as far as I understand it:

1. Configure tc act_gate with IPV-s like 0, 1, 2, etc.
2. Configure mqprio/taprio with prio to tc mapping for 0, 1, 2, etc.
3. As mqprio/taprio leaf-s you can apply tc skbedit/bpf direct actions
which altering the skb->priority to values you would like to see at
egress-to-qos mapping
4. The egress-to-qos mapping act on the new skb->priority

However what if mqprio/taprio can read prio:tc map like:
tc qdisc add dev eth0 root mqprio num_tc 4 map 1000:0 1001:1 1000:2
1003:3 queues 1@0 1@1 1@2 1@3 hw 0
I think that dont necessarily break the existing script since we can
check if ":" if in the map parameters or not.
That way skb->priority ---> tc would be flexible (as in egress-qos-
map), the tc ---> queue mapping flexible too, and the original skb-
>priority can control the egress-qos-map differently than the queueing
that way.
I miss something fundamental here?


> 
> > Oh, alright. I continue to think about alternatives over
> > introducing
> > new members into sk_buff. It would be very nice to have proper
> > act_gate
> > IPV handling without hardware offload. Its great to see the support
> > of
> > frame preemption and PSFP support in more and more hardware but on
> > the
> > other hand it makes the lack of the proper software mode operation
> > more
> > and more awkward.
> 
> I'm not sure that cyclic queuing and forwarding done with software
> forwarding is going to be that practical anyway?

VNFs can perform PSFP as well or at least reprioritize the packets.
Also it would be handy for selftests too. Other than that future driver
implementers can verify their operation with the software behavior.

Best,
Ferenc
diff mbox series

Patch

diff --git a/tools/testing/selftests/drivers/net/ocelot/psfp.sh b/tools/testing/selftests/drivers/net/ocelot/psfp.sh
new file mode 100755
index 000000000000..5a5cee92c665
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/ocelot/psfp.sh
@@ -0,0 +1,327 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright 2021-2022 NXP
+
+# Note: On LS1028A, in lack of enough user ports, this setup requires patching
+# the device tree to use the second CPU port as a user port
+
+WAIT_TIME=1
+NUM_NETIFS=4
+STABLE_MAC_ADDRS=yes
+NETIF_CREATE=no
+lib_dir=$(dirname $0)/../../../net/forwarding
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/tsn_lib.sh
+
+UDS_ADDRESS_H1="/var/run/ptp4l_h1"
+UDS_ADDRESS_SWP1="/var/run/ptp4l_swp1"
+
+# Tunables
+NUM_PKTS=1000
+STREAM_VID=100
+STREAM_PRIO=6
+# Use a conservative cycle of 10 ms to allow the test to still pass when the
+# kernel has some extra overhead like lockdep etc
+CYCLE_TIME_NS=10000000
+# Create two Gate Control List entries, one OPEN and one CLOSE, of equal
+# durations
+GATE_DURATION_NS=$((${CYCLE_TIME_NS} / 2))
+# Give 2/3 of the cycle time to user space and 1/3 to the kernel
+FUDGE_FACTOR=$((${CYCLE_TIME_NS} / 3))
+# Shift the isochron base time by half the gate time, so that packets are
+# always received by swp1 close to the middle of the time slot, to minimize
+# inaccuracies due to network sync
+SHIFT_TIME_NS=$((${GATE_DURATION_NS} / 2))
+
+h1=${NETIFS[p1]}
+swp1=${NETIFS[p2]}
+swp2=${NETIFS[p3]}
+h2=${NETIFS[p4]}
+
+H1_IPV4="192.0.2.1"
+H2_IPV4="192.0.2.2"
+H1_IPV6="2001:db8:1::1"
+H2_IPV6="2001:db8:1::2"
+
+# Chain number exported by the ocelot driver for
+# Per-Stream Filtering and Policing filters
+PSFP()
+{
+	echo 30000
+}
+
+psfp_chain_create()
+{
+	local if_name=$1
+
+	tc qdisc add dev $if_name clsact
+
+	tc filter add dev $if_name ingress chain 0 pref 49152 flower \
+		skip_sw action goto chain $(PSFP)
+}
+
+psfp_chain_destroy()
+{
+	local if_name=$1
+
+	tc qdisc del dev $if_name clsact
+}
+
+psfp_filter_check()
+{
+	local expected=$1
+	local packets=""
+	local drops=""
+	local stats=""
+
+	stats=$(tc -j -s filter show dev ${swp1} ingress chain $(PSFP) pref 1)
+	packets=$(echo ${stats} | jq ".[1].options.actions[].stats.packets")
+	drops=$(echo ${stats} | jq ".[1].options.actions[].stats.drops")
+
+	if ! [ "${packets}" = "${expected}" ]; then
+		printf "Expected filter to match on %d packets but matched on %d instead\n" \
+			"${expected}" "${packets}"
+	fi
+
+	echo "Hardware filter reports ${drops} drops"
+}
+
+h1_create()
+{
+	simple_if_init $h1 $H1_IPV4/24 $H1_IPV6/64
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1 $H1_IPV4/24 $H1_IPV6/64
+}
+
+h2_create()
+{
+	simple_if_init $h2 $H2_IPV4/24 $H2_IPV6/64
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2 $H2_IPV4/24 $H2_IPV6/64
+}
+
+switch_create()
+{
+	local h2_mac_addr=$(mac_get $h2)
+
+	ip link set ${swp1} up
+	ip link set ${swp2} up
+
+	ip link add br0 type bridge vlan_filtering 1
+	ip link set ${swp1} master br0
+	ip link set ${swp2} master br0
+	ip link set br0 up
+
+	bridge vlan add dev ${swp2} vid ${STREAM_VID}
+	bridge vlan add dev ${swp1} vid ${STREAM_VID}
+	# PSFP on Ocelot requires the filter to also be added to the bridge
+	# FDB, and not be removed
+	bridge fdb add dev ${swp2} \
+		${h2_mac_addr} vlan ${STREAM_VID} static master
+
+	psfp_chain_create ${swp1}
+
+	tc filter add dev ${swp1} ingress chain $(PSFP) pref 1 \
+		protocol 802.1Q flower skip_sw \
+		dst_mac ${h2_mac_addr} vlan_id ${STREAM_VID} \
+		action gate base-time 0.000000000 \
+		sched-entry OPEN  ${GATE_DURATION_NS} -1 -1 \
+		sched-entry CLOSE ${GATE_DURATION_NS} -1 -1
+}
+
+switch_destroy()
+{
+	psfp_chain_destroy ${swp1}
+	ip link del br0
+}
+
+txtime_setup()
+{
+	local if_name=$1
+
+	tc qdisc add dev ${if_name} clsact
+	# Classify PTP on TC 7 and isochron on TC 6
+	tc filter add dev ${if_name} egress protocol 0x88f7 \
+		flower action skbedit priority 7
+	tc filter add dev ${if_name} egress protocol 802.1Q \
+		flower vlan_ethtype 0xdead action skbedit priority 6
+	tc qdisc add dev ${if_name} handle 100: parent root mqprio num_tc 8 \
+		queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
+		map 0 1 2 3 4 5 6 7 \
+		hw 1
+	# Set up TC 6 for SO_TXTIME. tc-mqprio queues count from 1.
+	tc qdisc replace dev ${if_name} parent 100:$((${STREAM_PRIO} + 1)) etf \
+		clockid CLOCK_TAI offload delta ${FUDGE_FACTOR}
+}
+
+txtime_cleanup()
+{
+	local if_name=$1
+
+	tc qdisc del dev ${if_name} root
+	tc qdisc del dev ${if_name} clsact
+}
+
+setup_prepare()
+{
+	vrf_prepare
+
+	h1_create
+	h2_create
+	switch_create
+
+	txtime_setup ${h1}
+
+	# Set up swp1 as a master PHC for h1, synchronized to the local
+	# CLOCK_REALTIME.
+	phc2sys_start ${swp1} ${UDS_ADDRESS_SWP1}
+
+	# Assumption true for LS1028A: h1 and h2 use the same PHC. So by
+	# synchronizing h1 to swp1 via PTP, h2 is also implicitly synchronized
+	# to swp1 (and both to CLOCK_REALTIME).
+	ptp4l_start ${h1} true ${UDS_ADDRESS_H1}
+	ptp4l_start ${swp1} false ${UDS_ADDRESS_SWP1}
+
+	# Make sure there are no filter matches at the beginning of the test
+	psfp_filter_check 0
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	ptp4l_stop ${swp1}
+	ptp4l_stop ${h1}
+	phc2sys_stop
+	isochron_recv_stop
+
+	txtime_cleanup ${h1}
+
+	h2_destroy
+	h1_destroy
+	switch_destroy
+
+	vrf_cleanup
+}
+
+debug_incorrectly_dropped_packets()
+{
+	local isochron_dat=$1
+	local dropped_seqids
+	local seqid
+
+	echo "Packets incorrectly dropped:"
+
+	dropped_seqids=$(isochron report \
+		--input-file "${isochron_dat}" \
+		--printf-format "%u RX hw %T\n" \
+		--printf-args "qR" | \
+		grep 'RX hw 0.000000000' | \
+		awk '{print $1}')
+
+	for seqid in ${dropped_seqids}; do
+		isochron report \
+			--input-file "${isochron_dat}" \
+			--start ${seqid} --stop ${seqid} \
+			--printf-format "seqid %u scheduled for %T, HW TX timestamp %T\n" \
+			--printf-args "qST"
+	done
+}
+
+debug_incorrectly_received_packets()
+{
+	local isochron_dat=$1
+
+	echo "Packets incorrectly received:"
+
+	isochron report \
+		--input-file "${isochron_dat}" \
+		--printf-format "seqid %u scheduled for %T, HW TX timestamp %T, HW RX timestamp %T\n" \
+		--printf-args "qSTR" |
+		grep -v 'HW RX timestamp 0.000000000'
+}
+
+run_test()
+{
+	local base_time=$1
+	local expected=$2
+	local test_name=$3
+	local debug=$4
+	local isochron_dat="$(mktemp)"
+	local extra_args=""
+	local received
+
+	isochron_do \
+		"${h1}" \
+		"${h2}" \
+		"${UDS_ADDRESS_H1}" \
+		"" \
+		"${base_time}" \
+		"${CYCLE_TIME_NS}" \
+		"${SHIFT_TIME_NS}" \
+		"${NUM_PKTS}" \
+		"${STREAM_VID}" \
+		"${STREAM_PRIO}" \
+		"" \
+		"${isochron_dat}"
+
+	# Count all received packets by looking at the non-zero RX timestamps
+	received=$(isochron report \
+		--input-file "${isochron_dat}" \
+		--printf-format "%u\n" --printf-args "R" | \
+		grep -w -v '0' | wc -l)
+
+	if [ "${received}" = "${expected}" ]; then
+		RET=0
+	else
+		RET=1
+		echo "Expected isochron to receive ${expected} packets but received ${received}"
+	fi
+
+	log_test "${test_name}"
+
+	if [ "$RET" = "1" ]; then
+		${debug} "${isochron_dat}"
+	fi
+
+	rm ${isochron_dat} 2> /dev/null
+}
+
+test_gate_in_band()
+{
+	# Send packets in-band with the OPEN gate entry
+	run_test 0.000000000 ${NUM_PKTS} "In band" \
+		debug_incorrectly_dropped_packets
+
+	psfp_filter_check ${NUM_PKTS}
+}
+
+test_gate_out_of_band()
+{
+	# Send packets in-band with the CLOSE gate entry
+	run_test 0.005000000 0 "Out of band" \
+		debug_incorrectly_received_packets
+
+	psfp_filter_check $((2 * ${NUM_PKTS}))
+}
+
+trap cleanup EXIT
+
+ALL_TESTS="
+	test_gate_in_band
+	test_gate_out_of_band
+"
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
diff --git a/tools/testing/selftests/net/forwarding/tsn_lib.sh b/tools/testing/selftests/net/forwarding/tsn_lib.sh
new file mode 100644
index 000000000000..60a1423e8116
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tsn_lib.sh
@@ -0,0 +1,235 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright 2021-2022 NXP
+
+REQUIRE_ISOCHRON=${REQUIRE_ISOCHRON:=yes}
+REQUIRE_LINUXPTP=${REQUIRE_LINUXPTP:=yes}
+
+# Tunables
+UTC_TAI_OFFSET=37
+ISOCHRON_CPU=1
+
+if [[ "$REQUIRE_ISOCHRON" = "yes" ]]; then
+	# https://github.com/vladimiroltean/tsn-scripts
+	# WARNING: isochron versions pre-1.0 are unstable,
+	# always use the latest version
+	require_command isochron
+fi
+if [[ "$REQUIRE_LINUXPTP" = "yes" ]]; then
+	require_command phc2sys
+	require_command ptp4l
+fi
+
+phc2sys_start()
+{
+	local if_name=$1
+	local uds_address=$2
+	local extra_args=""
+
+	if ! [ -z "${uds_address}" ]; then
+		extra_args="${extra_args} -z ${uds_address}"
+	fi
+
+	phc2sys_log="$(mktemp)"
+
+	chrt -f 10 phc2sys -m \
+		-c ${if_name} \
+		-s CLOCK_REALTIME \
+		-O ${UTC_TAI_OFFSET} \
+		--step_threshold 0.00002 \
+		--first_step_threshold 0.00002 \
+		${extra_args} \
+		> "${phc2sys_log}" 2>&1 &
+	phc2sys_pid=$!
+
+	echo "phc2sys logs to ${phc2sys_log} and has pid ${phc2sys_pid}"
+
+	sleep 1
+}
+
+phc2sys_stop()
+{
+	{ kill ${phc2sys_pid} && wait ${phc2sys_pid}; } 2> /dev/null
+	rm "${phc2sys_log}" 2> /dev/null
+}
+
+ptp4l_start()
+{
+	local if_name=$1
+	local slave_only=$2
+	local uds_address=$3
+	local log="ptp4l_log_${if_name}"
+	local pid="ptp4l_pid_${if_name}"
+	local extra_args=""
+
+	if [ "${slave_only}" = true ]; then
+		extra_args="${extra_args} -s"
+	fi
+
+	# declare dynamic variables ptp4l_log_${if_name} and ptp4l_pid_${if_name}
+	# as global, so that they can be referenced later
+	declare -g "${log}=$(mktemp)"
+
+	chrt -f 10 ptp4l -m -2 -P \
+		-i ${if_name} \
+		--step_threshold 0.00002 \
+		--first_step_threshold 0.00002 \
+		--tx_timestamp_timeout 100 \
+		--uds_address="${uds_address}" \
+		${extra_args} \
+		> "${!log}" 2>&1 &
+	declare -g "${pid}=$!"
+
+	echo "ptp4l for interface ${if_name} logs to ${!log} and has pid ${!pid}"
+
+	sleep 1
+}
+
+ptp4l_stop()
+{
+	local if_name=$1
+	local log="ptp4l_log_${if_name}"
+	local pid="ptp4l_pid_${if_name}"
+
+	{ kill ${!pid} && wait ${!pid}; } 2> /dev/null
+	rm "${!log}" 2> /dev/null
+}
+
+cpufreq_max()
+{
+	local cpu=$1
+	local freq="cpu${cpu}_freq"
+	local governor="cpu${cpu}_governor"
+
+	# Kernel may be compiled with CONFIG_CPU_FREQ disabled
+	if ! [ -d /sys/bus/cpu/devices/cpu${cpu}/cpufreq ]; then
+		return
+	fi
+
+	# declare dynamic variables cpu${cpu}_freq and cpu${cpu}_governor as
+	# global, so they can be referenced later
+	declare -g "${freq}=$(cat /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_min_freq)"
+	declare -g "${governor}=$(cat /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_governor)"
+
+	cat /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_max_freq > \
+		/sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_min_freq
+	echo -n "performance" > \
+		/sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_governor
+}
+
+cpufreq_restore()
+{
+	local cpu=$1
+	local freq="cpu${cpu}_freq"
+	local governor="cpu${cpu}_governor"
+
+	if ! [ -d /sys/bus/cpu/devices/cpu${cpu}/cpufreq ]; then
+		return
+	fi
+
+	echo "${!freq}" > /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_min_freq
+	echo -n "${!governor}" > \
+		/sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_governor
+}
+
+isochron_recv_start()
+{
+	local if_name=$1
+	local uds=$2
+	local extra_args=$3
+
+	if ! [ -z "${uds}" ]; then
+		extra_args="--unix-domain-socket ${uds}"
+	fi
+
+	isochron rcv \
+		--interface ${if_name} \
+		--sched-priority 98 \
+		--sched-fifo \
+		--utc-tai-offset ${UTC_TAI_OFFSET} \
+		--quiet \
+		${extra_args} & \
+	isochron_pid=$!
+
+	sleep 1
+}
+
+isochron_recv_stop()
+{
+	{ kill ${isochron_pid} && wait ${isochron_pid}; } 2> /dev/null
+}
+
+isochron_do()
+{
+	local sender_if_name=$1; shift
+	local receiver_if_name=$1; shift
+	local sender_uds=$1; shift
+	local receiver_uds=$1; shift
+	local base_time=$1; shift
+	local cycle_time=$1; shift
+	local shift_time=$1; shift
+	local num_pkts=$1; shift
+	local vid=$1; shift
+	local priority=$1; shift
+	local dst_ip=$1; shift
+	local isochron_dat=$1; shift
+	local extra_args=""
+	local receiver_extra_args=""
+	local vrf="$(master_name_get ${sender_if_name})"
+	local use_l2="true"
+
+	if ! [ -z "${dst_ip}" ]; then
+		use_l2="false"
+	fi
+
+	if ! [ -z "${vrf}" ]; then
+		dst_ip="${dst_ip}%${vrf}"
+	fi
+
+	if ! [ -z "${vid}" ]; then
+		vid="--vid=${vid}"
+	fi
+
+	if [ -z "${receiver_uds}" ]; then
+		extra_args="${extra_args} --omit-remote-sync"
+	fi
+
+	if ! [ -z "${shift_time}" ]; then
+		extra_args="${extra_args} --shift-time=${shift_time}"
+	fi
+
+	if [ "${use_l2}" = "true" ]; then
+		extra_args="${extra_args} --l2 --etype=0xdead ${vid}"
+		receiver_extra_args="--l2 --etype=0xdead"
+	else
+		extra_args="${extra_args} --l4 --ip-destination=${dst_ip}"
+		receiver_extra_args="--l4"
+	fi
+
+	cpufreq_max ${ISOCHRON_CPU}
+
+	isochron_recv_start "${h2}" "${receiver_uds}" "${receiver_extra_args}"
+
+	isochron send \
+		--interface ${sender_if_name} \
+		--unix-domain-socket ${sender_uds} \
+		--priority ${priority} \
+		--base-time ${base_time} \
+		--cycle-time ${cycle_time} \
+		--num-frames ${num_pkts} \
+		--frame-size 64 \
+		--txtime \
+		--utc-tai-offset ${UTC_TAI_OFFSET} \
+		--cpu-mask $((1 << ${ISOCHRON_CPU})) \
+		--sched-fifo \
+		--sched-priority 98 \
+		--client 127.0.0.1 \
+		--sync-threshold 5000 \
+		--output-file ${isochron_dat} \
+		${extra_args} \
+		--quiet
+
+	isochron_recv_stop
+
+	cpufreq_restore ${ISOCHRON_CPU}
+}