diff mbox series

[net-next,v9,2/2] net: ti: icssg_prueth: add TAPRIO offload support

Message ID 20240531044512.981587-3-danishanwar@ti.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series Add TAPRIO offload support for ICSSG driver | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 908 this patch: 908
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 13 of 13 maintainers
netdev/build_clang success Errors and warnings before: 906 this patch: 906
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 913 this patch: 913
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 102 exceeds 80 columns WARNING: line length of 111 exceeds 80 columns WARNING: line length of 114 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

MD Danish Anwar May 31, 2024, 4:45 a.m. UTC
From: Roger Quadros <rogerq@ti.com>

The ICSSG dual-emac / switch firmware supports Enhanced Scheduled Traffic
(EST – defined in P802.1Qbv/D2.2 that later got included in IEEE
802.1Q-2018) configuration. EST allows express queue traffic to be
scheduled (placed) on the wire at specific repeatable time intervals. In
Linux kernel, EST configuration is done through tc command and the taprio
scheduler in the net core implements a software only scheduler
(SCH_TAPRIO). If the NIC is capable of EST configuration,user indicate
"flag 2" in the command which is then parsed by taprio scheduler in net
core and indicate that the command is to be offloaded to h/w. taprio then
offloads the command to the driver by calling ndo_setup_tc() ndo ops. This
patch implements ndo_setup_tc() to offload EST configuration to ICSSG.

Signed-off-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
---
 drivers/net/ethernet/ti/Kconfig              |   1 +
 drivers/net/ethernet/ti/Makefile             |   1 +
 drivers/net/ethernet/ti/icssg/icssg_prueth.c |   3 +
 drivers/net/ethernet/ti/icssg/icssg_prueth.h |   2 +
 drivers/net/ethernet/ti/icssg/icssg_qos.c    | 288 +++++++++++++++++++
 drivers/net/ethernet/ti/icssg/icssg_qos.h    | 113 ++++++++
 6 files changed, 408 insertions(+)
 create mode 100644 drivers/net/ethernet/ti/icssg/icssg_qos.c
 create mode 100644 drivers/net/ethernet/ti/icssg/icssg_qos.h

Comments

Vladimir Oltean May 31, 2024, 1:51 p.m. UTC | #1
On Fri, May 31, 2024 at 10:15:12AM +0530, MD Danish Anwar wrote:
> From: Roger Quadros <rogerq@ti.com>
> 
> The ICSSG dual-emac / switch firmware supports Enhanced Scheduled Traffic
> (EST – defined in P802.1Qbv/D2.2 that later got included in IEEE
> 802.1Q-2018) configuration. EST allows express queue traffic to be
> scheduled (placed) on the wire at specific repeatable time intervals. In
> Linux kernel, EST configuration is done through tc command and the taprio
> scheduler in the net core implements a software only scheduler
> (SCH_TAPRIO). If the NIC is capable of EST configuration,user indicate
> "flag 2" in the command which is then parsed by taprio scheduler in net
> core and indicate that the command is to be offloaded to h/w. taprio then
> offloads the command to the driver by calling ndo_setup_tc() ndo ops. This
> patch implements ndo_setup_tc() to offload EST configuration to ICSSG.

This is all a lot of verbiage about generic tc-taprio and nothing
concrete about the ICSSG implementation... It is useless, sorry.

> 
> Signed-off-by: Roger Quadros <rogerq@ti.com>
> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
> Reviewed-by: Simon Horman <horms@kernel.org>
> Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
> ---
>  drivers/net/ethernet/ti/Kconfig              |   1 +
>  drivers/net/ethernet/ti/Makefile             |   1 +
>  drivers/net/ethernet/ti/icssg/icssg_prueth.c |   3 +
>  drivers/net/ethernet/ti/icssg/icssg_prueth.h |   2 +
>  drivers/net/ethernet/ti/icssg/icssg_qos.c    | 288 +++++++++++++++++++
>  drivers/net/ethernet/ti/icssg/icssg_qos.h    | 113 ++++++++
>  6 files changed, 408 insertions(+)
>  create mode 100644 drivers/net/ethernet/ti/icssg/icssg_qos.c
>  create mode 100644 drivers/net/ethernet/ti/icssg/icssg_qos.h
> 
> diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
> index f160a3b71499..6deac9035610 100644
> --- a/drivers/net/ethernet/ti/Kconfig
> +++ b/drivers/net/ethernet/ti/Kconfig
> @@ -190,6 +190,7 @@ config TI_ICSSG_PRUETH
>  	depends on PRU_REMOTEPROC
>  	depends on ARCH_K3 && OF && TI_K3_UDMA_GLUE_LAYER
>  	depends on PTP_1588_CLOCK_OPTIONAL
> +	depends on NET_SCH_TAPRIO

I think the pattern should be "depends on NET_SCH_TAPRIO || NET_SCH_TAPRIO=n".
What you want is to be built as a module when taprio is a module,
because you use symbols exported by it (taprio_offload_get(), taprio_offload_free()).

>  	help
>  	  Support dual Gigabit Ethernet ports over the ICSSG PRU Subsystem.
>  	  This subsystem is available starting with the AM65 platform.
> diff --git a/drivers/net/ethernet/ti/Makefile b/drivers/net/ethernet/ti/Makefile
> index 59cd20a38267..0a86311bd170 100644
> --- a/drivers/net/ethernet/ti/Makefile
> +++ b/drivers/net/ethernet/ti/Makefile
> @@ -35,6 +35,7 @@ obj-$(CONFIG_TI_ICSSG_PRUETH) += icssg-prueth.o
>  icssg-prueth-y := icssg/icssg_prueth.o \
>  		  icssg/icssg_common.o \
>  		  icssg/icssg_classifier.o \
> +		  icssg/icssg_qos.o \
>  		  icssg/icssg_queues.o \
>  		  icssg/icssg_config.o \
>  		  icssg/icssg_mii_cfg.o \
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
> index d159bdf7dd9d..8982ecb8a43d 100644
> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
> @@ -706,6 +706,7 @@ static const struct net_device_ops emac_netdev_ops = {
>  	.ndo_eth_ioctl = emac_ndo_ioctl,
>  	.ndo_get_stats64 = emac_ndo_get_stats64,
>  	.ndo_get_phys_port_name = emac_ndo_get_phys_port_name,
> +	.ndo_setup_tc = icssg_qos_ndo_setup_tc,
>  };
>  
>  static int prueth_netdev_init(struct prueth *prueth,
> @@ -840,6 +841,8 @@ static int prueth_netdev_init(struct prueth *prueth,
>  	emac->rx_hrtimer.function = &emac_rx_timer_callback;
>  	prueth->emac[mac] = emac;
>  
> +	icssg_qos_tas_init(ndev);
> +
>  	return 0;
>  
>  free:
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
> index fab2428de78b..c6851546e6c5 100644
> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
> @@ -37,6 +37,7 @@
>  #include "icssg_config.h"
>  #include "icss_iep.h"
>  #include "icssg_switch_map.h"
> +#include "icssg_qos.h"
>  
>  #define PRUETH_MAX_MTU          (2000 - ETH_HLEN - ETH_FCS_LEN)
>  #define PRUETH_MIN_PKT_SIZE     (VLAN_ETH_ZLEN)
> @@ -195,6 +196,7 @@ struct prueth_emac {
>  	/* RX IRQ Coalescing Related */
>  	struct hrtimer rx_hrtimer;
>  	unsigned long rx_pace_timeout_ns;
> +	struct prueth_qos qos;
>  };
>  
>  /**
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.c b/drivers/net/ethernet/ti/icssg/icssg_qos.c
> new file mode 100644
> index 000000000000..5e93b1b9ca43
> --- /dev/null
> +++ b/drivers/net/ethernet/ti/icssg/icssg_qos.c
> @@ -0,0 +1,288 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Texas Instruments ICSSG PRUETH QoS submodule
> + * Copyright (C) 2023 Texas Instruments Incorporated - http://www.ti.com/
> + */
> +
> +#include <linux/printk.h>
> +#include "icssg_prueth.h"
> +#include "icssg_switch_map.h"
> +
> +static void tas_update_fw_list_pointers(struct prueth_emac *emac)
> +{
> +	struct tas_config *tas = &emac->qos.tas.config;
> +
> +	if ((readb(tas->active_list)) == TAS_LIST0) {

Who and when updates tas->active_list from TAS_LIST0 to TAS_LIST1?

> +		tas->fw_active_list = emac->dram.va + TAS_GATE_MASK_LIST0;
> +		tas->fw_shadow_list = emac->dram.va + TAS_GATE_MASK_LIST1;
> +	} else {
> +		tas->fw_active_list = emac->dram.va + TAS_GATE_MASK_LIST1;
> +		tas->fw_shadow_list = emac->dram.va + TAS_GATE_MASK_LIST0;
> +	}
> +}
> +
> +static void tas_update_maxsdu_table(struct prueth_emac *emac)
> +{
> +	struct tas_config *tas = &emac->qos.tas.config;
> +	u16 __iomem *max_sdu_tbl_ptr;
> +	u8 gate_idx;
> +
> +	/* update the maxsdu table */
> +	max_sdu_tbl_ptr = emac->dram.va + TAS_QUEUE_MAX_SDU_LIST;
> +
> +	for (gate_idx = 0; gate_idx < TAS_MAX_NUM_QUEUES; gate_idx++)
> +		writew(tas->max_sdu_table.max_sdu[gate_idx], &max_sdu_tbl_ptr[gate_idx]);
> +}
> +
> +static void tas_reset(struct prueth_emac *emac)
> +{
> +	struct tas_config *tas = &emac->qos.tas.config;
> +	int i;
> +
> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
> +		tas->max_sdu_table.max_sdu[i] = 2048;

Macro + short comment for the magic number, please.

> +
> +	tas_update_maxsdu_table(emac);
> +
> +	writeb(TAS_LIST0, tas->active_list);
> +
> +	memset_io(tas->fw_active_list, 0, sizeof(*tas->fw_active_list));
> +	memset_io(tas->fw_shadow_list, 0, sizeof(*tas->fw_shadow_list));
> +}
> +
> +static int tas_set_state(struct prueth_emac *emac, enum tas_state state)
> +{
> +	struct tas_config *tas = &emac->qos.tas.config;
> +	int ret;
> +
> +	if (tas->state == state)
> +		return 0;
> +
> +	switch (state) {
> +	case TAS_STATE_RESET:
> +		tas_reset(emac);
> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_RESET);
> +		tas->state = TAS_STATE_RESET;
> +		break;
> +	case TAS_STATE_ENABLE:
> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
> +		tas->state = TAS_STATE_ENABLE;
> +		break;
> +	case TAS_STATE_DISABLE:
> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_DISABLE);
> +		tas->state = TAS_STATE_DISABLE;

This can be expressed as just "tas->state = state" outside the switch statement.
But probably shouldn't be, if "ret != 0".

> +		break;
> +	default:
> +		netdev_err(emac->ndev, "%s: unsupported state\n", __func__);

There are two levels of logging for this error, and this particular one
isn't useful. We can infer it went through the "default" case when the
printk below returned -EINVAL, because if that -EINVAL came from
emac_set_port_state(), that would have printed, in turn, "invalid port command".

I don't think that a "default" case is needed here, as long as all enum
values are handled, and the input is sanitized everywhere (which it is).

> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	if (ret)
> +		netdev_err(emac->ndev, "TAS set state failed %d\n", ret);

FWIW, emac_set_port_state() has its own logging. I don't necessarily see
the need for this print.

> +	return ret;
> +}
> +
> +static int tas_set_trigger_list_change(struct prueth_emac *emac)
> +{
> +	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
> +	struct tas_config *tas = &emac->qos.tas.config;
> +	struct ptp_system_timestamp sts;
> +	u32 change_cycle_count;
> +	u32 cycle_time;
> +	u64 base_time;
> +	u64 cur_time;
> +
> +	/* IEP clock has a hardware errata due to which it wraps around exactly
> +	 * once every taprio cycle. To compensate for that, adjust cycle time
> +	 * by the wrap around time which is stored in emac->iep->def_inc
> +	 */
> +	cycle_time = admin_list->cycle_time - emac->iep->def_inc;
> +	base_time = admin_list->base_time;
> +	cur_time = prueth_iep_gettime(emac, &sts);
> +
> +	if (base_time > cur_time)
> +		change_cycle_count = DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time);
> +	else
> +		change_cycle_count = 1;
> +
> +	writel(cycle_time, emac->dram.va + TAS_ADMIN_CYCLE_TIME);
> +	writel(change_cycle_count, emac->dram.va + TAS_CONFIG_CHANGE_CYCLE_COUNT);
> +	writeb(admin_list->num_entries, emac->dram.va + TAS_ADMIN_LIST_LENGTH);
> +
> +	/* config_change cleared by f/w to ack reception of new shadow list */
> +	writeb(1, &tas->config_list->config_change);
> +	/* config_pending cleared by f/w when new shadow list is copied to active list */
> +	writeb(1, &tas->config_list->config_pending);
> +
> +	return emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);

The call path here is:

emac_taprio_replace()
-> tas_update_oper_list()
   -> tas_set_trigger_list_change()
      -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
-> tas_set_state(emac, TAS_STATE_ENABLE);
   -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);

I'm surprised by the calls to emac_set_port_state() in such a quick
succession? Is there any firmware requirement for how much should the
port stay in the TAS_TRIGGER state? Or is it not really a state, despite
it being an argument to a function named emac_set_port_state()?

> +}

There's something extremely elementary about this function which I still
don't understand.

When does the schedule actually _start_? Can that be controlled by the
driver with the high (nanosecond) precision necessary in order for the
ICSSG to synchronize with the schedule of other equipment in the LAN?

You never pass the base time per se to the firmware. Just a number of
cycles from now. I guess that number of cycles decides when the schedule
starts, but what are those cycles relative to?

> +
> +static int tas_update_oper_list(struct prueth_emac *emac)
> +{
> +	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
> +	struct tas_config *tas = &emac->qos.tas.config;
> +	u32 tas_acc_gate_close_time = 0;
> +	u8 idx, gate_idx, val;
> +	int ret;
> +
> +	if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)
> +		return -EINVAL;
> +
> +	tas_update_fw_list_pointers(emac);
> +
> +	for (idx = 0; idx < admin_list->num_entries; idx++) {
> +		writeb(admin_list->entries[idx].gate_mask,
> +		       &tas->fw_shadow_list->gate_mask_list[idx]);
> +		tas_acc_gate_close_time += admin_list->entries[idx].interval;
> +
> +		/* extend last entry till end of cycle time */
> +		if (idx == admin_list->num_entries - 1)
> +			writel(admin_list->cycle_time,
> +			       &tas->fw_shadow_list->win_end_time_list[idx]);
> +		else
> +			writel(tas_acc_gate_close_time,
> +			       &tas->fw_shadow_list->win_end_time_list[idx]);
> +	}
> +
> +	/* clear remaining entries */
> +	for (idx = admin_list->num_entries; idx < TAS_MAX_CMD_LISTS; idx++) {
> +		writeb(0, &tas->fw_shadow_list->gate_mask_list[idx]);
> +		writel(0, &tas->fw_shadow_list->win_end_time_list[idx]);
> +	}
> +
> +	/* update the Array of gate close time for each queue in each window */
> +	for (idx = 0 ; idx < admin_list->num_entries; idx++) {
> +		/* On Linux, only PRUETH_MAX_TX_QUEUES are supported per port */
> +		for (gate_idx = 0; gate_idx < PRUETH_MAX_TX_QUEUES; gate_idx++) {
> +			u8 gate_mask_list_idx = readb(&tas->fw_shadow_list->gate_mask_list[idx]);
> +			u32 gate_close_time = 0;
> +
> +			if (gate_mask_list_idx & BIT(gate_idx))
> +				gate_close_time = readl(&tas->fw_shadow_list->win_end_time_list[idx]);
> +
> +			writel(gate_close_time,
> +			       &tas->fw_shadow_list->gate_close_time_list[idx][gate_idx]);
> +		}

An implementation which operates per TX queues rather than per traffic
classes should report caps->gate_mask_per_txq = true in TC_QUERY_CAPS.

> +	}
> +
> +	/* tell f/w to swap active & shadow list */
> +	ret = tas_set_trigger_list_change(emac);
> +	if (ret) {
> +		netdev_err(emac->ndev, "failed to swap f/w config list: %d\n", ret);
> +		return ret;
> +	}
> +
> +	/* Wait for completion */
> +	ret = readb_poll_timeout(&tas->config_list->config_change, val, !val,
> +				 USEC_PER_MSEC, 10 * USEC_PER_MSEC);
> +	if (ret) {
> +		netdev_err(emac->ndev, "TAS list change completion time out\n");
> +		return ret;
> +	}
> +
> +	tas_update_fw_list_pointers(emac);

Calling this twice in the same function? Explanation?

> +
> +	return 0;
> +}
> +
> +static int emac_taprio_replace(struct net_device *ndev,
> +			       struct tc_taprio_qopt_offload *taprio)
> +{
> +	struct prueth_emac *emac = netdev_priv(ndev);
> +	int ret;
> +
> +	if (taprio->cycle_time_extension) {
> +		NL_SET_ERR_MSG_MOD(taprio->extack, "Cycle time extension not supported");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (taprio->cycle_time < TAS_MIN_CYCLE_TIME) {
> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "cycle_time %llu is less than min supported cycle_time %d",
> +				       taprio->cycle_time, TAS_MIN_CYCLE_TIME);
> +		return -EINVAL;
> +	}
> +
> +	if (taprio->num_entries > TAS_MAX_CMD_LISTS) {
> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "num_entries %lu is more than max supported entries %d",
> +				       taprio->num_entries, TAS_MAX_CMD_LISTS);
> +		return -EINVAL;
> +	}
> +
> +	if (emac->qos.tas.taprio_admin)
> +		taprio_offload_free(emac->qos.tas.taprio_admin);
> +
> +	emac->qos.tas.taprio_admin = taprio_offload_get(taprio);
> +	ret = tas_update_oper_list(emac);

If this fails and there was a previous emac->qos.tas.taprio_admin
schedule present, you just broke it. In particular, the
"if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)" bounds check really
doesn't belong there; it should have been done much earlier, to avoid a
complete offload breakage for such a silly thing (replacing a working
taprio schedule with a new one that has too large cycle breaks the old
schedule).

> +	if (ret)
> +		goto clear_taprio;
> +
> +	ret = tas_set_state(emac, TAS_STATE_ENABLE);
> +	if (ret)
> +		goto clear_taprio;
> +
> +clear_taprio:
> +	emac->qos.tas.taprio_admin = NULL;
> +	taprio_offload_free(taprio);
> +
> +	return ret;
> +}
> +
> +static int emac_taprio_destroy(struct net_device *ndev,
> +			       struct tc_taprio_qopt_offload *taprio)
> +{
> +	struct prueth_emac *emac = netdev_priv(ndev);
> +	int ret;
> +
> +	taprio_offload_free(taprio);
> +
> +	ret = tas_set_state(emac, TAS_STATE_RESET);
> +	if (ret)
> +		return ret;
> +
> +	return tas_set_state(emac, TAS_STATE_DISABLE);

Again, any timing requirements for the state transitions? Why not
directly do TAS_STATE_DISABLE? It's not very clear what they do
different, despite of the attempt to document these firmware states.

> +}
> +
> +static int emac_setup_taprio(struct net_device *ndev, void *type_data)
> +{
> +	struct tc_taprio_qopt_offload *taprio = type_data;
> +	int ret;
> +
> +	switch (taprio->cmd) {
> +	case TAPRIO_CMD_REPLACE:
> +		ret = emac_taprio_replace(ndev, taprio);
> +		break;
> +	case TAPRIO_CMD_DESTROY:
> +		ret = emac_taprio_destroy(ndev, taprio);
> +		break;
> +	default:
> +		ret = -EOPNOTSUPP;
> +	}
> +
> +	return ret;
> +}
> +
> +int icssg_qos_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
> +			   void *type_data)
> +{
> +	switch (type) {
> +	case TC_SETUP_QDISC_TAPRIO:
> +		return emac_setup_taprio(ndev, type_data);
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +void icssg_qos_tas_init(struct net_device *ndev)
> +{
> +	struct prueth_emac *emac = netdev_priv(ndev);
> +	struct tas_config *tas;
> +
> +	tas = &emac->qos.tas.config;
> +
> +	tas->config_list = emac->dram.va + TAS_CONFIG_CHANGE_TIME;
> +	tas->active_list = emac->dram.va + TAS_ACTIVE_LIST_INDEX;
> +
> +	tas_update_fw_list_pointers(emac);
> +
> +	tas_set_state(emac, TAS_STATE_RESET);

Why leave it in TAS_STATE_RESET and not TAS_STATE_DISABLE? The firmware
state at probe time is not idempotent with its state after a
emac_taprio_replace() -> emac_taprio_destroy() sequence, which is not
good.

> +}
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.h b/drivers/net/ethernet/ti/icssg/icssg_qos.h
> new file mode 100644
> index 000000000000..25baccdd1ce5
> --- /dev/null
> +++ b/drivers/net/ethernet/ti/icssg/icssg_qos.h
> @@ -0,0 +1,113 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2023 Texas Instruments Incorporated - http://www.ti.com/
> + */
> +
> +#ifndef __NET_TI_ICSSG_QOS_H
> +#define __NET_TI_ICSSG_QOS_H
> +
> +#include <linux/atomic.h>
> +#include <linux/netdevice.h>
> +#include <net/pkt_sched.h>
> +
> +/* Maximum number of gate command entries in each list. */
> +#define TAS_MAX_CMD_LISTS   (16)
> +
> +/* Maximum number of transmit queues supported by implementation */
> +#define TAS_MAX_NUM_QUEUES  (8)
> +
> +/* Minimum cycle time supported by implementation (in ns) */
> +#define TAS_MIN_CYCLE_TIME  (1000000)
> +
> +/* Minimum cycle time supported by implementation (in ns) */
> +#define TAS_MAX_CYCLE_TIME  (4000000000)
> +
> +/* Minimum TAS window duration supported by implementation (in ns) */
> +#define TAS_MIN_WINDOW_DURATION  (10000)
> +
> +/**
> + * enum tas_list_num - TAS list number
> + * @TAS_LIST0: TAS list number is 0
> + * @TAS_LIST1: TAS list number is 1
> + */
> +enum tas_list_num {
> +	TAS_LIST0 = 0,
> +	TAS_LIST1 = 1
> +};
> +
> +/**
> + * enum tas_state - State of TAS in firmware
> + * @TAS_STATE_DISABLE: TAS state machine is disabled.
> + * @TAS_STATE_ENABLE: TAS state machine is enabled.
> + * @TAS_STATE_RESET: TAS state machine is reset.
> + */
> +enum tas_state {
> +	TAS_STATE_DISABLE = 0,
> +	TAS_STATE_ENABLE = 1,
> +	TAS_STATE_RESET = 2,
> +};
> +
> +/**
> + * struct tas_config_list - Config state machine variables
> + * @config_change_time: New list is copied at this time
> + * @config_change_error_counter: Incremented if admin->BaseTime < current time
> + *				 and TAS_enabled is true
> + * @config_pending: True if list update is pending
> + * @config_change: Set to true when application trigger updating of admin list
> + *		   to active list, cleared when configChangeTime is updated
> + */
> +struct tas_config_list {
> +	u64 config_change_time;
> +	u32 config_change_error_counter;
> +	u8 config_pending;
> +	u8 config_change;
> +};

Should be __packed since it maps over a firmware-defined __iomem memory
region. The compiler is not free to pad as it wishes.

> +
> +/* Max SDU table. See IEEE Std 802.1Q-2018 12.29.1.1 */
> +struct tas_max_sdu_table {
> +	u16 max_sdu[TAS_MAX_NUM_QUEUES];
> +};
> +
> +/**
> + * struct tas_firmware_list - TAS List Structure based on firmware memory map
> + * @gate_mask_list: Window gate mask list
> + * @win_end_time_list: Window end time list
> + * @gate_close_time_list: Array of gate close time for each queue in each window
> + */
> +struct tas_firmware_list {
> +	u8 gate_mask_list[TAS_MAX_CMD_LISTS];
> +	u32 win_end_time_list[TAS_MAX_CMD_LISTS];
> +	u32 gate_close_time_list[TAS_MAX_CMD_LISTS][TAS_MAX_NUM_QUEUES];
> +};

Should be __packed.

> +
> +/**
> + * struct tas_config - Main Time Aware Shaper Handle
> + * @state: TAS state
> + * @max_sdu_table: Max SDU table
> + * @config_list: Config change variables
> + * @active_list: Current operating list operating list
> + * @fw_active_list: Active List pointer, used by firmware
> + * @fw_shadow_list: Shadow List pointer, used by driver
> + */
> +struct tas_config {
> +	enum tas_state state;
> +	struct tas_max_sdu_table max_sdu_table;
> +	struct tas_config_list __iomem *config_list;
> +	u8 __iomem *active_list;
> +	struct tas_firmware_list __iomem *fw_active_list;
> +	struct tas_firmware_list __iomem *fw_shadow_list;
> +};
> +
> +struct prueth_qos_tas {
> +	struct tc_taprio_qopt_offload *taprio_admin;
> +	struct tc_taprio_qopt_offload *taprio_oper;

"taprio_oper" is unused.

> +	struct tas_config config;
> +};
> +
> +struct prueth_qos {
> +	struct prueth_qos_tas tas;
> +};
> +
> +void icssg_qos_tas_init(struct net_device *ndev);
> +int icssg_qos_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
> +			   void *type_data);
> +#endif /* __NET_TI_ICSSG_QOS_H */
> -- 
> 2.34.1
>
MD Danish Anwar June 3, 2024, 12:12 p.m. UTC | #2
Hi Vladimir,

On 31/05/24 7:21 pm, Vladimir Oltean wrote:
> On Fri, May 31, 2024 at 10:15:12AM +0530, MD Danish Anwar wrote:
>> From: Roger Quadros <rogerq@ti.com>
>>
>> The ICSSG dual-emac / switch firmware supports Enhanced Scheduled Traffic
>> (EST – defined in P802.1Qbv/D2.2 that later got included in IEEE
>> 802.1Q-2018) configuration. EST allows express queue traffic to be
>> scheduled (placed) on the wire at specific repeatable time intervals. In
>> Linux kernel, EST configuration is done through tc command and the taprio
>> scheduler in the net core implements a software only scheduler
>> (SCH_TAPRIO). If the NIC is capable of EST configuration,user indicate
>> "flag 2" in the command which is then parsed by taprio scheduler in net
>> core and indicate that the command is to be offloaded to h/w. taprio then
>> offloads the command to the driver by calling ndo_setup_tc() ndo ops. This
>> patch implements ndo_setup_tc() to offload EST configuration to ICSSG.
> 
> This is all a lot of verbiage about generic tc-taprio and nothing
> concrete about the ICSSG implementation... It is useless, sorry.
> 

I will try to add more information about ICSSG and remove generic
tc-taprio details.

>>
>> Signed-off-by: Roger Quadros <rogerq@ti.com>
>> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
>> Reviewed-by: Simon Horman <horms@kernel.org>
>> Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
>> ---
>>  drivers/net/ethernet/ti/Kconfig              |   1 +
>>  drivers/net/ethernet/ti/Makefile             |   1 +
>>  drivers/net/ethernet/ti/icssg/icssg_prueth.c |   3 +
>>  drivers/net/ethernet/ti/icssg/icssg_prueth.h |   2 +
>>  drivers/net/ethernet/ti/icssg/icssg_qos.c    | 288 +++++++++++++++++++
>>  drivers/net/ethernet/ti/icssg/icssg_qos.h    | 113 ++++++++
>>  6 files changed, 408 insertions(+)
>>  create mode 100644 drivers/net/ethernet/ti/icssg/icssg_qos.c
>>  create mode 100644 drivers/net/ethernet/ti/icssg/icssg_qos.h
>>
>> diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
>> index f160a3b71499..6deac9035610 100644
>> --- a/drivers/net/ethernet/ti/Kconfig
>> +++ b/drivers/net/ethernet/ti/Kconfig
>> @@ -190,6 +190,7 @@ config TI_ICSSG_PRUETH
>>  	depends on PRU_REMOTEPROC
>>  	depends on ARCH_K3 && OF && TI_K3_UDMA_GLUE_LAYER
>>  	depends on PTP_1588_CLOCK_OPTIONAL
>> +	depends on NET_SCH_TAPRIO
> 
> I think the pattern should be "depends on NET_SCH_TAPRIO || NET_SCH_TAPRIO=n".
> What you want is to be built as a module when taprio is a module,
> because you use symbols exported by it (taprio_offload_get(), taprio_offload_free()).
> 

Sure.

>>  	help
>>  	  Support dual Gigabit Ethernet ports over the ICSSG PRU Subsystem.
>>  	  This subsystem is available starting with the AM65 platform.
>> diff --git a/drivers/net/ethernet/ti/Makefile b/drivers/net/ethernet/ti/Makefile
>> index 59cd20a38267..0a86311bd170 100644
>> --- a/drivers/net/ethernet/ti/Makefile
>> +++ b/drivers/net/ethernet/ti/Makefile
>> @@ -35,6 +35,7 @@ obj-$(CONFIG_TI_ICSSG_PRUETH) += icssg-prueth.o
>>  icssg-prueth-y := icssg/icssg_prueth.o \
>>  		  icssg/icssg_common.o \
>>  		  icssg/icssg_classifier.o \
>> +		  icssg/icssg_qos.o \
>>  		  icssg/icssg_queues.o \
>>  		  icssg/icssg_config.o \
>>  		  icssg/icssg_mii_cfg.o \
>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
>> index d159bdf7dd9d..8982ecb8a43d 100644
>> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
>> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
>> @@ -706,6 +706,7 @@ static const struct net_device_ops emac_netdev_ops = {
>>  	.ndo_eth_ioctl = emac_ndo_ioctl,
>>  	.ndo_get_stats64 = emac_ndo_get_stats64,
>>  	.ndo_get_phys_port_name = emac_ndo_get_phys_port_name,
>> +	.ndo_setup_tc = icssg_qos_ndo_setup_tc,
>>  };
>>  
>>  static int prueth_netdev_init(struct prueth *prueth,
>> @@ -840,6 +841,8 @@ static int prueth_netdev_init(struct prueth *prueth,
>>  	emac->rx_hrtimer.function = &emac_rx_timer_callback;
>>  	prueth->emac[mac] = emac;
>>  
>> +	icssg_qos_tas_init(ndev);
>> +
>>  	return 0;
>>  
>>  free:
>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>> index fab2428de78b..c6851546e6c5 100644
>> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
>> @@ -37,6 +37,7 @@
>>  #include "icssg_config.h"
>>  #include "icss_iep.h"
>>  #include "icssg_switch_map.h"
>> +#include "icssg_qos.h"
>>  
>>  #define PRUETH_MAX_MTU          (2000 - ETH_HLEN - ETH_FCS_LEN)
>>  #define PRUETH_MIN_PKT_SIZE     (VLAN_ETH_ZLEN)
>> @@ -195,6 +196,7 @@ struct prueth_emac {
>>  	/* RX IRQ Coalescing Related */
>>  	struct hrtimer rx_hrtimer;
>>  	unsigned long rx_pace_timeout_ns;
>> +	struct prueth_qos qos;
>>  };
>>  
>>  /**
>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.c b/drivers/net/ethernet/ti/icssg/icssg_qos.c
>> new file mode 100644
>> index 000000000000..5e93b1b9ca43
>> --- /dev/null
>> +++ b/drivers/net/ethernet/ti/icssg/icssg_qos.c
>> @@ -0,0 +1,288 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Texas Instruments ICSSG PRUETH QoS submodule
>> + * Copyright (C) 2023 Texas Instruments Incorporated - http://www.ti.com/
>> + */
>> +
>> +#include <linux/printk.h>
>> +#include "icssg_prueth.h"
>> +#include "icssg_switch_map.h"
>> +
>> +static void tas_update_fw_list_pointers(struct prueth_emac *emac)
>> +{
>> +	struct tas_config *tas = &emac->qos.tas.config;
>> +
>> +	if ((readb(tas->active_list)) == TAS_LIST0) {
> 
> Who and when updates tas->active_list from TAS_LIST0 to TAS_LIST1?
>

->emac_taprio_replace()
	-> tas_update_oper_list()
		-> tas_set_trigger_list_change()

This API send a r30 command to firmware to trigger the list change
`emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);`

This once firmware recives this command, it swaps the active and shadow
list.

emac_taprio_replace() calls tas_update_oper_list()

In tas_update_oper_list() in the beginning active_list is 0 i.e.
TAS_LIST0, tas_update_fw_list_pointers() is called which configures the
active and shadow list pointers. TAS_LIST0 becomes the active_list and
TAS_LIST1 becomes the shadow list.

Let's say before this API was called, active_list is TAS_LIST0 (0) and
shadow_list is TAS_LIST1.

After getting the shadow_list we fill three different arrays,
1. gate_mask_list[]
2. win_end_time_list[]
3. gate_close_time_list[][] - 2D array with size = num_entries * num_queues

Driver only updates the shadow_list. Once shadow list is filled, we call
tas_set_trigger_list_change() and ask firmware to change the active
list. Now the shadow_list that we had filled (TAS_LIST1) will become
active list and vice versa. We will again update our pointers

This is how list is changed by calling tas_update_fw_list_pointers.

>> +		tas->fw_active_list = emac->dram.va + TAS_GATE_MASK_LIST0;
>> +		tas->fw_shadow_list = emac->dram.va + TAS_GATE_MASK_LIST1;
>> +	} else {
>> +		tas->fw_active_list = emac->dram.va + TAS_GATE_MASK_LIST1;
>> +		tas->fw_shadow_list = emac->dram.va + TAS_GATE_MASK_LIST0;
>> +	}
>> +}
>> +
>> +static void tas_update_maxsdu_table(struct prueth_emac *emac)
>> +{
>> +	struct tas_config *tas = &emac->qos.tas.config;
>> +	u16 __iomem *max_sdu_tbl_ptr;
>> +	u8 gate_idx;
>> +
>> +	/* update the maxsdu table */
>> +	max_sdu_tbl_ptr = emac->dram.va + TAS_QUEUE_MAX_SDU_LIST;
>> +
>> +	for (gate_idx = 0; gate_idx < TAS_MAX_NUM_QUEUES; gate_idx++)
>> +		writew(tas->max_sdu_table.max_sdu[gate_idx], &max_sdu_tbl_ptr[gate_idx]);
>> +}
>> +
>> +static void tas_reset(struct prueth_emac *emac)
>> +{
>> +	struct tas_config *tas = &emac->qos.tas.config;
>> +	int i;
>> +
>> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
>> +		tas->max_sdu_table.max_sdu[i] = 2048;
> 
> Macro + short comment for the magic number, please.
> 

Sure I will add it. Each elements in this array is a 2 byte value
showing the maximum length of frame to be allowed through each gate.

>> +
>> +	tas_update_maxsdu_table(emac);
>> +
>> +	writeb(TAS_LIST0, tas->active_list);
>> +
>> +	memset_io(tas->fw_active_list, 0, sizeof(*tas->fw_active_list));
>> +	memset_io(tas->fw_shadow_list, 0, sizeof(*tas->fw_shadow_list));
>> +}
>> +
>> +static int tas_set_state(struct prueth_emac *emac, enum tas_state state)
>> +{
>> +	struct tas_config *tas = &emac->qos.tas.config;
>> +	int ret;
>> +
>> +	if (tas->state == state)
>> +		return 0;
>> +
>> +	switch (state) {
>> +	case TAS_STATE_RESET:
>> +		tas_reset(emac);
>> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_RESET);
>> +		tas->state = TAS_STATE_RESET;
>> +		break;
>> +	case TAS_STATE_ENABLE:
>> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
>> +		tas->state = TAS_STATE_ENABLE;
>> +		break;
>> +	case TAS_STATE_DISABLE:
>> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_DISABLE);
>> +		tas->state = TAS_STATE_DISABLE;
> 
> This can be expressed as just "tas->state = state" outside the switch statement.
> But probably shouldn't be, if "ret != 0".

Yes we shouldn't do that as we are sending the r30 command to firmware
in each case.

> 
>> +		break;
>> +	default:
>> +		netdev_err(emac->ndev, "%s: unsupported state\n", __func__);
> 
> There are two levels of logging for this error, and this particular one
> isn't useful. We can infer it went through the "default" case when the
> printk below returned -EINVAL, because if that -EINVAL came from
> emac_set_port_state(), that would have printed, in turn, "invalid port command".
> 

But, the enum tas_state and enum icssg_port_state_cmd are not 1-1 mapped.

emac_set_port_state() will only return -EINVAL when `cmd >=
ICSSG_EMAC_PORT_MAX_COMMANDS` which is 19. But a tas_state value of 3 is
also invalid as we only support value of 0,1 and 2 so I think this print
shoudl be okay

enum tas_state {
	TAS_STATE_DISABLE = 0,
	TAS_STATE_ENABLE = 1,
	TAS_STATE_RESET = 2,
};


> I don't think that a "default" case is needed here, as long as all enum
> values are handled, and the input is sanitized everywhere (which it is).
> 

I think the default case should remain. Without default case the
function will return 0 even for invalid sates. By default ret = 0, in
the tas_state passed to API is not valid, none of the case will be
called, ret will remaing zero. No error will be printed and the function
will return 0. Keeping default case makes sure that the wrong state was
requested.

>> +		ret = -EINVAL;
>> +		break;
>> +	}
>> +
>> +	if (ret)
>> +		netdev_err(emac->ndev, "TAS set state failed %d\n", ret);
> 
> FWIW, emac_set_port_state() has its own logging. I don't necessarily see
> the need for this print.
> 

This print we can remove. I don't see any need to keep this print as
emac_set_port_state() will anyways print a ndetdev_err() if the cmd fails.

>> +	return ret;
>> +}
>> +
>> +static int tas_set_trigger_list_change(struct prueth_emac *emac)
>> +{
>> +	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
>> +	struct tas_config *tas = &emac->qos.tas.config;
>> +	struct ptp_system_timestamp sts;
>> +	u32 change_cycle_count;
>> +	u32 cycle_time;
>> +	u64 base_time;
>> +	u64 cur_time;
>> +
>> +	/* IEP clock has a hardware errata due to which it wraps around exactly
>> +	 * once every taprio cycle. To compensate for that, adjust cycle time
>> +	 * by the wrap around time which is stored in emac->iep->def_inc
>> +	 */
>> +	cycle_time = admin_list->cycle_time - emac->iep->def_inc;
>> +	base_time = admin_list->base_time;
>> +	cur_time = prueth_iep_gettime(emac, &sts);
>> +
>> +	if (base_time > cur_time)
>> +		change_cycle_count = DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time);
>> +	else
>> +		change_cycle_count = 1;
>> +
>> +	writel(cycle_time, emac->dram.va + TAS_ADMIN_CYCLE_TIME);
>> +	writel(change_cycle_count, emac->dram.va + TAS_CONFIG_CHANGE_CYCLE_COUNT);
>> +	writeb(admin_list->num_entries, emac->dram.va + TAS_ADMIN_LIST_LENGTH);
>> +
>> +	/* config_change cleared by f/w to ack reception of new shadow list */
>> +	writeb(1, &tas->config_list->config_change);
>> +	/* config_pending cleared by f/w when new shadow list is copied to active list */
>> +	writeb(1, &tas->config_list->config_pending);
>> +
>> +	return emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
> 
> The call path here is:
> 
> emac_taprio_replace()
> -> tas_update_oper_list()
>    -> tas_set_trigger_list_change()
>       -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
> -> tas_set_state(emac, TAS_STATE_ENABLE);
>    -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
> 
> I'm surprised by the calls to emac_set_port_state() in such a quick
> succession? Is there any firmware requirement for how much should the
> port stay in the TAS_TRIGGER state? Or is it not really a state, despite
> it being an argument to a function named emac_set_port_state()?
> 

ICSSG_EMAC_PORT_TAS_TRIGGER is not a state. emac_set_port_state() sends
a command to firmware, we call it r30 command. Driver then waits for the
response for some time. If a successfull response is recived the
function return 0 otherwise error.

Here first `emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER)` is
called which will ask firmware to swap the active_list and shadow_list
as explained above.

After that ICSSG_EMAC_PORT_TAS_ENABLE cmd is sent. Upon recievinig this
command firmware will Enable TAS for the particular port. (port is part
of emac structure).

I can see how that can be confusing given the API name is
emac_set_port_state(). Some of the cmds infact triggers a state change
eg. ICSSG_EMAC_PORT_DISABLE, ICSSG_EMAC_PORT_BLOCK,
ICSSG_EMAC_PORT_FORWARD but some of the commands just triggers some
action on the firmware side. Based on the command firmware does some
actions.


>> +}
> 
> There's something extremely elementary about this function which I still
> don't understand.
> 
> When does the schedule actually _start_? Can that be controlled by the
> driver with the high (nanosecond) precision necessary in order for the
> ICSSG to synchronize with the schedule of other equipment in the LAN?
> 
> You never pass the base time per se to the firmware. Just a number of
> cycles from now. I guess that number of cycles decides when the schedule
> starts, but what are those cycles relative to?
>

Once the shadow list is updated, the trigger is set in the firmware and
for that API tas_set_trigger_list_change() is called.

The following three offsets are configured in this function,
1. TAS_ADMIN_CYCLE_TIME → admin cycle time
2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
admin list is taken as operating list.
This parameter is calculated based on the base_time, cur_time and
cycle_time. If the base_time is in past (already passed) the
TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.

After configuring the above three parameters, the driver gives the
trigger signal to the firmware using the R30 command interface with
ICSSG_EMAC_PORT_TAS_TRIGGER command.

The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
are relative to time remaining in the base_time from now i.e. base_time
- cur_time.

>> +
>> +static int tas_update_oper_list(struct prueth_emac *emac)
>> +{
>> +	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
>> +	struct tas_config *tas = &emac->qos.tas.config;
>> +	u32 tas_acc_gate_close_time = 0;
>> +	u8 idx, gate_idx, val;
>> +	int ret;
>> +
>> +	if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)
>> +		return -EINVAL;
>> +
>> +	tas_update_fw_list_pointers(emac);
>> +
>> +	for (idx = 0; idx < admin_list->num_entries; idx++) {
>> +		writeb(admin_list->entries[idx].gate_mask,
>> +		       &tas->fw_shadow_list->gate_mask_list[idx]);
>> +		tas_acc_gate_close_time += admin_list->entries[idx].interval;
>> +
>> +		/* extend last entry till end of cycle time */
>> +		if (idx == admin_list->num_entries - 1)
>> +			writel(admin_list->cycle_time,
>> +			       &tas->fw_shadow_list->win_end_time_list[idx]);
>> +		else
>> +			writel(tas_acc_gate_close_time,
>> +			       &tas->fw_shadow_list->win_end_time_list[idx]);
>> +	}
>> +
>> +	/* clear remaining entries */
>> +	for (idx = admin_list->num_entries; idx < TAS_MAX_CMD_LISTS; idx++) {
>> +		writeb(0, &tas->fw_shadow_list->gate_mask_list[idx]);
>> +		writel(0, &tas->fw_shadow_list->win_end_time_list[idx]);
>> +	}
>> +
>> +	/* update the Array of gate close time for each queue in each window */
>> +	for (idx = 0 ; idx < admin_list->num_entries; idx++) {
>> +		/* On Linux, only PRUETH_MAX_TX_QUEUES are supported per port */
>> +		for (gate_idx = 0; gate_idx < PRUETH_MAX_TX_QUEUES; gate_idx++) {
>> +			u8 gate_mask_list_idx = readb(&tas->fw_shadow_list->gate_mask_list[idx]);
>> +			u32 gate_close_time = 0;
>> +
>> +			if (gate_mask_list_idx & BIT(gate_idx))
>> +				gate_close_time = readl(&tas->fw_shadow_list->win_end_time_list[idx]);
>> +
>> +			writel(gate_close_time,
>> +			       &tas->fw_shadow_list->gate_close_time_list[idx][gate_idx]);
>> +		}
> 
> An implementation which operates per TX queues rather than per traffic
> classes should report caps->gate_mask_per_txq = true in TC_QUERY_CAPS.
> 

Sure I will add this implementation.

>> +	}
>> +
>> +	/* tell f/w to swap active & shadow list */
>> +	ret = tas_set_trigger_list_change(emac);
>> +	if (ret) {
>> +		netdev_err(emac->ndev, "failed to swap f/w config list: %d\n", ret);
>> +		return ret;
>> +	}
>> +
>> +	/* Wait for completion */
>> +	ret = readb_poll_timeout(&tas->config_list->config_change, val, !val,
>> +				 USEC_PER_MSEC, 10 * USEC_PER_MSEC);
>> +	if (ret) {
>> +		netdev_err(emac->ndev, "TAS list change completion time out\n");
>> +		return ret;
>> +	}
>> +
>> +	tas_update_fw_list_pointers(emac);
> 
> Calling this twice in the same function? Explanation?
> 

As explained earlier tas_update_fw_list_pointers() is called in the
beginning to set the active and shadow_list. After that we fill the
shadow list and then send commmand to swap the active and shadow list.
As the list are swapped we will call tas_update_fw_list_pointers() to
update the list pointers.

>> +
>> +	return 0;
>> +}
>> +
>> +static int emac_taprio_replace(struct net_device *ndev,
>> +			       struct tc_taprio_qopt_offload *taprio)
>> +{
>> +	struct prueth_emac *emac = netdev_priv(ndev);
>> +	int ret;
>> +
>> +	if (taprio->cycle_time_extension) {
>> +		NL_SET_ERR_MSG_MOD(taprio->extack, "Cycle time extension not supported");
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	if (taprio->cycle_time < TAS_MIN_CYCLE_TIME) {
>> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "cycle_time %llu is less than min supported cycle_time %d",
>> +				       taprio->cycle_time, TAS_MIN_CYCLE_TIME);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (taprio->num_entries > TAS_MAX_CMD_LISTS) {
>> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "num_entries %lu is more than max supported entries %d",
>> +				       taprio->num_entries, TAS_MAX_CMD_LISTS);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (emac->qos.tas.taprio_admin)
>> +		taprio_offload_free(emac->qos.tas.taprio_admin);
>> +
>> +	emac->qos.tas.taprio_admin = taprio_offload_get(taprio);
>> +	ret = tas_update_oper_list(emac);
> 
> If this fails and there was a previous emac->qos.tas.taprio_admin
> schedule present, you just broke it. In particular, the
> "if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)" bounds check really
> doesn't belong there; it should have been done much earlier, to avoid a
> complete offload breakage for such a silly thing (replacing a working
> taprio schedule with a new one that has too large cycle breaks the old
> schedule).
> 

Will adding the check "if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)"
in emac_taprio_replace() along with the all the checks be OK?


>> +	if (ret)
>> +		goto clear_taprio;
>> +
>> +	ret = tas_set_state(emac, TAS_STATE_ENABLE);
>> +	if (ret)
>> +		goto clear_taprio;
>> +
>> +clear_taprio:
>> +	emac->qos.tas.taprio_admin = NULL;
>> +	taprio_offload_free(taprio);
>> +
>> +	return ret;
>> +}
>> +
>> +static int emac_taprio_destroy(struct net_device *ndev,
>> +			       struct tc_taprio_qopt_offload *taprio)
>> +{
>> +	struct prueth_emac *emac = netdev_priv(ndev);
>> +	int ret;
>> +
>> +	taprio_offload_free(taprio);
>> +
>> +	ret = tas_set_state(emac, TAS_STATE_RESET);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return tas_set_state(emac, TAS_STATE_DISABLE);
> 
> Again, any timing requirements for the state transitions? Why not
> directly do TAS_STATE_DISABLE? It's not very clear what they do
> different, despite of the attempt to document these firmware states.
> 

Doing a TAS_STATE_RESET resets all the configurations done previously.
Disable only disables the state but the configuration remains. When we
destroy the taprio I think we should clear out everything thus reset is
needed.

However I think the sequence is not correct here. It should be
TAS_STATE_DISABLE first and then TAS_STATE_RESET.

>> +}
>> +
>> +static int emac_setup_taprio(struct net_device *ndev, void *type_data)
>> +{
>> +	struct tc_taprio_qopt_offload *taprio = type_data;
>> +	int ret;
>> +
>> +	switch (taprio->cmd) {
>> +	case TAPRIO_CMD_REPLACE:
>> +		ret = emac_taprio_replace(ndev, taprio);
>> +		break;
>> +	case TAPRIO_CMD_DESTROY:
>> +		ret = emac_taprio_destroy(ndev, taprio);
>> +		break;
>> +	default:
>> +		ret = -EOPNOTSUPP;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int icssg_qos_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
>> +			   void *type_data)
>> +{
>> +	switch (type) {
>> +	case TC_SETUP_QDISC_TAPRIO:
>> +		return emac_setup_taprio(ndev, type_data);
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +void icssg_qos_tas_init(struct net_device *ndev)
>> +{
>> +	struct prueth_emac *emac = netdev_priv(ndev);
>> +	struct tas_config *tas;
>> +
>> +	tas = &emac->qos.tas.config;
>> +
>> +	tas->config_list = emac->dram.va + TAS_CONFIG_CHANGE_TIME;
>> +	tas->active_list = emac->dram.va + TAS_ACTIVE_LIST_INDEX;
>> +
>> +	tas_update_fw_list_pointers(emac);
>> +
>> +	tas_set_state(emac, TAS_STATE_RESET);
> 
> Why leave it in TAS_STATE_RESET and not TAS_STATE_DISABLE? The firmware
> state at probe time is not idempotent with its state after a
> emac_taprio_replace() -> emac_taprio_destroy() sequence, which is not
> good.
> 

During init the we want the TAS to be completely cleaned. So if there is
any residual configurations left we want that to be cleared thus
TAS_STATE_RESET is called.

When tas_set_state(emac, TAS_STATE_RESET) or tas_set_state(emac,
TAS_STATE_DISABLE) is called a r30 cmd is sent to firmware to reset /
disable the tas.

As explained above the emac_taprio_destroy() sequence should be
TAS_STATE_DISABLE and then TAS_STATE_RESET.

Here in probe also we are setting the state to TAS_STATE_RESET. Now the
firmware state at probe time will be idempotent with its state after a
emac_taprio_replace() -> emac_taprio_destroy() sequence.

>> +}
>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.h b/drivers/net/ethernet/ti/icssg/icssg_qos.h
>> new file mode 100644
>> index 000000000000..25baccdd1ce5
>> --- /dev/null
>> +++ b/drivers/net/ethernet/ti/icssg/icssg_qos.h
>> @@ -0,0 +1,113 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (C) 2023 Texas Instruments Incorporated - http://www.ti.com/
>> + */
>> +
>> +#ifndef __NET_TI_ICSSG_QOS_H
>> +#define __NET_TI_ICSSG_QOS_H
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/netdevice.h>
>> +#include <net/pkt_sched.h>
>> +
>> +/* Maximum number of gate command entries in each list. */
>> +#define TAS_MAX_CMD_LISTS   (16)
>> +
>> +/* Maximum number of transmit queues supported by implementation */
>> +#define TAS_MAX_NUM_QUEUES  (8)
>> +
>> +/* Minimum cycle time supported by implementation (in ns) */
>> +#define TAS_MIN_CYCLE_TIME  (1000000)
>> +
>> +/* Minimum cycle time supported by implementation (in ns) */
>> +#define TAS_MAX_CYCLE_TIME  (4000000000)
>> +
>> +/* Minimum TAS window duration supported by implementation (in ns) */
>> +#define TAS_MIN_WINDOW_DURATION  (10000)
>> +
>> +/**
>> + * enum tas_list_num - TAS list number
>> + * @TAS_LIST0: TAS list number is 0
>> + * @TAS_LIST1: TAS list number is 1
>> + */
>> +enum tas_list_num {
>> +	TAS_LIST0 = 0,
>> +	TAS_LIST1 = 1
>> +};
>> +
>> +/**
>> + * enum tas_state - State of TAS in firmware
>> + * @TAS_STATE_DISABLE: TAS state machine is disabled.
>> + * @TAS_STATE_ENABLE: TAS state machine is enabled.
>> + * @TAS_STATE_RESET: TAS state machine is reset.
>> + */
>> +enum tas_state {
>> +	TAS_STATE_DISABLE = 0,
>> +	TAS_STATE_ENABLE = 1,
>> +	TAS_STATE_RESET = 2,
>> +};
>> +
>> +/**
>> + * struct tas_config_list - Config state machine variables
>> + * @config_change_time: New list is copied at this time
>> + * @config_change_error_counter: Incremented if admin->BaseTime < current time
>> + *				 and TAS_enabled is true
>> + * @config_pending: True if list update is pending
>> + * @config_change: Set to true when application trigger updating of admin list
>> + *		   to active list, cleared when configChangeTime is updated
>> + */
>> +struct tas_config_list {
>> +	u64 config_change_time;
>> +	u32 config_change_error_counter;
>> +	u8 config_pending;
>> +	u8 config_change;
>> +};
> 
> Should be __packed since it maps over a firmware-defined __iomem memory
> region. The compiler is not free to pad as it wishes.
> 

Sure.

>> +
>> +/* Max SDU table. See IEEE Std 802.1Q-2018 12.29.1.1 */
>> +struct tas_max_sdu_table {
>> +	u16 max_sdu[TAS_MAX_NUM_QUEUES];
>> +};
>> +
>> +/**
>> + * struct tas_firmware_list - TAS List Structure based on firmware memory map
>> + * @gate_mask_list: Window gate mask list
>> + * @win_end_time_list: Window end time list
>> + * @gate_close_time_list: Array of gate close time for each queue in each window
>> + */
>> +struct tas_firmware_list {
>> +	u8 gate_mask_list[TAS_MAX_CMD_LISTS];
>> +	u32 win_end_time_list[TAS_MAX_CMD_LISTS];
>> +	u32 gate_close_time_list[TAS_MAX_CMD_LISTS][TAS_MAX_NUM_QUEUES];
>> +};
> 
> Should be __packed.
> 

Sure.

>> +
>> +/**
>> + * struct tas_config - Main Time Aware Shaper Handle
>> + * @state: TAS state
>> + * @max_sdu_table: Max SDU table
>> + * @config_list: Config change variables
>> + * @active_list: Current operating list operating list
>> + * @fw_active_list: Active List pointer, used by firmware
>> + * @fw_shadow_list: Shadow List pointer, used by driver
>> + */
>> +struct tas_config {
>> +	enum tas_state state;
>> +	struct tas_max_sdu_table max_sdu_table;
>> +	struct tas_config_list __iomem *config_list;
>> +	u8 __iomem *active_list;
>> +	struct tas_firmware_list __iomem *fw_active_list;
>> +	struct tas_firmware_list __iomem *fw_shadow_list;
>> +};
>> +
>> +struct prueth_qos_tas {
>> +	struct tc_taprio_qopt_offload *taprio_admin;
>> +	struct tc_taprio_qopt_offload *taprio_oper;
> 
> "taprio_oper" is unused.
> 

I'll remove this.

>> +	struct tas_config config;
>> +};
>> +
>> +struct prueth_qos {
>> +	struct prueth_qos_tas tas;
>> +};
>> +
>> +void icssg_qos_tas_init(struct net_device *ndev);
>> +int icssg_qos_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
>> +			   void *type_data);
>> +#endif /* __NET_TI_ICSSG_QOS_H */
>> -- 
>> 2.34.1
>>

Thanks for your feedback. I will do the changes mentioned in this mail.
Apart from these if any other change is needed please let me know.

Also please let me know if you have any more queries.
Vladimir Oltean June 3, 2024, 1:51 p.m. UTC | #3
Hi Danish,

On Mon, Jun 03, 2024 at 05:42:06PM +0530, MD Danish Anwar wrote:
> >> diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.c b/drivers/net/ethernet/ti/icssg/icssg_qos.c
> >> new file mode 100644
> >> index 000000000000..5e93b1b9ca43
> >> --- /dev/null
> >> +++ b/drivers/net/ethernet/ti/icssg/icssg_qos.c
> >> @@ -0,0 +1,288 @@
> >> +static void tas_update_fw_list_pointers(struct prueth_emac *emac)
> >> +{
> >> +	struct tas_config *tas = &emac->qos.tas.config;
> >> +
> >> +	if ((readb(tas->active_list)) == TAS_LIST0) {
> > 
> > Who and when updates tas->active_list from TAS_LIST0 to TAS_LIST1?
> >
> 
> ->emac_taprio_replace()
> 	-> tas_update_oper_list()
> 		-> tas_set_trigger_list_change()
> 
> This API send a r30 command to firmware to trigger the list change
> `emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);`
> 
> This once firmware recives this command, it swaps the active and shadow
> list.
> 
> emac_taprio_replace() calls tas_update_oper_list()
> 
> In tas_update_oper_list() in the beginning active_list is 0 i.e.
> TAS_LIST0, tas_update_fw_list_pointers() is called which configures the
> active and shadow list pointers. TAS_LIST0 becomes the active_list and
> TAS_LIST1 becomes the shadow list.
> 
> Let's say before this API was called, active_list is TAS_LIST0 (0) and
> shadow_list is TAS_LIST1.
> 
> After getting the shadow_list we fill three different arrays,
> 1. gate_mask_list[]
> 2. win_end_time_list[]
> 3. gate_close_time_list[][] - 2D array with size = num_entries * num_queues
> 
> Driver only updates the shadow_list. Once shadow list is filled, we call
> tas_set_trigger_list_change() and ask firmware to change the active
> list. Now the shadow_list that we had filled (TAS_LIST1) will become
> active list and vice versa. We will again update our pointers
> 
> This is how list is changed by calling tas_update_fw_list_pointers.
> 
> >> +	tas_update_fw_list_pointers(emac);
> > 
> > Calling this twice in the same function? Explanation?
> > 
> 
> As explained earlier tas_update_fw_list_pointers() is called in the
> beginning to set the active and shadow_list. After that we fill the
> shadow list and then send commmand to swap the active and shadow list.
> As the list are swapped we will call tas_update_fw_list_pointers() to
> update the list pointers.

Ok, but if icssg_qos_tas_init() already calls tas_update_fw_list_pointers()
initially, I don't understand why the first tas_update_oper_list() call
of tas_update_oper_list() is necessary, if only tas_set_trigger_list_change()
swaps the active with the shadow list. There was no unaccounted list
swap prior to the tas_update_oper_list() call, was there?

> >> +static void tas_reset(struct prueth_emac *emac)
> >> +{
> >> +	struct tas_config *tas = &emac->qos.tas.config;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
> >> +		tas->max_sdu_table.max_sdu[i] = 2048;
> > 
> > Macro + short comment for the magic number, please.
> > 
> 
> Sure I will add it. Each elements in this array is a 2 byte value
> showing the maximum length of frame to be allowed through each gate.

Is the queueMaxSDU[] array active even with the TAS being in the reset
state? Does this configuration have any impact upon the device MTU?
I don't know why 2048 was chosen.

> >> +static int tas_set_state(struct prueth_emac *emac, enum tas_state state)
> >> +{
> >> +	struct tas_config *tas = &emac->qos.tas.config;
> >> +	int ret;
> >> +
> >> +	if (tas->state == state)
> >> +		return 0;
> >> +
> >> +	switch (state) {
> >> +	case TAS_STATE_RESET:
> >> +		tas_reset(emac);
> >> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_RESET);
> >> +		tas->state = TAS_STATE_RESET;
> >> +		break;
> >> +	case TAS_STATE_ENABLE:
> >> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
> >> +		tas->state = TAS_STATE_ENABLE;
> >> +		break;
> >> +	case TAS_STATE_DISABLE:
> >> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_DISABLE);
> >> +		tas->state = TAS_STATE_DISABLE;
> > 
> > This can be expressed as just "tas->state = state" outside the switch statement.
> > But probably shouldn't be, if "ret != 0".
> 
> Yes we shouldn't do that as we are sending the r30 command to firmware
> in each case.

I was saying that if there's a firmware error, we probably shouldn't
update our tas->state as if there wasn't.

And that the tas->state = state assignment is common across all switch
cases, so it's simpler to move it out.

> > 
> >> +		break;
> >> +	default:
> >> +		netdev_err(emac->ndev, "%s: unsupported state\n", __func__);
> > 
> > There are two levels of logging for this error, and this particular one
> > isn't useful. We can infer it went through the "default" case when the
> > printk below returned -EINVAL, because if that -EINVAL came from
> > emac_set_port_state(), that would have printed, in turn, "invalid port command".
> > 
> 
> But, the enum tas_state and enum icssg_port_state_cmd are not 1-1 mapped.

Correct, but you aren't printing the tas_state anyway, and there's no
code path possible with a tas_state outside the well-defined values.

> emac_set_port_state() will only return -EINVAL when `cmd >=
> ICSSG_EMAC_PORT_MAX_COMMANDS` which is 19. But a tas_state value of 3 is
> also invalid as we only support value of 0,1 and 2 so I think this print
> shoudl be okay
> 
> enum tas_state {
> 	TAS_STATE_DISABLE = 0,
> 	TAS_STATE_ENABLE = 1,
> 	TAS_STATE_RESET = 2,
> };
> 
> > I don't think that a "default" case is needed here, as long as all enum
> > values are handled, and the input is sanitized everywhere (which it is).
> > 
> 
> I think the default case should remain. Without default case the
> function will return 0 even for invalid sates. By default ret = 0, in
> the tas_state passed to API is not valid, none of the case will be
> called, ret will remaing zero. No error will be printed and the function
> will return 0. Keeping default case makes sure that the wrong state was
> requested.
> 

Dead code is what it is. If a new enum tas_state value is added and it's
not handled there, the _compiler_ will warn, rather than the Linux runtime.
So it's actually easier for the developer to catch it, rather than the user.
You don't need to protect against your own shadow.

> >> +static int tas_set_trigger_list_change(struct prueth_emac *emac)
> >> +{
> >> +	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
> >> +	struct tas_config *tas = &emac->qos.tas.config;
> >> +	struct ptp_system_timestamp sts;
> >> +	u32 change_cycle_count;
> >> +	u32 cycle_time;
> >> +	u64 base_time;
> >> +	u64 cur_time;
> >> +
> >> +	/* IEP clock has a hardware errata due to which it wraps around exactly
> >> +	 * once every taprio cycle. To compensate for that, adjust cycle time
> >> +	 * by the wrap around time which is stored in emac->iep->def_inc
> >> +	 */
> >> +	cycle_time = admin_list->cycle_time - emac->iep->def_inc;
> >> +	base_time = admin_list->base_time;
> >> +	cur_time = prueth_iep_gettime(emac, &sts);
> >> +
> >> +	if (base_time > cur_time)
> >> +		change_cycle_count = DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time);
> >> +	else
> >> +		change_cycle_count = 1;
> >> +
> >> +	writel(cycle_time, emac->dram.va + TAS_ADMIN_CYCLE_TIME);
> >> +	writel(change_cycle_count, emac->dram.va + TAS_CONFIG_CHANGE_CYCLE_COUNT);
> >> +	writeb(admin_list->num_entries, emac->dram.va + TAS_ADMIN_LIST_LENGTH);
> >> +
> >> +	/* config_change cleared by f/w to ack reception of new shadow list */
> >> +	writeb(1, &tas->config_list->config_change);
> >> +	/* config_pending cleared by f/w when new shadow list is copied to active list */
> >> +	writeb(1, &tas->config_list->config_pending);
> >> +
> >> +	return emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
> > 
> > The call path here is:
> > 
> > emac_taprio_replace()
> > -> tas_update_oper_list()
> >    -> tas_set_trigger_list_change()
> >       -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
> > -> tas_set_state(emac, TAS_STATE_ENABLE);
> >    -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
> > 
> > I'm surprised by the calls to emac_set_port_state() in such a quick
> > succession? Is there any firmware requirement for how much should the
> > port stay in the TAS_TRIGGER state? Or is it not really a state, despite
> > it being an argument to a function named emac_set_port_state()?
> > 
> 
> ICSSG_EMAC_PORT_TAS_TRIGGER is not a state. emac_set_port_state() sends
> a command to firmware, we call it r30 command. Driver then waits for the
> response for some time. If a successfull response is recived the
> function return 0 otherwise error.
> 
> Here first `emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER)` is
> called which will ask firmware to swap the active_list and shadow_list
> as explained above.
> 
> After that ICSSG_EMAC_PORT_TAS_ENABLE cmd is sent. Upon recievinig this
> command firmware will Enable TAS for the particular port. (port is part
> of emac structure).
> 
> I can see how that can be confusing given the API name is
> emac_set_port_state(). Some of the cmds infact triggers a state change
> eg. ICSSG_EMAC_PORT_DISABLE, ICSSG_EMAC_PORT_BLOCK,
> ICSSG_EMAC_PORT_FORWARD but some of the commands just triggers some
> action on the firmware side. Based on the command firmware does some
> actions.

If you're replacing an existing active schedule with a shadow one, the
ICSSG_EMAC_PORT_TAS_ENABLE command isn't needed because the TAS is
already enabled on the port, right? In fact it will be suppressed by
tas_set_state() without even generating an emac_set_port_state() call,
right?

> >> +}
> > 
> > There's something extremely elementary about this function which I still
> > don't understand.
> > 
> > When does the schedule actually _start_? Can that be controlled by the
> > driver with the high (nanosecond) precision necessary in order for the
> > ICSSG to synchronize with the schedule of other equipment in the LAN?
> > 
> > You never pass the base time per se to the firmware. Just a number of
> > cycles from now. I guess that number of cycles decides when the schedule
> > starts, but what are those cycles relative to?
> >
> 
> Once the shadow list is updated, the trigger is set in the firmware and
> for that API tas_set_trigger_list_change() is called.
> 
> The following three offsets are configured in this function,
> 1. TAS_ADMIN_CYCLE_TIME → admin cycle time
> 2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
> admin list is taken as operating list.
> This parameter is calculated based on the base_time, cur_time and
> cycle_time. If the base_time is in past (already passed) the
> TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
> future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
> DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
> 3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.
> 
> After configuring the above three parameters, the driver gives the
> trigger signal to the firmware using the R30 command interface with
> ICSSG_EMAC_PORT_TAS_TRIGGER command.
> 
> The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
> are relative to time remaining in the base_time from now i.e. base_time
> - cur_time.

So you're saying that the firmware executes the schedule switch at

	now                  +      TAS_ADMIN_CYCLE_TIME * TAS_CONFIG_CHANGE_CYCLE_COUNT ns
	~~~
	time of reception of
	ICSSG_EMAC_PORT_TAS_TRIGGER
	R30 command

?

I'm not really interested in how the driver calculates the cycle count,
just in what are the primitives that the firmware ABI wants.

Does the readb_poll_timeout() call from tas_update_oper_list() actually
wait until this whole time elapses? It is user space input, so it can
keep a task waiting in the kernel, with rtnl_lock() acquired, for a very
long time if the base_time is far away in the future.

If my understanding is correct, then there are 2 things you cannot do
(which IMO are very important) with the current firmware ABI:

1. You cannot synchronize the schedules on two ICSSG devices to one
another.

You are supposed to be able to run the same taprio command on the egress
port of 2 chained switches in a LAN:

tc qdisc replace dev swp0 parent root taprio \
	num_tc 8 \
	map 0 1 2 3 4 5 6 7 \
	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
	base-time 0 \
	sched-entry S 0x81 100000 \
	sched-entry S 0x01 900000 \
	flags 0x2 \
	max-sdu 0 0 0 0 0 0 0 79

and, assuming that the switches are synchronized by PTP, the gate events
will be synchronized on the 2 switches.

But if the schedule change formula in the firmware is fundamentally
dependant on a "now" that depends on when the Linux driver performed the
TAS_TRIGGER action, the gate events will never be precisely synchronized.

Here, "base-time 0" means that the driver/firmware/hardware should
advance the schedule start time into the closest moment in PTP time
which is a multiple of the cycle-time (100000+900000=1000000). So for
example, if the current PTP time is 1000.123456789, the closest start
time would be 100.124000000.

2. You cannot apply a phase offset between the schedules on two ICSSG
devices in the same network.

Since there is a PHY-dependent propagation delay on each link, network
engineers typically delay the schedules on switch ports along the path
of a stream.

Say for example there is a propagation delay of 800 ns on a switch with
base-time 0. On the next switch, you could add the schedule like this:

tc qdisc replace dev swp0 parent root taprio \
	num_tc 8 \
	map 0 1 2 3 4 5 6 7 \
	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
	base-time 800 \
	sched-entry S 0x81 100000 \
	sched-entry S 0x01 900000 \
	flags 0x2 \
	max-sdu 0 0 0 0 0 0 0 79

Same schedule, phase-shifted by 800 ns, so that if the packet goes
through an open gate in the first switch, it will also go through an
open gate through the second.

According to your own calculations and explanations, the firmware ABI
makes no difference between base-time 0 and base-time 800.

In this case they are probably both smaller than the current time, so
TAS_CONFIG_CHANGE_CYCLE_COUNT will be set to the same "1" in both cases.

But even assuming a future base-time, it still will make no difference.
The firmware seems to operate only on integer multiples of a cycle-time
(here 1000000).

Summarized, the blocking problems I see are:

- For issue #2, the driver should not lie to the user space that it
  applied a schedule with a base-time that isn't a precise multiple of
  the cycle-time, because it doesn't do that.

- For issue #1, the bigger problem is that there is always a
  software-induced jitter which makes whatever the user space has
  requested irrelevant.

This is sufficiently bad that I don't think it's worth spending any more
time on anything else until it is clear how you can make the firmware
actually obey the requested base-time.

> >> +static int emac_taprio_replace(struct net_device *ndev,
> >> +			       struct tc_taprio_qopt_offload *taprio)
> >> +{
> >> +	struct prueth_emac *emac = netdev_priv(ndev);
> >> +	int ret;
> >> +
> >> +	if (taprio->cycle_time_extension) {
> >> +		NL_SET_ERR_MSG_MOD(taprio->extack, "Cycle time extension not supported");
> >> +		return -EOPNOTSUPP;
> >> +	}
> >> +
> >> +	if (taprio->cycle_time < TAS_MIN_CYCLE_TIME) {
> >> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "cycle_time %llu is less than min supported cycle_time %d",
> >> +				       taprio->cycle_time, TAS_MIN_CYCLE_TIME);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	if (taprio->num_entries > TAS_MAX_CMD_LISTS) {
> >> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "num_entries %lu is more than max supported entries %d",
> >> +				       taprio->num_entries, TAS_MAX_CMD_LISTS);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	if (emac->qos.tas.taprio_admin)
> >> +		taprio_offload_free(emac->qos.tas.taprio_admin);
> >> +
> >> +	emac->qos.tas.taprio_admin = taprio_offload_get(taprio);
> >> +	ret = tas_update_oper_list(emac);
> > 
> > If this fails and there was a previous emac->qos.tas.taprio_admin
> > schedule present, you just broke it. In particular, the
> > "if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)" bounds check really
> > doesn't belong there; it should have been done much earlier, to avoid a
> > complete offload breakage for such a silly thing (replacing a working
> > taprio schedule with a new one that has too large cycle breaks the old
> > schedule).
> > 
> 
> Will adding the check "if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)"
> in emac_taprio_replace() along with the all the checks be OK?

Yes, but "it will be ok" needs to be put in proper context (this small
thing is OK doesn't mean everything is OK).
Vladimir Oltean June 3, 2024, 2:05 p.m. UTC | #4
On Mon, Jun 03, 2024 at 04:51:00PM +0300, Vladimir Oltean wrote:
> > >> +static void tas_reset(struct prueth_emac *emac)
> > >> +{
> > >> +	struct tas_config *tas = &emac->qos.tas.config;
> > >> +	int i;
> > >> +
> > >> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
> > >> +		tas->max_sdu_table.max_sdu[i] = 2048;
> > > 
> > > Macro + short comment for the magic number, please.
> > > 
> > 
> > Sure I will add it. Each elements in this array is a 2 byte value
> > showing the maximum length of frame to be allowed through each gate.
> 
> Is the queueMaxSDU[] array active even with the TAS being in the reset
> state? Does this configuration have any impact upon the device MTU?
> I don't know why 2048 was chosen.

Another comment here is: in the tc-taprio UAPI, a max-sdu value of 0
is special and means "no maxSDU limit for this TX queue". You are
programming the values from taprio straight away to hardware, so,
assuming there's no bug there, it means that the hardware also
understands 0 to mean "no maxSDU limit".

If so, then during tas_reset(), after which the TAS should be disabled,
why aren't you also using 0 as a default value, but 2048?
MD Danish Anwar June 6, 2024, 11:03 a.m. UTC | #5
Hi Vladimir,

On 03/06/24 7:21 pm, Vladimir Oltean wrote:
> Hi Danish,
> 
> On Mon, Jun 03, 2024 at 05:42:06PM +0530, MD Danish Anwar wrote:
>>>> diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.c b/drivers/net/ethernet/ti/icssg/icssg_qos.c
>>>> new file mode 100644
>>>> index 000000000000..5e93b1b9ca43
>>>> --- /dev/null
>>>> +++ b/drivers/net/ethernet/ti/icssg/icssg_qos.c
>>>> @@ -0,0 +1,288 @@
>>>> +static void tas_update_fw_list_pointers(struct prueth_emac *emac)
>>>> +{
>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>> +
>>>> +	if ((readb(tas->active_list)) == TAS_LIST0) {
>>>
>>> Who and when updates tas->active_list from TAS_LIST0 to TAS_LIST1?
>>>
>>
>> ->emac_taprio_replace()
>> 	-> tas_update_oper_list()
>> 		-> tas_set_trigger_list_change()
>>
>> This API send a r30 command to firmware to trigger the list change
>> `emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);`
>>
>> This once firmware recives this command, it swaps the active and shadow
>> list.
>>
>> emac_taprio_replace() calls tas_update_oper_list()
>>
>> In tas_update_oper_list() in the beginning active_list is 0 i.e.
>> TAS_LIST0, tas_update_fw_list_pointers() is called which configures the
>> active and shadow list pointers. TAS_LIST0 becomes the active_list and
>> TAS_LIST1 becomes the shadow list.
>>
>> Let's say before this API was called, active_list is TAS_LIST0 (0) and
>> shadow_list is TAS_LIST1.
>>
>> After getting the shadow_list we fill three different arrays,
>> 1. gate_mask_list[]
>> 2. win_end_time_list[]
>> 3. gate_close_time_list[][] - 2D array with size = num_entries * num_queues
>>
>> Driver only updates the shadow_list. Once shadow list is filled, we call
>> tas_set_trigger_list_change() and ask firmware to change the active
>> list. Now the shadow_list that we had filled (TAS_LIST1) will become
>> active list and vice versa. We will again update our pointers
>>
>> This is how list is changed by calling tas_update_fw_list_pointers.
>>
>>>> +	tas_update_fw_list_pointers(emac);
>>>
>>> Calling this twice in the same function? Explanation?
>>>
>>
>> As explained earlier tas_update_fw_list_pointers() is called in the
>> beginning to set the active and shadow_list. After that we fill the
>> shadow list and then send commmand to swap the active and shadow list.
>> As the list are swapped we will call tas_update_fw_list_pointers() to
>> update the list pointers.
> 
> Ok, but if icssg_qos_tas_init() already calls tas_update_fw_list_pointers()
> initially, I don't understand why the first tas_update_oper_list() call
> of tas_update_oper_list() is necessary, if only tas_set_trigger_list_change()
> swaps the active with the shadow list. There was no unaccounted list
> swap prior to the tas_update_oper_list() call, was there?
> 

You are right. This additional call to tas_update_fw_list_pointers() is
not needed. I will drop the call in the begining. The call should only
be made after the lists have been swapped.

>>>> +static void tas_reset(struct prueth_emac *emac)
>>>> +{
>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
>>>> +		tas->max_sdu_table.max_sdu[i] = 2048;
>>>
>>> Macro + short comment for the magic number, please.
>>>
>>
>> Sure I will add it. Each elements in this array is a 2 byte value
>> showing the maximum length of frame to be allowed through each gate.
> 
> Is the queueMaxSDU[] array active even with the TAS being in the reset
> state? Does this configuration have any impact upon the device MTU?
> I don't know why 2048 was chosen.

I talked to the firmware team. The value of 248 is actually wrong. It
should be the device mtu only i.e. PRUETH_MAX_MTU.

> 
>>>> +static int tas_set_state(struct prueth_emac *emac, enum tas_state state)
>>>> +{
>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>> +	int ret;
>>>> +
>>>> +	if (tas->state == state)
>>>> +		return 0;
>>>> +
>>>> +	switch (state) {
>>>> +	case TAS_STATE_RESET:
>>>> +		tas_reset(emac);
>>>> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_RESET);
>>>> +		tas->state = TAS_STATE_RESET;
>>>> +		break;
>>>> +	case TAS_STATE_ENABLE:
>>>> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
>>>> +		tas->state = TAS_STATE_ENABLE;
>>>> +		break;
>>>> +	case TAS_STATE_DISABLE:
>>>> +		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_DISABLE);
>>>> +		tas->state = TAS_STATE_DISABLE;
>>>
>>> This can be expressed as just "tas->state = state" outside the switch statement.
>>> But probably shouldn't be, if "ret != 0".
>>
>> Yes we shouldn't do that as we are sending the r30 command to firmware
>> in each case.
> 
> I was saying that if there's a firmware error, we probably shouldn't
> update our tas->state as if there wasn't.
> 
> And that the tas->state = state assignment is common across all switch
> cases, so it's simpler to move it out.
> 

Understood, I will move this out of switch block and only set it if
emac_set_port_state() was a success.

>>>
>>>> +		break;
>>>> +	default:
>>>> +		netdev_err(emac->ndev, "%s: unsupported state\n", __func__);
>>>
>>> There are two levels of logging for this error, and this particular one
>>> isn't useful. We can infer it went through the "default" case when the
>>> printk below returned -EINVAL, because if that -EINVAL came from
>>> emac_set_port_state(), that would have printed, in turn, "invalid port command".
>>>
>>
>> But, the enum tas_state and enum icssg_port_state_cmd are not 1-1 mapped.
> 
> Correct, but you aren't printing the tas_state anyway, and there's no
> code path possible with a tas_state outside the well-defined values.
> 
>> emac_set_port_state() will only return -EINVAL when `cmd >=
>> ICSSG_EMAC_PORT_MAX_COMMANDS` which is 19. But a tas_state value of 3 is
>> also invalid as we only support value of 0,1 and 2 so I think this print
>> shoudl be okay
>>
>> enum tas_state {
>> 	TAS_STATE_DISABLE = 0,
>> 	TAS_STATE_ENABLE = 1,
>> 	TAS_STATE_RESET = 2,
>> };
>>
>>> I don't think that a "default" case is needed here, as long as all enum
>>> values are handled, and the input is sanitized everywhere (which it is).
>>>
>>
>> I think the default case should remain. Without default case the
>> function will return 0 even for invalid sates. By default ret = 0, in
>> the tas_state passed to API is not valid, none of the case will be
>> called, ret will remaing zero. No error will be printed and the function
>> will return 0. Keeping default case makes sure that the wrong state was
>> requested.
>>
> 
> Dead code is what it is. If a new enum tas_state value is added and it's
> not handled there, the _compiler_ will warn, rather than the Linux runtime.
> So it's actually easier for the developer to catch it, rather than the user.
> You don't need to protect against your own shadow.


Sure, I will drop the default case.

> 
>>>> +static int tas_set_trigger_list_change(struct prueth_emac *emac)
>>>> +{
>>>> +	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>> +	struct ptp_system_timestamp sts;
>>>> +	u32 change_cycle_count;
>>>> +	u32 cycle_time;
>>>> +	u64 base_time;
>>>> +	u64 cur_time;
>>>> +
>>>> +	/* IEP clock has a hardware errata due to which it wraps around exactly
>>>> +	 * once every taprio cycle. To compensate for that, adjust cycle time
>>>> +	 * by the wrap around time which is stored in emac->iep->def_inc
>>>> +	 */
>>>> +	cycle_time = admin_list->cycle_time - emac->iep->def_inc;
>>>> +	base_time = admin_list->base_time;
>>>> +	cur_time = prueth_iep_gettime(emac, &sts);
>>>> +
>>>> +	if (base_time > cur_time)
>>>> +		change_cycle_count = DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time);
>>>> +	else
>>>> +		change_cycle_count = 1;
>>>> +
>>>> +	writel(cycle_time, emac->dram.va + TAS_ADMIN_CYCLE_TIME);
>>>> +	writel(change_cycle_count, emac->dram.va + TAS_CONFIG_CHANGE_CYCLE_COUNT);
>>>> +	writeb(admin_list->num_entries, emac->dram.va + TAS_ADMIN_LIST_LENGTH);
>>>> +
>>>> +	/* config_change cleared by f/w to ack reception of new shadow list */
>>>> +	writeb(1, &tas->config_list->config_change);
>>>> +	/* config_pending cleared by f/w when new shadow list is copied to active list */
>>>> +	writeb(1, &tas->config_list->config_pending);
>>>> +
>>>> +	return emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
>>>
>>> The call path here is:
>>>
>>> emac_taprio_replace()
>>> -> tas_update_oper_list()
>>>    -> tas_set_trigger_list_change()
>>>       -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
>>> -> tas_set_state(emac, TAS_STATE_ENABLE);
>>>    -> emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
>>>
>>> I'm surprised by the calls to emac_set_port_state() in such a quick
>>> succession? Is there any firmware requirement for how much should the
>>> port stay in the TAS_TRIGGER state? Or is it not really a state, despite
>>> it being an argument to a function named emac_set_port_state()?
>>>
>>
>> ICSSG_EMAC_PORT_TAS_TRIGGER is not a state. emac_set_port_state() sends
>> a command to firmware, we call it r30 command. Driver then waits for the
>> response for some time. If a successfull response is recived the
>> function return 0 otherwise error.
>>
>> Here first `emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER)` is
>> called which will ask firmware to swap the active_list and shadow_list
>> as explained above.
>>
>> After that ICSSG_EMAC_PORT_TAS_ENABLE cmd is sent. Upon recievinig this
>> command firmware will Enable TAS for the particular port. (port is part
>> of emac structure).
>>
>> I can see how that can be confusing given the API name is
>> emac_set_port_state(). Some of the cmds infact triggers a state change
>> eg. ICSSG_EMAC_PORT_DISABLE, ICSSG_EMAC_PORT_BLOCK,
>> ICSSG_EMAC_PORT_FORWARD but some of the commands just triggers some
>> action on the firmware side. Based on the command firmware does some
>> actions.
> 
> If you're replacing an existing active schedule with a shadow one, the
> ICSSG_EMAC_PORT_TAS_ENABLE command isn't needed because the TAS is
> already enabled on the port, right? In fact it will be suppressed by
> tas_set_state() without even generating an emac_set_port_state() call,
> right?
> 

As this point TAS is not enabled. TAS is enabled on the port only when
ICSSG_EMAC_PORT_TAS_ENABLE is sent. Which happens at the end of
emac_taprio_replace().

>>>> +}
>>>
>>> There's something extremely elementary about this function which I still
>>> don't understand.
>>>
>>> When does the schedule actually _start_? Can that be controlled by the
>>> driver with the high (nanosecond) precision necessary in order for the
>>> ICSSG to synchronize with the schedule of other equipment in the LAN?
>>>
>>> You never pass the base time per se to the firmware. Just a number of
>>> cycles from now. I guess that number of cycles decides when the schedule
>>> starts, but what are those cycles relative to?
>>>
>>
>> Once the shadow list is updated, the trigger is set in the firmware and
>> for that API tas_set_trigger_list_change() is called.
>>
>> The following three offsets are configured in this function,
>> 1. TAS_ADMIN_CYCLE_TIME → admin cycle time
>> 2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
>> admin list is taken as operating list.
>> This parameter is calculated based on the base_time, cur_time and
>> cycle_time. If the base_time is in past (already passed) the
>> TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
>> future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
>> DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
>> 3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.
>>
>> After configuring the above three parameters, the driver gives the
>> trigger signal to the firmware using the R30 command interface with
>> ICSSG_EMAC_PORT_TAS_TRIGGER command.
>>
>> The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
>> are relative to time remaining in the base_time from now i.e. base_time
>> - cur_time.
> 
> So you're saying that the firmware executes the schedule switch at
> 
> 	now                  +      TAS_ADMIN_CYCLE_TIME * TAS_CONFIG_CHANGE_CYCLE_COUNT ns
> 	~~~
> 	time of reception of
> 	ICSSG_EMAC_PORT_TAS_TRIGGER
> 	R30 command
> 
> ?
> 

I talked to the firmware team on this topic. Seems like this is actually
a bug in the firmware design. This *now* is very relative and it will
always introduce jitter as you have mentioned.

The firmware needs to change to handle the below two cases that you have
mentioned.

The schedule should start at base-time (given by user). Instead of
sending the cycle count from now to base-time to firmware. Driver should
send the absolute cycle count corresponding to the base-time. Firmware
can then check the curr cycle count and when it matches the count set by
driver firmware will start scheduling.

change_cycle_count = base-time / cycle-time;

This way the irregularity with *now* will be removed. Now even if we run
the same command on two different ICSSG devices(whose clocks are synced
with PTP), the scheduling will happen at same time.

As the change_cycle_count will be same for both of them. Since the
clocks are synced the current cycle count (read from
TIMESYNC_FW_WC_CYCLECOUNT_OFFSET) will also be same for both the devices

> I'm not really interested in how the driver calculates the cycle count,
> just in what are the primitives that the firmware ABI wants.
> 
> Does the readb_poll_timeout() call from tas_update_oper_list() actually
> wait until this whole time elapses? It is user space input, so it can
> keep a task waiting in the kernel, with rtnl_lock() acquired, for a very
> long time if the base_time is far away in the future.
> 

readb_poll_timeout() call from tas_update_oper_list() waits for exactly
10 msecs. Driver send the trigger_list_change command and sets
config_change register to 1 (details in tas_set_trigger_list_change()).
Driver waits for 10 ms for firmware to clear this register. If the
register is not cleared, list wasn't changed by firmware. Driver will
then return err.

> If my understanding is correct, then there are 2 things you cannot do
> (which IMO are very important) with the current firmware ABI:
> 
> 1. You cannot synchronize the schedules on two ICSSG devices to one
> another.
> 
> You are supposed to be able to run the same taprio command on the egress
> port of 2 chained switches in a LAN:
> 
> tc qdisc replace dev swp0 parent root taprio \
> 	num_tc 8 \
> 	map 0 1 2 3 4 5 6 7 \
> 	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> 	base-time 0 \
> 	sched-entry S 0x81 100000 \
> 	sched-entry S 0x01 900000 \
> 	flags 0x2 \
> 	max-sdu 0 0 0 0 0 0 0 79
> 
> and, assuming that the switches are synchronized by PTP, the gate events
> will be synchronized on the 2 switches.
> 
> But if the schedule change formula in the firmware is fundamentally
> dependant on a "now" that depends on when the Linux driver performed the
> TAS_TRIGGER action, the gate events will never be precisely synchronized.

As mentioned above, changing the implemntation in firmware and driver
can fix this. With the suggested new implementation the timing will not
be dependent on when the driver sends TAS_TRIGGER cmd.

> 
> Here, "base-time 0" means that the driver/firmware/hardware should
> advance the schedule start time into the closest moment in PTP time
> which is a multiple of the cycle-time (100000+900000=1000000). So for
> example, if the current PTP time is 1000.123456789, the closest start
> time would be 100.124000000.
> 

That would be handled. When base-time is 0, change_cycle_count would be
1 meaning firmware will start scheduling in the very next cycle.

> 2. You cannot apply a phase offset between the schedules on two ICSSG
> devices in the same network.
> 
> Since there is a PHY-dependent propagation delay on each link, network
> engineers typically delay the schedules on switch ports along the path
> of a stream.
> 
> Say for example there is a propagation delay of 800 ns on a switch with
> base-time 0. On the next switch, you could add the schedule like this:
> 
> tc qdisc replace dev swp0 parent root taprio \
> 	num_tc 8 \
> 	map 0 1 2 3 4 5 6 7 \
> 	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> 	base-time 800 \
> 	sched-entry S 0x81 100000 \
> 	sched-entry S 0x01 900000 \
> 	flags 0x2 \
> 	max-sdu 0 0 0 0 0 0 0 79
> 
> Same schedule, phase-shifted by 800 ns, so that if the packet goes
> through an open gate in the first switch, it will also go through an
> open gate through the second.
> 
> According to your own calculations and explanations, the firmware ABI
> makes no difference between base-time 0 and base-time 800.
> 

In the new implementation base-time 0 and base-time 800 will make a
difference. as the change_cycle_count will be different from both the cases.
In case of base-time 0, change_cycle_count will be 1. Implying schedule
will start on the very next cycle.

In case of base-time 800, change_cycle_count will be 800 / cycle-time.

> In this case they are probably both smaller than the current time, so
> TAS_CONFIG_CHANGE_CYCLE_COUNT will be set to the same "1" in both cases.
> 

If cycle-time is larger then both 0 and 800 then the change_cycle_count
would be 1 in both the cases.

> But even assuming a future base-time, it still will make no difference.
> The firmware seems to operate only on integer multiples of a cycle-time
> (here 1000000).

Yes, the firmware works only on multiple of cycle time. If the base-time
is not a multiple of cycle time, the scheduling will start on the next
cycle count.

i.e. change_cycle_count = ceil (base-time / cycle-time)

> 
> Summarized, the blocking problems I see are:
> 
> - For issue #2, the driver should not lie to the user space that it
>   applied a schedule with a base-time that isn't a precise multiple of
>   the cycle-time, because it doesn't do that.
> 

Yes, I acknowledge it's a limitation. Driver can print "requested
base-time is not multiple of cycle-time, secheduling will start on the
next available cycle from base-time". I agree the driver shouldn't lie
about this. Whenever driver encounters a base time which is not multiple
of cycle-time. It can still do the scheduling but throw a print so that
user is aware of this.

> - For issue #1, the bigger problem is that there is always a
>   software-induced jitter which makes whatever the user space has
>   requested irrelevant.
> 

As a I mentioned earlier, the new implementation will take care of this.

I will work with the firmware team to get this fixed. Once that's done I
will send a new revision.

Thanks for all the feedbacks. Please let me know if some more
clarification is needed.

> This is sufficiently bad that I don't think it's worth spending any more
> time on anything else until it is clear how you can make the firmware
> actually obey the requested base-time.
> 
>>>> +static int emac_taprio_replace(struct net_device *ndev,
>>>> +			       struct tc_taprio_qopt_offload *taprio)
>>>> +{
>>>> +	struct prueth_emac *emac = netdev_priv(ndev);
>>>> +	int ret;
>>>> +
>>>> +	if (taprio->cycle_time_extension) {
>>>> +		NL_SET_ERR_MSG_MOD(taprio->extack, "Cycle time extension not supported");
>>>> +		return -EOPNOTSUPP;
>>>> +	}
>>>> +
>>>> +	if (taprio->cycle_time < TAS_MIN_CYCLE_TIME) {
>>>> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "cycle_time %llu is less than min supported cycle_time %d",
>>>> +				       taprio->cycle_time, TAS_MIN_CYCLE_TIME);
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (taprio->num_entries > TAS_MAX_CMD_LISTS) {
>>>> +		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "num_entries %lu is more than max supported entries %d",
>>>> +				       taprio->num_entries, TAS_MAX_CMD_LISTS);
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (emac->qos.tas.taprio_admin)
>>>> +		taprio_offload_free(emac->qos.tas.taprio_admin);
>>>> +
>>>> +	emac->qos.tas.taprio_admin = taprio_offload_get(taprio);
>>>> +	ret = tas_update_oper_list(emac);
>>>
>>> If this fails and there was a previous emac->qos.tas.taprio_admin
>>> schedule present, you just broke it. In particular, the
>>> "if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)" bounds check really
>>> doesn't belong there; it should have been done much earlier, to avoid a
>>> complete offload breakage for such a silly thing (replacing a working
>>> taprio schedule with a new one that has too large cycle breaks the old
>>> schedule).
>>>
>>
>> Will adding the check "if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)"
>> in emac_taprio_replace() along with the all the checks be OK?
> 
> Yes, but "it will be ok" needs to be put in proper context (this small
> thing is OK doesn't mean everything is OK).

Understood. I know that the full code is still very far from being OK.
Vladimir Oltean June 6, 2024, 2:17 p.m. UTC | #6
On Thu, Jun 06, 2024 at 04:33:58PM +0530, MD Danish Anwar wrote:
> >>>> +static void tas_reset(struct prueth_emac *emac)
> >>>> +{
> >>>> +	struct tas_config *tas = &emac->qos.tas.config;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
> >>>> +		tas->max_sdu_table.max_sdu[i] = 2048;
> >>>
> >>> Macro + short comment for the magic number, please.
> >>>
> >>
> >> Sure I will add it. Each elements in this array is a 2 byte value
> >> showing the maximum length of frame to be allowed through each gate.
> > 
> > Is the queueMaxSDU[] array active even with the TAS being in the reset
> > state? Does this configuration have any impact upon the device MTU?
> > I don't know why 2048 was chosen.
> 
> I talked to the firmware team. The value of 248 is actually wrong. It
> should be the device mtu only i.e. PRUETH_MAX_MTU.

There was another comment about the value of 0, sent separately.

> > If you're replacing an existing active schedule with a shadow one, the
> > ICSSG_EMAC_PORT_TAS_ENABLE command isn't needed because the TAS is
> > already enabled on the port, right? In fact it will be suppressed by
> > tas_set_state() without even generating an emac_set_port_state() call,
> > right?
> > 
> 
> As this point TAS is not enabled. TAS is enabled on the port only when
> ICSSG_EMAC_PORT_TAS_ENABLE is sent. Which happens at the end of
> emac_taprio_replace().

"If you're replacing an existing active schedule" => emac_taprio_replace()
was already called once, and we're calling it again, with no emac_taprio_destroy()
in between.

This is done using the "tc qdisc replace" command. You can keep the
mqprio parameters the same, just change the schedule parameters.
The transition from the old to the new schedule is supposed to be
seamless and at a well-defined time, according to the IEEE definitions.

> >> The following three offsets are configured in this function,
> >> 1. TAS_ADMIN_CYCLE_TIME → admin cycle time
> >> 2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
> >> admin list is taken as operating list.
> >> This parameter is calculated based on the base_time, cur_time and
> >> cycle_time. If the base_time is in past (already passed) the
> >> TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
> >> future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
> >> DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
> >> 3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.
> >>
> >> After configuring the above three parameters, the driver gives the
> >> trigger signal to the firmware using the R30 command interface with
> >> ICSSG_EMAC_PORT_TAS_TRIGGER command.
> >>
> >> The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
> >> are relative to time remaining in the base_time from now i.e. base_time
> >> - cur_time.
> > 
> > So you're saying that the firmware executes the schedule switch at
> > 
> > 	now                  +      TAS_ADMIN_CYCLE_TIME * TAS_CONFIG_CHANGE_CYCLE_COUNT ns
> > 	~~~
> > 	time of reception of
> > 	ICSSG_EMAC_PORT_TAS_TRIGGER
> > 	R30 command
> > 
> > ?
> > 
> 
> I talked to the firmware team on this topic. Seems like this is actually
> a bug in the firmware design. This *now* is very relative and it will
> always introduce jitter as you have mentioned.
> 
> The firmware needs to change to handle the below two cases that you have
> mentioned.
> 
> The schedule should start at base-time (given by user). Instead of
> sending the cycle count from now to base-time to firmware. Driver should
> send the absolute cycle count corresponding to the base-time. Firmware
> can then check the curr cycle count and when it matches the count set by
> driver firmware will start scheduling.
> 
> change_cycle_count = base-time / cycle-time;
> 
> This way the irregularity with *now* will be removed. Now even if we run
> the same command on two different ICSSG devices(whose clocks are synced
> with PTP), the scheduling will happen at same time.
> 
> As the change_cycle_count will be same for both of them. Since the
> clocks are synced the current cycle count (read from
> TIMESYNC_FW_WC_CYCLECOUNT_OFFSET) will also be same for both the devices

You could pass the actual requested base-time to the firmware, and let
the firmware calculate a cycle count or whatever the hardware needs.
Otherwise, you advance the base-time in the driver into what was the
future at the time, but by the time the r30 command reaches the
firmware, the passed number of cycles has already elapsed.

> > I'm not really interested in how the driver calculates the cycle count,
> > just in what are the primitives that the firmware ABI wants.
> > 
> > Does the readb_poll_timeout() call from tas_update_oper_list() actually
> > wait until this whole time elapses? It is user space input, so it can
> > keep a task waiting in the kernel, with rtnl_lock() acquired, for a very
> > long time if the base_time is far away in the future.
> > 
> 
> readb_poll_timeout() call from tas_update_oper_list() waits for exactly
> 10 msecs. Driver send the trigger_list_change command and sets
> config_change register to 1 (details in tas_set_trigger_list_change()).
> Driver waits for 10 ms for firmware to clear this register. If the
> register is not cleared, list wasn't changed by firmware. Driver will
> then return err.

And the firmware clears this register when? Quickly upon reception of
the TAS_TRIGGER command, or after the TAS is actually triggered (after
change_cycle_count cycles)?

> > 2. You cannot apply a phase offset between the schedules on two ICSSG
> > devices in the same network.
> > 
> > Since there is a PHY-dependent propagation delay on each link, network
> > engineers typically delay the schedules on switch ports along the path
> > of a stream.
> > 
> > Say for example there is a propagation delay of 800 ns on a switch with
> > base-time 0. On the next switch, you could add the schedule like this:
> > 
> > tc qdisc replace dev swp0 parent root taprio \
> > 	num_tc 8 \
> > 	map 0 1 2 3 4 5 6 7 \
> > 	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> > 	base-time 800 \
> > 	sched-entry S 0x81 100000 \
> > 	sched-entry S 0x01 900000 \
> > 	flags 0x2 \
> > 	max-sdu 0 0 0 0 0 0 0 79
> > 
> > Same schedule, phase-shifted by 800 ns, so that if the packet goes
> > through an open gate in the first switch, it will also go through an
> > open gate through the second.
> > 
> > According to your own calculations and explanations, the firmware ABI
> > makes no difference between base-time 0 and base-time 800.
> > 
> 
> In the new implementation base-time 0 and base-time 800 will make a
> difference. as the change_cycle_count will be different from both the cases.
> In case of base-time 0, change_cycle_count will be 1. Implying schedule
> will start on the very next cycle.
> 
> In case of base-time 800, change_cycle_count will be 800 / cycle-time.

In this example, cycle-time is (much) larger than 800 ns, so 800 / cycle-time is 0.
Simply put, base-time 0 and base-time 800 will still be treated equally,
if the firmware only starts the schedule upon integer multiples of the
cycle time. A use case is offsetting schedules by a small value, smaller
than the cycle time.

The base-time value of 800 should be advanced by the smallest integer
multiple of the cycle-time that satisfies the inequality
new-base-time = (base-time + N * cycle-time) >= now.

You can see that for the same value of N and cycle-time, new-base-time
will different when base-time = 0 vs when base-time = 800. Taprio
expects that difference to be reflected into the schedule.

> > In this case they are probably both smaller than the current time, so
> > TAS_CONFIG_CHANGE_CYCLE_COUNT will be set to the same "1" in both cases.
> > 
> 
> If cycle-time is larger then both 0 and 800 then the change_cycle_count
> would be 1 in both the cases.
> 
> > But even assuming a future base-time, it still will make no difference.
> > The firmware seems to operate only on integer multiples of a cycle-time
> > (here 1000000).
> 
> Yes, the firmware works only on multiple of cycle time. If the base-time
> is not a multiple of cycle time, the scheduling will start on the next
> cycle count.
> 
> i.e. change_cycle_count = ceil (base-time / cycle-time)
> > Summarized, the blocking problems I see are:
> > 
> > - For issue #2, the driver should not lie to the user space that it
> >   applied a schedule with a base-time that isn't a precise multiple of
> >   the cycle-time, because it doesn't do that.
> > 
> 
> Yes, I acknowledge it's a limitation. Driver can print "requested
> base-time is not multiple of cycle-time, secheduling will start on the
> next available cycle from base-time". I agree the driver shouldn't lie
> about this. Whenever driver encounters a base time which is not multiple
> of cycle-time. It can still do the scheduling but throw a print so that
> user is aware of this.

Is that a firmware or a hardware limitation? You're making it sound as
if we shouldn't be expecting for it to be lifted.

> > - For issue #1, the bigger problem is that there is always a
> >   software-induced jitter which makes whatever the user space has
> >   requested irrelevant.
> > 
> 
> As a I mentioned earlier, the new implementation will take care of this.
> 
> I will work with the firmware team to get this fixed. Once that's done I
> will send a new revision.
> 
> Thanks for all the feedbacks. Please let me know if some more
> clarification is needed.

Ok, so we're waiting for a new firmware release, and a check in the
driver that the firmware version >= some minimum requirement, else
-EOPNOTSUPP?
Anwar, Md Danish June 7, 2024, 4:59 a.m. UTC | #7
On 6/3/2024 7:35 PM, Vladimir Oltean wrote:
> On Mon, Jun 03, 2024 at 04:51:00PM +0300, Vladimir Oltean wrote:
>>>>> +static void tas_reset(struct prueth_emac *emac)
>>>>> +{
>>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>>> +	int i;
>>>>> +
>>>>> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
>>>>> +		tas->max_sdu_table.max_sdu[i] = 2048;
>>>>
>>>> Macro + short comment for the magic number, please.
>>>>
>>>
>>> Sure I will add it. Each elements in this array is a 2 byte value
>>> showing the maximum length of frame to be allowed through each gate.
>>
>> Is the queueMaxSDU[] array active even with the TAS being in the reset
>> state? Does this configuration have any impact upon the device MTU?
>> I don't know why 2048 was chosen.
> 
> Another comment here is: in the tc-taprio UAPI, a max-sdu value of 0
> is special and means "no maxSDU limit for this TX queue". You are
> programming the values from taprio straight away to hardware, so,
> assuming there's no bug there, it means that the hardware also
> understands 0 to mean "no maxSDU limit".
> 

I discussed this with the firmware team. They are not treating 0 as
something special (""no maxSDU limit for this TX queue"). They have
limit on every queue. Driver needs to handle the max-sdu size carefully.

> If so, then during tas_reset(), after which the TAS should be disabled,
> why aren't you also using 0 as a default value, but 2048?

As using 0 doesn't mean anything special in firmware. The default value
during reset is kept as the max supported value.

There's also one thing missing here, the max-sdu table in firmware is
updated (by calling tas_update_maxsdu_table()) only once by driver
during tas_reset(). The firmware table should also be updated once
before triggering the list change so that the firmware would know what
are the max-sdu value that user has requested.

If a user request max-sdu as 0 0 0 80 for 4 queues. The driver will
update these values to firmware as PRUETH_MAX_MTU, PRUETH_MAX_MTU,
PRUETH_MAX_MTU, 80.
MD Danish Anwar June 12, 2024, 10:15 a.m. UTC | #8
On 06/06/24 7:47 pm, Vladimir Oltean wrote:
> On Thu, Jun 06, 2024 at 04:33:58PM +0530, MD Danish Anwar wrote:
>>>>>> +static void tas_reset(struct prueth_emac *emac)
>>>>>> +{
>>>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>>>> +	int i;
>>>>>> +
>>>>>> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
>>>>>> +		tas->max_sdu_table.max_sdu[i] = 2048;
>>>>>
>>>>> Macro + short comment for the magic number, please.
>>>>>
>>>>
>>>> Sure I will add it. Each elements in this array is a 2 byte value
>>>> showing the maximum length of frame to be allowed through each gate.
>>>
>>> Is the queueMaxSDU[] array active even with the TAS being in the reset
>>> state? Does this configuration have any impact upon the device MTU?
>>> I don't know why 2048 was chosen.
>>
>> I talked to the firmware team. The value of 248 is actually wrong. It
>> should be the device mtu only i.e. PRUETH_MAX_MTU.
> 
> There was another comment about the value of 0, sent separately.
> 

Yes, I have replied to that.

>>> If you're replacing an existing active schedule with a shadow one, the
>>> ICSSG_EMAC_PORT_TAS_ENABLE command isn't needed because the TAS is
>>> already enabled on the port, right? In fact it will be suppressed by
>>> tas_set_state() without even generating an emac_set_port_state() call,
>>> right?
>>>
>>
>> As this point TAS is not enabled. TAS is enabled on the port only when
>> ICSSG_EMAC_PORT_TAS_ENABLE is sent. Which happens at the end of
>> emac_taprio_replace().
> 
> "If you're replacing an existing active schedule" => emac_taprio_replace()
> was already called once, and we're calling it again, with no emac_taprio_destroy()
> in between.
> 
> This is done using the "tc qdisc replace" command. You can keep the
> mqprio parameters the same, just change the schedule parameters.
> The transition from the old to the new schedule is supposed to be
> seamless and at a well-defined time, according to the IEEE definitions.
> 

I talked with the FW team. During "tc qdisc replace" there is no need to
send ICSSG_EMAC_PORT_TAS_ENABLE as it is already enabled. I'll fix it.

>>>> The following three offsets are configured in this function,
>>>> 1. TAS_ADMIN_CYCLE_TIME → admin cycle time
>>>> 2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
>>>> admin list is taken as operating list.
>>>> This parameter is calculated based on the base_time, cur_time and
>>>> cycle_time. If the base_time is in past (already passed) the
>>>> TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
>>>> future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
>>>> DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
>>>> 3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.
>>>>
>>>> After configuring the above three parameters, the driver gives the
>>>> trigger signal to the firmware using the R30 command interface with
>>>> ICSSG_EMAC_PORT_TAS_TRIGGER command.
>>>>
>>>> The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
>>>> are relative to time remaining in the base_time from now i.e. base_time
>>>> - cur_time.
>>>
>>> So you're saying that the firmware executes the schedule switch at
>>>
>>> 	now                  +      TAS_ADMIN_CYCLE_TIME * TAS_CONFIG_CHANGE_CYCLE_COUNT ns
>>> 	~~~
>>> 	time of reception of
>>> 	ICSSG_EMAC_PORT_TAS_TRIGGER
>>> 	R30 command
>>>
>>> ?
>>>
>>
>> I talked to the firmware team on this topic. Seems like this is actually
>> a bug in the firmware design. This *now* is very relative and it will
>> always introduce jitter as you have mentioned.
>>
>> The firmware needs to change to handle the below two cases that you have
>> mentioned.
>>
>> The schedule should start at base-time (given by user). Instead of
>> sending the cycle count from now to base-time to firmware. Driver should
>> send the absolute cycle count corresponding to the base-time. Firmware
>> can then check the curr cycle count and when it matches the count set by
>> driver firmware will start scheduling.
>>
>> change_cycle_count = base-time / cycle-time;
>>
>> This way the irregularity with *now* will be removed. Now even if we run
>> the same command on two different ICSSG devices(whose clocks are synced
>> with PTP), the scheduling will happen at same time.
>>
>> As the change_cycle_count will be same for both of them. Since the
>> clocks are synced the current cycle count (read from
>> TIMESYNC_FW_WC_CYCLECOUNT_OFFSET) will also be same for both the devices
> 
> You could pass the actual requested base-time to the firmware, and let
> the firmware calculate a cycle count or whatever the hardware needs.
> Otherwise, you advance the base-time in the driver into what was the
> future at the time, but by the time the r30 command reaches the
> firmware, the passed number of cycles has already elapsed.
> 

Yes that would work too.

>>> I'm not really interested in how the driver calculates the cycle count,
>>> just in what are the primitives that the firmware ABI wants.
>>>
>>> Does the readb_poll_timeout() call from tas_update_oper_list() actually
>>> wait until this whole time elapses? It is user space input, so it can
>>> keep a task waiting in the kernel, with rtnl_lock() acquired, for a very
>>> long time if the base_time is far away in the future.
>>>
>>
>> readb_poll_timeout() call from tas_update_oper_list() waits for exactly
>> 10 msecs. Driver send the trigger_list_change command and sets
>> config_change register to 1 (details in tas_set_trigger_list_change()).
>> Driver waits for 10 ms for firmware to clear this register. If the
>> register is not cleared, list wasn't changed by firmware. Driver will
>> then return err.
> 
> And the firmware clears this register when? Quickly upon reception of
> the TAS_TRIGGER command, or after the TAS is actually triggered (after
> change_cycle_count cycles)?
> 

tas->config_list->config_change is set to 1 by driver in
tas_set_trigger_list_change() before sending the TAS_TRIGGER command.
tas->config_list->config_change indicates that the driver has filled the
shadow list.

Firmware then receives the TAS_TRIGGER command and clears
`tas->config_list->config_change` as soon as new shadow list is received.

Firmware then goes on to copy the shadow list to active list and once
that is done, firmware clears `tas->config_list->config_pending`
register. After this TAS is actually triggered when change_cycle_count
cycles have passed.

>>> 2. You cannot apply a phase offset between the schedules on two ICSSG
>>> devices in the same network.
>>>
>>> Since there is a PHY-dependent propagation delay on each link, network
>>> engineers typically delay the schedules on switch ports along the path
>>> of a stream.
>>>
>>> Say for example there is a propagation delay of 800 ns on a switch with
>>> base-time 0. On the next switch, you could add the schedule like this:
>>>
>>> tc qdisc replace dev swp0 parent root taprio \
>>> 	num_tc 8 \
>>> 	map 0 1 2 3 4 5 6 7 \
>>> 	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
>>> 	base-time 800 \
>>> 	sched-entry S 0x81 100000 \
>>> 	sched-entry S 0x01 900000 \
>>> 	flags 0x2 \
>>> 	max-sdu 0 0 0 0 0 0 0 79
>>>
>>> Same schedule, phase-shifted by 800 ns, so that if the packet goes
>>> through an open gate in the first switch, it will also go through an
>>> open gate through the second.
>>>
>>> According to your own calculations and explanations, the firmware ABI
>>> makes no difference between base-time 0 and base-time 800.
>>>
>>
>> In the new implementation base-time 0 and base-time 800 will make a
>> difference. as the change_cycle_count will be different from both the cases.
>> In case of base-time 0, change_cycle_count will be 1. Implying schedule
>> will start on the very next cycle.
>>
>> In case of base-time 800, change_cycle_count will be 800 / cycle-time.
> 
> In this example, cycle-time is (much) larger than 800 ns, so 800 / cycle-time is 0.
> Simply put, base-time 0 and base-time 800 will still be treated equally,
> if the firmware only starts the schedule upon integer multiples of the
> cycle time. A use case is offsetting schedules by a small value, smaller
> than the cycle time.
> 
> The base-time value of 800 should be advanced by the smallest integer
> multiple of the cycle-time that satisfies the inequality
> new-base-time = (base-time + N * cycle-time) >= now.
> 
> You can see that for the same value of N and cycle-time, new-base-time
> will different when base-time = 0 vs when base-time = 800. Taprio
> expects that difference to be reflected into the schedule.
> 
>>> In this case they are probably both smaller than the current time, so
>>> TAS_CONFIG_CHANGE_CYCLE_COUNT will be set to the same "1" in both cases.
>>>
>>
>> If cycle-time is larger then both 0 and 800 then the change_cycle_count
>> would be 1 in both the cases.
>>
>>> But even assuming a future base-time, it still will make no difference.
>>> The firmware seems to operate only on integer multiples of a cycle-time
>>> (here 1000000).
>>
>> Yes, the firmware works only on multiple of cycle time. If the base-time
>> is not a multiple of cycle time, the scheduling will start on the next
>> cycle count.
>>
>> i.e. change_cycle_count = ceil (base-time / cycle-time)
>>> Summarized, the blocking problems I see are:
>>>
>>> - For issue #2, the driver should not lie to the user space that it
>>>   applied a schedule with a base-time that isn't a precise multiple of
>>>   the cycle-time, because it doesn't do that.
>>>
>>
>> Yes, I acknowledge it's a limitation. Driver can print "requested
>> base-time is not multiple of cycle-time, secheduling will start on the
>> next available cycle from base-time". I agree the driver shouldn't lie
>> about this. Whenever driver encounters a base time which is not multiple
>> of cycle-time. It can still do the scheduling but throw a print so that
>> user is aware of this.
> 
> Is that a firmware or a hardware limitation? You're making it sound as
> if we shouldn't be expecting for it to be lifted.
> 

It is a firmware limitation and I have conveyed this issue to the
firmware team. They are working on fixing this as well. In my earlier
reply I meant to say that if this is not fixed or if it takes very long
to fix this, we can still go ahead with the driver by mentioning this
limitation by a print.

>>> - For issue #1, the bigger problem is that there is always a
>>>   software-induced jitter which makes whatever the user space has
>>>   requested irrelevant.
>>>
>>
>> As a I mentioned earlier, the new implementation will take care of this.
>>
>> I will work with the firmware team to get this fixed. Once that's done I
>> will send a new revision.
>>
>> Thanks for all the feedbacks. Please let me know if some more
>> clarification is needed.
> 
> Ok, so we're waiting for a new firmware release, and a check in the
> driver that the firmware version >= some minimum requirement, else
> -EOPNOTSUPP?

Yes, we are waiting on a new firmware release. The check however might
not be necessary as ICSSG firmware is not yet public. It's not part of
Linux-firmware. We are planning on integrating the ICSSG firmware with
Linux-firmware. However as of now it's not public. Since the firmware is
not public yet and the first public version will most probably be after
this things are fixed and driver support is upstream-ed, I don't think
version check will be needed.

Once the firmware is public and the some change is done in driver that
depends on firmware versions then the check will be necessary.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
index f160a3b71499..6deac9035610 100644
--- a/drivers/net/ethernet/ti/Kconfig
+++ b/drivers/net/ethernet/ti/Kconfig
@@ -190,6 +190,7 @@  config TI_ICSSG_PRUETH
 	depends on PRU_REMOTEPROC
 	depends on ARCH_K3 && OF && TI_K3_UDMA_GLUE_LAYER
 	depends on PTP_1588_CLOCK_OPTIONAL
+	depends on NET_SCH_TAPRIO
 	help
 	  Support dual Gigabit Ethernet ports over the ICSSG PRU Subsystem.
 	  This subsystem is available starting with the AM65 platform.
diff --git a/drivers/net/ethernet/ti/Makefile b/drivers/net/ethernet/ti/Makefile
index 59cd20a38267..0a86311bd170 100644
--- a/drivers/net/ethernet/ti/Makefile
+++ b/drivers/net/ethernet/ti/Makefile
@@ -35,6 +35,7 @@  obj-$(CONFIG_TI_ICSSG_PRUETH) += icssg-prueth.o
 icssg-prueth-y := icssg/icssg_prueth.o \
 		  icssg/icssg_common.o \
 		  icssg/icssg_classifier.o \
+		  icssg/icssg_qos.o \
 		  icssg/icssg_queues.o \
 		  icssg/icssg_config.o \
 		  icssg/icssg_mii_cfg.o \
diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index d159bdf7dd9d..8982ecb8a43d 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -706,6 +706,7 @@  static const struct net_device_ops emac_netdev_ops = {
 	.ndo_eth_ioctl = emac_ndo_ioctl,
 	.ndo_get_stats64 = emac_ndo_get_stats64,
 	.ndo_get_phys_port_name = emac_ndo_get_phys_port_name,
+	.ndo_setup_tc = icssg_qos_ndo_setup_tc,
 };
 
 static int prueth_netdev_init(struct prueth *prueth,
@@ -840,6 +841,8 @@  static int prueth_netdev_init(struct prueth *prueth,
 	emac->rx_hrtimer.function = &emac_rx_timer_callback;
 	prueth->emac[mac] = emac;
 
+	icssg_qos_tas_init(ndev);
+
 	return 0;
 
 free:
diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
index fab2428de78b..c6851546e6c5 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
@@ -37,6 +37,7 @@ 
 #include "icssg_config.h"
 #include "icss_iep.h"
 #include "icssg_switch_map.h"
+#include "icssg_qos.h"
 
 #define PRUETH_MAX_MTU          (2000 - ETH_HLEN - ETH_FCS_LEN)
 #define PRUETH_MIN_PKT_SIZE     (VLAN_ETH_ZLEN)
@@ -195,6 +196,7 @@  struct prueth_emac {
 	/* RX IRQ Coalescing Related */
 	struct hrtimer rx_hrtimer;
 	unsigned long rx_pace_timeout_ns;
+	struct prueth_qos qos;
 };
 
 /**
diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.c b/drivers/net/ethernet/ti/icssg/icssg_qos.c
new file mode 100644
index 000000000000..5e93b1b9ca43
--- /dev/null
+++ b/drivers/net/ethernet/ti/icssg/icssg_qos.c
@@ -0,0 +1,288 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Texas Instruments ICSSG PRUETH QoS submodule
+ * Copyright (C) 2023 Texas Instruments Incorporated - http://www.ti.com/
+ */
+
+#include <linux/printk.h>
+#include "icssg_prueth.h"
+#include "icssg_switch_map.h"
+
+static void tas_update_fw_list_pointers(struct prueth_emac *emac)
+{
+	struct tas_config *tas = &emac->qos.tas.config;
+
+	if ((readb(tas->active_list)) == TAS_LIST0) {
+		tas->fw_active_list = emac->dram.va + TAS_GATE_MASK_LIST0;
+		tas->fw_shadow_list = emac->dram.va + TAS_GATE_MASK_LIST1;
+	} else {
+		tas->fw_active_list = emac->dram.va + TAS_GATE_MASK_LIST1;
+		tas->fw_shadow_list = emac->dram.va + TAS_GATE_MASK_LIST0;
+	}
+}
+
+static void tas_update_maxsdu_table(struct prueth_emac *emac)
+{
+	struct tas_config *tas = &emac->qos.tas.config;
+	u16 __iomem *max_sdu_tbl_ptr;
+	u8 gate_idx;
+
+	/* update the maxsdu table */
+	max_sdu_tbl_ptr = emac->dram.va + TAS_QUEUE_MAX_SDU_LIST;
+
+	for (gate_idx = 0; gate_idx < TAS_MAX_NUM_QUEUES; gate_idx++)
+		writew(tas->max_sdu_table.max_sdu[gate_idx], &max_sdu_tbl_ptr[gate_idx]);
+}
+
+static void tas_reset(struct prueth_emac *emac)
+{
+	struct tas_config *tas = &emac->qos.tas.config;
+	int i;
+
+	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
+		tas->max_sdu_table.max_sdu[i] = 2048;
+
+	tas_update_maxsdu_table(emac);
+
+	writeb(TAS_LIST0, tas->active_list);
+
+	memset_io(tas->fw_active_list, 0, sizeof(*tas->fw_active_list));
+	memset_io(tas->fw_shadow_list, 0, sizeof(*tas->fw_shadow_list));
+}
+
+static int tas_set_state(struct prueth_emac *emac, enum tas_state state)
+{
+	struct tas_config *tas = &emac->qos.tas.config;
+	int ret;
+
+	if (tas->state == state)
+		return 0;
+
+	switch (state) {
+	case TAS_STATE_RESET:
+		tas_reset(emac);
+		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_RESET);
+		tas->state = TAS_STATE_RESET;
+		break;
+	case TAS_STATE_ENABLE:
+		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_ENABLE);
+		tas->state = TAS_STATE_ENABLE;
+		break;
+	case TAS_STATE_DISABLE:
+		ret = emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_DISABLE);
+		tas->state = TAS_STATE_DISABLE;
+		break;
+	default:
+		netdev_err(emac->ndev, "%s: unsupported state\n", __func__);
+		ret = -EINVAL;
+		break;
+	}
+
+	if (ret)
+		netdev_err(emac->ndev, "TAS set state failed %d\n", ret);
+	return ret;
+}
+
+static int tas_set_trigger_list_change(struct prueth_emac *emac)
+{
+	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
+	struct tas_config *tas = &emac->qos.tas.config;
+	struct ptp_system_timestamp sts;
+	u32 change_cycle_count;
+	u32 cycle_time;
+	u64 base_time;
+	u64 cur_time;
+
+	/* IEP clock has a hardware errata due to which it wraps around exactly
+	 * once every taprio cycle. To compensate for that, adjust cycle time
+	 * by the wrap around time which is stored in emac->iep->def_inc
+	 */
+	cycle_time = admin_list->cycle_time - emac->iep->def_inc;
+	base_time = admin_list->base_time;
+	cur_time = prueth_iep_gettime(emac, &sts);
+
+	if (base_time > cur_time)
+		change_cycle_count = DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time);
+	else
+		change_cycle_count = 1;
+
+	writel(cycle_time, emac->dram.va + TAS_ADMIN_CYCLE_TIME);
+	writel(change_cycle_count, emac->dram.va + TAS_CONFIG_CHANGE_CYCLE_COUNT);
+	writeb(admin_list->num_entries, emac->dram.va + TAS_ADMIN_LIST_LENGTH);
+
+	/* config_change cleared by f/w to ack reception of new shadow list */
+	writeb(1, &tas->config_list->config_change);
+	/* config_pending cleared by f/w when new shadow list is copied to active list */
+	writeb(1, &tas->config_list->config_pending);
+
+	return emac_set_port_state(emac, ICSSG_EMAC_PORT_TAS_TRIGGER);
+}
+
+static int tas_update_oper_list(struct prueth_emac *emac)
+{
+	struct tc_taprio_qopt_offload *admin_list = emac->qos.tas.taprio_admin;
+	struct tas_config *tas = &emac->qos.tas.config;
+	u32 tas_acc_gate_close_time = 0;
+	u8 idx, gate_idx, val;
+	int ret;
+
+	if (admin_list->cycle_time > TAS_MAX_CYCLE_TIME)
+		return -EINVAL;
+
+	tas_update_fw_list_pointers(emac);
+
+	for (idx = 0; idx < admin_list->num_entries; idx++) {
+		writeb(admin_list->entries[idx].gate_mask,
+		       &tas->fw_shadow_list->gate_mask_list[idx]);
+		tas_acc_gate_close_time += admin_list->entries[idx].interval;
+
+		/* extend last entry till end of cycle time */
+		if (idx == admin_list->num_entries - 1)
+			writel(admin_list->cycle_time,
+			       &tas->fw_shadow_list->win_end_time_list[idx]);
+		else
+			writel(tas_acc_gate_close_time,
+			       &tas->fw_shadow_list->win_end_time_list[idx]);
+	}
+
+	/* clear remaining entries */
+	for (idx = admin_list->num_entries; idx < TAS_MAX_CMD_LISTS; idx++) {
+		writeb(0, &tas->fw_shadow_list->gate_mask_list[idx]);
+		writel(0, &tas->fw_shadow_list->win_end_time_list[idx]);
+	}
+
+	/* update the Array of gate close time for each queue in each window */
+	for (idx = 0 ; idx < admin_list->num_entries; idx++) {
+		/* On Linux, only PRUETH_MAX_TX_QUEUES are supported per port */
+		for (gate_idx = 0; gate_idx < PRUETH_MAX_TX_QUEUES; gate_idx++) {
+			u8 gate_mask_list_idx = readb(&tas->fw_shadow_list->gate_mask_list[idx]);
+			u32 gate_close_time = 0;
+
+			if (gate_mask_list_idx & BIT(gate_idx))
+				gate_close_time = readl(&tas->fw_shadow_list->win_end_time_list[idx]);
+
+			writel(gate_close_time,
+			       &tas->fw_shadow_list->gate_close_time_list[idx][gate_idx]);
+		}
+	}
+
+	/* tell f/w to swap active & shadow list */
+	ret = tas_set_trigger_list_change(emac);
+	if (ret) {
+		netdev_err(emac->ndev, "failed to swap f/w config list: %d\n", ret);
+		return ret;
+	}
+
+	/* Wait for completion */
+	ret = readb_poll_timeout(&tas->config_list->config_change, val, !val,
+				 USEC_PER_MSEC, 10 * USEC_PER_MSEC);
+	if (ret) {
+		netdev_err(emac->ndev, "TAS list change completion time out\n");
+		return ret;
+	}
+
+	tas_update_fw_list_pointers(emac);
+
+	return 0;
+}
+
+static int emac_taprio_replace(struct net_device *ndev,
+			       struct tc_taprio_qopt_offload *taprio)
+{
+	struct prueth_emac *emac = netdev_priv(ndev);
+	int ret;
+
+	if (taprio->cycle_time_extension) {
+		NL_SET_ERR_MSG_MOD(taprio->extack, "Cycle time extension not supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (taprio->cycle_time < TAS_MIN_CYCLE_TIME) {
+		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "cycle_time %llu is less than min supported cycle_time %d",
+				       taprio->cycle_time, TAS_MIN_CYCLE_TIME);
+		return -EINVAL;
+	}
+
+	if (taprio->num_entries > TAS_MAX_CMD_LISTS) {
+		NL_SET_ERR_MSG_FMT_MOD(taprio->extack, "num_entries %lu is more than max supported entries %d",
+				       taprio->num_entries, TAS_MAX_CMD_LISTS);
+		return -EINVAL;
+	}
+
+	if (emac->qos.tas.taprio_admin)
+		taprio_offload_free(emac->qos.tas.taprio_admin);
+
+	emac->qos.tas.taprio_admin = taprio_offload_get(taprio);
+	ret = tas_update_oper_list(emac);
+	if (ret)
+		goto clear_taprio;
+
+	ret = tas_set_state(emac, TAS_STATE_ENABLE);
+	if (ret)
+		goto clear_taprio;
+
+clear_taprio:
+	emac->qos.tas.taprio_admin = NULL;
+	taprio_offload_free(taprio);
+
+	return ret;
+}
+
+static int emac_taprio_destroy(struct net_device *ndev,
+			       struct tc_taprio_qopt_offload *taprio)
+{
+	struct prueth_emac *emac = netdev_priv(ndev);
+	int ret;
+
+	taprio_offload_free(taprio);
+
+	ret = tas_set_state(emac, TAS_STATE_RESET);
+	if (ret)
+		return ret;
+
+	return tas_set_state(emac, TAS_STATE_DISABLE);
+}
+
+static int emac_setup_taprio(struct net_device *ndev, void *type_data)
+{
+	struct tc_taprio_qopt_offload *taprio = type_data;
+	int ret;
+
+	switch (taprio->cmd) {
+	case TAPRIO_CMD_REPLACE:
+		ret = emac_taprio_replace(ndev, taprio);
+		break;
+	case TAPRIO_CMD_DESTROY:
+		ret = emac_taprio_destroy(ndev, taprio);
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+int icssg_qos_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+			   void *type_data)
+{
+	switch (type) {
+	case TC_SETUP_QDISC_TAPRIO:
+		return emac_setup_taprio(ndev, type_data);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+void icssg_qos_tas_init(struct net_device *ndev)
+{
+	struct prueth_emac *emac = netdev_priv(ndev);
+	struct tas_config *tas;
+
+	tas = &emac->qos.tas.config;
+
+	tas->config_list = emac->dram.va + TAS_CONFIG_CHANGE_TIME;
+	tas->active_list = emac->dram.va + TAS_ACTIVE_LIST_INDEX;
+
+	tas_update_fw_list_pointers(emac);
+
+	tas_set_state(emac, TAS_STATE_RESET);
+}
diff --git a/drivers/net/ethernet/ti/icssg/icssg_qos.h b/drivers/net/ethernet/ti/icssg/icssg_qos.h
new file mode 100644
index 000000000000..25baccdd1ce5
--- /dev/null
+++ b/drivers/net/ethernet/ti/icssg/icssg_qos.h
@@ -0,0 +1,113 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Texas Instruments Incorporated - http://www.ti.com/
+ */
+
+#ifndef __NET_TI_ICSSG_QOS_H
+#define __NET_TI_ICSSG_QOS_H
+
+#include <linux/atomic.h>
+#include <linux/netdevice.h>
+#include <net/pkt_sched.h>
+
+/* Maximum number of gate command entries in each list. */
+#define TAS_MAX_CMD_LISTS   (16)
+
+/* Maximum number of transmit queues supported by implementation */
+#define TAS_MAX_NUM_QUEUES  (8)
+
+/* Minimum cycle time supported by implementation (in ns) */
+#define TAS_MIN_CYCLE_TIME  (1000000)
+
+/* Minimum cycle time supported by implementation (in ns) */
+#define TAS_MAX_CYCLE_TIME  (4000000000)
+
+/* Minimum TAS window duration supported by implementation (in ns) */
+#define TAS_MIN_WINDOW_DURATION  (10000)
+
+/**
+ * enum tas_list_num - TAS list number
+ * @TAS_LIST0: TAS list number is 0
+ * @TAS_LIST1: TAS list number is 1
+ */
+enum tas_list_num {
+	TAS_LIST0 = 0,
+	TAS_LIST1 = 1
+};
+
+/**
+ * enum tas_state - State of TAS in firmware
+ * @TAS_STATE_DISABLE: TAS state machine is disabled.
+ * @TAS_STATE_ENABLE: TAS state machine is enabled.
+ * @TAS_STATE_RESET: TAS state machine is reset.
+ */
+enum tas_state {
+	TAS_STATE_DISABLE = 0,
+	TAS_STATE_ENABLE = 1,
+	TAS_STATE_RESET = 2,
+};
+
+/**
+ * struct tas_config_list - Config state machine variables
+ * @config_change_time: New list is copied at this time
+ * @config_change_error_counter: Incremented if admin->BaseTime < current time
+ *				 and TAS_enabled is true
+ * @config_pending: True if list update is pending
+ * @config_change: Set to true when application trigger updating of admin list
+ *		   to active list, cleared when configChangeTime is updated
+ */
+struct tas_config_list {
+	u64 config_change_time;
+	u32 config_change_error_counter;
+	u8 config_pending;
+	u8 config_change;
+};
+
+/* Max SDU table. See IEEE Std 802.1Q-2018 12.29.1.1 */
+struct tas_max_sdu_table {
+	u16 max_sdu[TAS_MAX_NUM_QUEUES];
+};
+
+/**
+ * struct tas_firmware_list - TAS List Structure based on firmware memory map
+ * @gate_mask_list: Window gate mask list
+ * @win_end_time_list: Window end time list
+ * @gate_close_time_list: Array of gate close time for each queue in each window
+ */
+struct tas_firmware_list {
+	u8 gate_mask_list[TAS_MAX_CMD_LISTS];
+	u32 win_end_time_list[TAS_MAX_CMD_LISTS];
+	u32 gate_close_time_list[TAS_MAX_CMD_LISTS][TAS_MAX_NUM_QUEUES];
+};
+
+/**
+ * struct tas_config - Main Time Aware Shaper Handle
+ * @state: TAS state
+ * @max_sdu_table: Max SDU table
+ * @config_list: Config change variables
+ * @active_list: Current operating list operating list
+ * @fw_active_list: Active List pointer, used by firmware
+ * @fw_shadow_list: Shadow List pointer, used by driver
+ */
+struct tas_config {
+	enum tas_state state;
+	struct tas_max_sdu_table max_sdu_table;
+	struct tas_config_list __iomem *config_list;
+	u8 __iomem *active_list;
+	struct tas_firmware_list __iomem *fw_active_list;
+	struct tas_firmware_list __iomem *fw_shadow_list;
+};
+
+struct prueth_qos_tas {
+	struct tc_taprio_qopt_offload *taprio_admin;
+	struct tc_taprio_qopt_offload *taprio_oper;
+	struct tas_config config;
+};
+
+struct prueth_qos {
+	struct prueth_qos_tas tas;
+};
+
+void icssg_qos_tas_init(struct net_device *ndev);
+int icssg_qos_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+			   void *type_data);
+#endif /* __NET_TI_ICSSG_QOS_H */