[09/15] net: hbl_en: add habanalabs Ethernet driver

Message ID	20240613082208.1439968-10-oshpigelman@habana.ai (mailing list archive)
State	Changes Requested
Headers	show Received: from mail02.habana.ai (habanamailrelay02.habana.ai [62.90.112.121]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 01EC413D534; Thu, 13 Jun 2024 08:22:47 +0000 (UTC) From: Omer Shpigelman <oshpigelman@habana.ai> To: linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, netdev@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: ogabbay@kernel.org, oshpigelman@habana.ai, zyehudai@habana.ai Subject: [PATCH 09/15] net: hbl_en: add habanalabs Ethernet driver Date: Thu, 13 Jun 2024 11:22:02 +0300 Message-Id: <20240613082208.1439968-10-oshpigelman@habana.ai> In-Reply-To: <20240613082208.1439968-1-oshpigelman@habana.ai> References: <20240613082208.1439968-1-oshpigelman@habana.ai> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Introduce HabanaLabs network drivers \| expand [00/15] Introduce HabanaLabs network drivers [01/15] net: hbl_cn: add habanalabs Core Network driver [02/15] net: hbl_cn: memory manager component [03/15] net: hbl_cn: physical layer support [04/15] net: hbl_cn: QP state machine [05/15] net: hbl_cn: memory trace events [06/15] net: hbl_cn: debugfs support [07/15] net: hbl_cn: gaudi2: ASIC register header files [08/15] net: hbl_cn: gaudi2: ASIC specific support [09/15] net: hbl_en: add habanalabs Ethernet driver [10/15] net: hbl_en: gaudi2: ASIC specific support [11/15] RDMA/hbl: add habanalabs RDMA driver [12/15] RDMA/hbl: direct verbs support [13/15] accel/habanalabs: network scaling support [14/15] accel/habanalabs/gaudi2: CN registers header files [15/15] accel/habanalabs/gaudi2: network scaling support

Omer Shpigelman June 13, 2024, 8:22 a.m. UTC

This ethernet driver is initialized via auxiliary bus by the hbl_cn
driver.
It serves mainly for control operations that are needed for AI scaling.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Co-developed-by: Abhilash K V <kvabhilash@habana.ai>
Signed-off-by: Abhilash K V <kvabhilash@habana.ai>
Co-developed-by: Andrey Agranovich <aagranovich@habana.ai>
Signed-off-by: Andrey Agranovich <aagranovich@habana.ai>
Co-developed-by: Bharat Jauhari <bjauhari@habana.ai>
Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Co-developed-by: David Meriin <dmeriin@habana.ai>
Signed-off-by: David Meriin <dmeriin@habana.ai>
Co-developed-by: Sagiv Ozeri <sozeri@habana.ai>
Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Co-developed-by: Zvika Yehudai <zyehudai@habana.ai>
Signed-off-by: Zvika Yehudai <zyehudai@habana.ai>
---
 MAINTAINERS                                   |    9 +
 drivers/net/ethernet/intel/Kconfig            |   18 +
 drivers/net/ethernet/intel/Makefile           |    1 +
 drivers/net/ethernet/intel/hbl_en/Makefile    |    9 +
 .../net/ethernet/intel/hbl_en/common/Makefile |    3 +
 .../net/ethernet/intel/hbl_en/common/hbl_en.c | 1168 +++++++++++++++++
 .../net/ethernet/intel/hbl_en/common/hbl_en.h |  206 +++
 .../intel/hbl_en/common/hbl_en_dcbnl.c        |  101 ++
 .../ethernet/intel/hbl_en/common/hbl_en_drv.c |  211 +++
 .../intel/hbl_en/common/hbl_en_ethtool.c      |  452 +++++++
 10 files changed, 2178 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/hbl_en/Makefile
 create mode 100644 drivers/net/ethernet/intel/hbl_en/common/Makefile
 create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
 create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
 create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
 create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
 create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c

Andrew Lunn June 13, 2024, 9:49 p.m. UTC | #1

> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget);
> +static int hbl_en_port_open(struct hbl_en_port *port);

When you do the Intel internal review, i expect this is crop up. No
forward declarations please. Put the code in the right order so they
are not needed.

> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	struct in_device *in_dev;
> +	struct in_ifaddr *ifa;
> +	int rc = 0;
> +
> +	/* for the case where no src IP is configured */
> +	*src_ip = 0;
> +
> +	/* rtnl lock should be acquired in relevant flows before taking configuration lock */
> +	if (!rtnl_is_locked()) {
> +		netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}

You will find all other drivers just do:

	ASSERT_RTNL().

If your locking is broken, you are probably dead anyway, so you might
as well keep going and try to explode in the most interesting way
possible.

> +static void hbl_en_reset_stats(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	port->net_stats.rx_packets = 0;
> +	port->net_stats.tx_packets = 0;
> +	port->net_stats.rx_bytes = 0;
> +	port->net_stats.tx_bytes = 0;
> +	port->net_stats.tx_errors = 0;
> +	atomic64_set(&port->net_stats.rx_dropped, 0);
> +	atomic64_set(&port->net_stats.tx_dropped, 0);

Why atomic64_set? Atomics are expensive, so you should not be using
them. netdev has other cheaper methods, which other Intel developers
should be happy to tell you all about.

> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	u32 mtu;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(ndev, "port is in reset, can't get MTU\n");
> +		return 0;
> +	}
> +
> +	mtu = ndev->mtu;

I think you need a better error message. All this does is access
ndev->mtu. What does it matter if the port is in reset? You don't
access it.

> +static int hbl_en_close(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	ktime_t timeout;
> +
> +	/* Looks like the return value of this function is not checked, so we can't just return
> +	 * EBUSY if the port is under reset. We need to wait until the reset is finished and then
> +	 * close the port. Otherwise the netdev will set the port as closed although port_close()
> +	 * wasn't called. Only if we waited long enough and the reset hasn't finished, we can return
> +	 * an error without actually closing the port as it is a fatal flow anyway.
> +	 */
> +	timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +	while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		/* If this is called from unregister_netdev() then the port was already closed and
> +		 * hence we can safely return.
> +		 * We could have just check the port_open boolean, but that might hide some future
> +		 * bugs. Hence it is better to use a dedicated flag for that.
> +		 */
> +		if (READ_ONCE(hdev->in_teardown))
> +			return 0;
> +
> +		usleep_range(50, 200);
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			netdev_crit(netdev,
> +				    "Timeout while waiting for port to finish reset, can't close it\n"
> +				    );
> +			return -EBUSY;
> +		}

This has the usual bug. Please look at include/linux/iopoll.h. 

> +		timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +		while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +			usleep_range(50, 200);
> +			if (ktime_compare(ktime_get(), timeout) > 0) {
> +				netdev_crit(port->ndev,
> +					    "Timeout while waiting for port %d to finish reset\n",
> +					    port->idx);
> +				break;
> +			}
> +		}

and again. Don't roll your own timeout loops like this, use the core
version.

> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't change MTU\n");
> +		return -EBUSY;
> +	}
> +
> +	if (netif_running(port->ndev)) {
> +		hbl_en_port_close(port);
> +
> +		/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +		msleep(20);
> +
> +		netdev->mtu = new_mtu;
> +
> +		rc = hbl_en_port_open(port);
> +		if (rc)
> +			netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);

Does that mean the port is FUBAR?

Most operations like this are expected to roll back to the previous
working configuration on failure. So if changing the MTU requires new
buffers in your ring, you should first allocate the new buffers, then
free the old buffers, so that if allocation fails, you still have
buffers, and the device can continue operating.

> +module_param(poll_enable, bool, 0444);
> +MODULE_PARM_DESC(poll_enable,
> +		 "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");

Module parameters are not liked. This probably needs to go away.

> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
> +{
> +	modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
> +	modinfo->type = ETH_MODULE_SFF_8636;

Is this an SFF, not an SFP? How else can you know what module it is
without doing an I2C transfer to ask the module what it is?

> +static int hbl_en_ethtool_get_module_eeprom(struct net_device *ndev, struct ethtool_eeprom *ee,
> +					    u8 *data)
> +{

This is the old API. Please update to the new API so there is access
to all the pages of the SFF/SFP.

> +static int hbl_en_ethtool_get_link_ksettings(struct net_device *ndev,
> +					     struct ethtool_link_ksettings *cmd)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 port_idx, speed;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	speed = aux_ops->get_speed(aux_dev, port_idx);
> +
> +	cmd->base.speed = speed;
> +	cmd->base.duplex = DUPLEX_FULL;
> +
> +	ethtool_link_ksettings_zero_link_mode(cmd, supported);
> +	ethtool_link_ksettings_zero_link_mode(cmd, advertising);
> +
> +	switch (speed) {
> +	case SPEED_100000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseLR4_ER4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseLR4_ER4_Full);
> +
> +		cmd->base.port = PORT_FIBRE;
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, Backplane);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Backplane);
> +		break;
> +	case SPEED_50000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseKR2_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseKR2_Full);
> +		break;
> +	case SPEED_25000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 25000baseCR_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 25000baseCR_Full);
> +		break;
> +	case SPEED_200000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseKR4_Full);
> +		break;
> +	case SPEED_400000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseKR4_Full);
> +		break;
> +	default:
> +		netdev_err(port->ndev, "unknown speed %d\n", speed);
> +		return -EFAULT;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
> +
> +	if (port->auto_neg_enable) {
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
> +		cmd->base.autoneg = AUTONEG_ENABLE;
> +		if (port->auto_neg_resolved)
> +			ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);

That looks odd. Care to explain?

> +	} else {
> +		cmd->base.autoneg = AUTONEG_DISABLE;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
> +
> +	if (port->pfc_enable)
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);

And is suspect that is wrong. Everybody gets pause wrong. Please
double check my previous posts about pause.

> +	if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
> +		netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
> +		rc = -EFAULT;
> +		goto out;

Don't say you support autoneg in supported if that is the case.

And EFAULT is about memory problems. EINVAL, maybe EPERM? or
EOPNOTSUPP.

	Andrew

Joe Damato June 14, 2024, 10:48 p.m. UTC | #2

On Thu, Jun 13, 2024 at 11:22:02AM +0300, Omer Shpigelman wrote:
> This ethernet driver is initialized via auxiliary bus by the hbl_cn
> driver.
> It serves mainly for control operations that are needed for AI scaling.
> 
> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> Co-developed-by: Abhilash K V <kvabhilash@habana.ai>
> Signed-off-by: Abhilash K V <kvabhilash@habana.ai>
> Co-developed-by: Andrey Agranovich <aagranovich@habana.ai>
> Signed-off-by: Andrey Agranovich <aagranovich@habana.ai>
> Co-developed-by: Bharat Jauhari <bjauhari@habana.ai>
> Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
> Co-developed-by: David Meriin <dmeriin@habana.ai>
> Signed-off-by: David Meriin <dmeriin@habana.ai>
> Co-developed-by: Sagiv Ozeri <sozeri@habana.ai>
> Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
> Co-developed-by: Zvika Yehudai <zyehudai@habana.ai>
> Signed-off-by: Zvika Yehudai <zyehudai@habana.ai>
> ---
>  MAINTAINERS                                   |    9 +
>  drivers/net/ethernet/intel/Kconfig            |   18 +
>  drivers/net/ethernet/intel/Makefile           |    1 +
>  drivers/net/ethernet/intel/hbl_en/Makefile    |    9 +
>  .../net/ethernet/intel/hbl_en/common/Makefile |    3 +
>  .../net/ethernet/intel/hbl_en/common/hbl_en.c | 1168 +++++++++++++++++
>  .../net/ethernet/intel/hbl_en/common/hbl_en.h |  206 +++
>  .../intel/hbl_en/common/hbl_en_dcbnl.c        |  101 ++
>  .../ethernet/intel/hbl_en/common/hbl_en_drv.c |  211 +++
>  .../intel/hbl_en/common/hbl_en_ethtool.c      |  452 +++++++
>  10 files changed, 2178 insertions(+)
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/Makefile
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/common/Makefile
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
>  create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 096439a62129..7301f38e9cfb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9617,6 +9617,15 @@ F:	include/linux/habanalabs/
>  F:	include/linux/net/intel/cn*
>  F:	include/linux/net/intel/gaudi2*
>  
> +HABANALABS ETHERNET DRIVER
> +M:	Omer Shpigelman <oshpigelman@habana.ai>
> +L:	netdev@vger.kernel.org
> +S:	Supported
> +W:	https://www.habana.ai
> +F:	Documentation/networking/device_drivers/ethernet/intel/hbl.rst
> +F:	drivers/net/ethernet/intel/hbl_en/
> +F:	include/linux/net/intel/cn*
> +
>  HACKRF MEDIA DRIVER
>  L:	linux-media@vger.kernel.org
>  S:	Orphan
> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> index 0d1b8a2bae99..5d07349348a0 100644
> --- a/drivers/net/ethernet/intel/Kconfig
> +++ b/drivers/net/ethernet/intel/Kconfig
> @@ -417,4 +417,22 @@ config HABANA_CN
>  	  To compile this driver as a module, choose M here. The module
>  	  will be called habanalabs_cn.
>  
> +config HABANA_EN
> +	tristate "HabanaLabs (an Intel Company) Ethernet driver"
> +	depends on NETDEVICES && ETHERNET && INET
> +	select HABANA_CN
> +	help
> +	  This driver enables Ethernet functionality for the network interfaces
> +	  that are part of the GAUDI ASIC family of AI Accelerators.
> +	  For more information on how to identify your adapter, go to the
> +	  Adapter & Driver ID Guide that can be located at:
> +
> +	  <http://support.intel.com>
> +
> +	  More specific information on configuring the driver is in
> +	  <file:Documentation/networking/device_drivers/ethernet/intel/hbl.rst>.
> +
> +	  To compile this driver as a module, choose M here. The module
> +	  will be called habanalabs_en.
> +
>  endif # NET_VENDOR_INTEL
> diff --git a/drivers/net/ethernet/intel/Makefile b/drivers/net/ethernet/intel/Makefile
> index 10049a28e336..ec62a0227897 100644
> --- a/drivers/net/ethernet/intel/Makefile
> +++ b/drivers/net/ethernet/intel/Makefile
> @@ -20,3 +20,4 @@ obj-$(CONFIG_FM10K) += fm10k/
>  obj-$(CONFIG_ICE) += ice/
>  obj-$(CONFIG_IDPF) += idpf/
>  obj-$(CONFIG_HABANA_CN) += hbl_cn/
> +obj-$(CONFIG_HABANA_EN) += hbl_en/
> diff --git a/drivers/net/ethernet/intel/hbl_en/Makefile b/drivers/net/ethernet/intel/hbl_en/Makefile
> new file mode 100644
> index 000000000000..695497ab93b6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/Makefile
> @@ -0,0 +1,9 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Makefile for HabanaLabs (an Intel Company) Ethernet network driver
> +#
> +
> +obj-$(CONFIG_HABANA_EN) := habanalabs_en.o
> +
> +include $(src)/common/Makefile
> +habanalabs_en-y += $(HBL_EN_COMMON_FILES)
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/Makefile b/drivers/net/ethernet/intel/hbl_en/common/Makefile
> new file mode 100644
> index 000000000000..a3ccb5dbf4a6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +HBL_EN_COMMON_FILES := common/hbl_en_drv.o common/hbl_en.o \
> +	common/hbl_en_ethtool.o common/hbl_en_dcbnl.o
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
> new file mode 100644
> index 000000000000..066be5ac2d84
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
> @@ -0,0 +1,1168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +#include <linux/inetdevice.h>
> +
> +#define TX_TIMEOUT			(5 * HZ)
> +#define PORT_RESET_TIMEOUT_MSEC		(60 * 1000ull) /* 60s */
> +
> +/**
> + * struct hbl_en_tx_pkt_work - used to schedule a work of a Tx packet.
> + * @tx_work: workqueue object to run when packet needs to be sent.
> + * @port: pointer to current port structure.
> + * @skb: copy of the packet to send.
> + */
> +struct hbl_en_tx_pkt_work {
> +	struct work_struct tx_work;
> +	struct hbl_en_port *port;
> +	struct sk_buff *skb;
> +};
> +
> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget);
> +static int hbl_en_port_open(struct hbl_en_port *port);
> +
> +static int hbl_en_ports_reopen(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int rc = 0, i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be that the port was shutdown by 'ip link set down' and there is no need
> +		 * in reopening it.
> +		 * Since we mark the ports as in reset even if they are disabled, we clear the flag
> +		 * here anyway.
> +		 * See hbl_en_ports_stop_prepare() for more info.
> +		 */
> +		if (!netif_running(port->ndev)) {
> +			atomic_set(&port->in_reset, 0);
> +			continue;
> +		}
> +
> +		rc = hbl_en_port_open(port);
> +
> +		atomic_set(&port->in_reset, 0);
> +
> +		if (rc)
> +			break;
> +	}
> +
> +	hdev->in_reset = false;
> +
> +	return rc;
> +}
> +
> +static void hbl_en_port_fini(struct hbl_en_port *port)
> +{
> +	if (port->rx_wq)
> +		destroy_workqueue(port->rx_wq);
> +}
> +
> +static int hbl_en_port_init(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u32 port_idx = port->idx;
> +	char wq_name[32];
> +	int rc;
> +
> +	if (hdev->poll_enable) {
> +		memset(wq_name, 0, sizeof(wq_name));
> +		snprintf(wq_name, sizeof(wq_name) - 1, "hbl%u-port%d-rx-wq", hdev->core_dev_id,
> +			 port_idx);
> +		port->rx_wq = alloc_ordered_workqueue(wq_name, 0);
> +		if (!port->rx_wq) {
> +			dev_err(hdev->dev, "Failed to allocate Rx WQ\n");
> +			rc = -ENOMEM;
> +			goto fail;
> +		}
> +	}
> +
> +	hbl_en_ethtool_init_coalesce(port);
> +
> +	return 0;
> +
> +fail:
> +	hbl_en_port_fini(port);
> +
> +	return rc;
> +}
> +
> +static void _hbl_en_set_port_status(struct hbl_en_port *port, bool up)
> +{
> +	struct net_device *ndev = port->ndev;
> +	u32 port_idx = port->idx;
> +
> +	if (up) {
> +		netif_carrier_on(ndev);
> +		netif_wake_queue(ndev);
> +	} else {
> +		netif_carrier_off(ndev);
> +		netif_stop_queue(ndev);
> +	}
> +
> +	/* Unless link events are getting through the EQ, no need to print about link down events
> +	 * during port reset
> +	 */
> +	if (port->hdev->has_eq || up || !atomic_read(&port->in_reset))
> +		netdev_info(port->ndev, "link %s, port %d\n", up ? "up" : "down", port_idx);
> +}
> +
> +static void hbl_en_set_port_status(struct hbl_aux_dev *aux_dev, u32 port_idx, bool up)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	_hbl_en_set_port_status(port, up);
> +}
> +
> +static bool hbl_en_is_port_open(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return port->is_initialized;
> +}
> +
> +/* get the src IP as it is done in devinet_ioctl() */
> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	struct in_device *in_dev;
> +	struct in_ifaddr *ifa;
> +	int rc = 0;
> +
> +	/* for the case where no src IP is configured */
> +	*src_ip = 0;
> +
> +	/* rtnl lock should be acquired in relevant flows before taking configuration lock */
> +	if (!rtnl_is_locked()) {
> +		netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	in_dev = __in_dev_get_rtnl(ndev);
> +	if (!in_dev) {
> +		netdev_err(port->ndev, "Failed to get IPv4 struct\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	ifa = rtnl_dereference(in_dev->ifa_list);
> +
> +	while (ifa) {
> +		if (!strcmp(ndev->name, ifa->ifa_label)) {
> +			/* convert the BE to native and later on it will be
> +			 * written to the HW as LE in QPC_SET
> +			 */
> +			*src_ip = be32_to_cpu(ifa->ifa_local);
> +			break;
> +		}
> +		ifa = rtnl_dereference(ifa->ifa_next);
> +	}
> +out:
> +	return rc;
> +}
> +
> +static void hbl_en_reset_stats(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	port->net_stats.rx_packets = 0;
> +	port->net_stats.tx_packets = 0;
> +	port->net_stats.rx_bytes = 0;
> +	port->net_stats.tx_bytes = 0;
> +	port->net_stats.tx_errors = 0;
> +	atomic64_set(&port->net_stats.rx_dropped, 0);
> +	atomic64_set(&port->net_stats.tx_dropped, 0);
> +}
> +
> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	u32 mtu;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(ndev, "port is in reset, can't get MTU\n");
> +		return 0;
> +	}
> +
> +	mtu = ndev->mtu;
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return mtu;
> +}
> +
> +static u32 hbl_en_get_pflags(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return port->pflags;
> +}
> +
> +static void hbl_en_set_dev_lpbk(struct hbl_aux_dev *aux_dev, u32 port_idx, bool enable)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +
> +	if (enable)
> +		ndev->features |= NETIF_F_LOOPBACK;
> +	else
> +		ndev->features &= ~NETIF_F_LOOPBACK;
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static int hbl_en_port_open_locked(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct net_device *ndev = port->ndev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (port->is_initialized)
> +		return 0;
> +
> +	if (!hdev->poll_enable)
> +		netif_napi_add(ndev, &port->napi, hbl_en_napi_poll);
> +
> +	rc = aux_ops->port_hw_init(aux_dev, port_idx);
> +	if (rc) {
> +		netdev_err(ndev, "Failed to configure the HW, rc %d\n", rc);
> +		goto hw_init_fail;
> +	}
> +
> +	if (!hdev->poll_enable)
> +		napi_enable(&port->napi);
> +
> +	rc = hdev->asic_funcs.eth_port_open(port);
> +	if (rc) {
> +		netdev_err(ndev, "Failed to init H/W, rc %d\n", rc);
> +		goto port_open_fail;
> +	}
> +
> +	rc = aux_ops->update_mtu(aux_dev, port_idx, ndev->mtu);
> +	if (rc) {
> +		netdev_err(ndev, "MTU update failed, rc %d\n", rc);
> +		goto update_mtu_fail;
> +	}
> +
> +	rc = aux_ops->phy_init(aux_dev, port_idx);
> +	if (rc) {
> +		netdev_err(ndev, "PHY init failed, rc %d\n", rc);
> +		goto phy_init_fail;
> +	}
> +
> +	netif_start_queue(ndev);
> +
> +	port->is_initialized = true;
> +
> +	return 0;
> +
> +phy_init_fail:
> +	/* no need to revert the MTU change, it will be updated on next port open */
> +update_mtu_fail:
> +	hdev->asic_funcs.eth_port_close(port);
> +port_open_fail:
> +	if (!hdev->poll_enable)
> +		napi_disable(&port->napi);
> +
> +	aux_ops->port_hw_fini(aux_dev, port_idx);
> +hw_init_fail:
> +	if (!hdev->poll_enable)
> +		netif_napi_del(&port->napi);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_port_open(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +	rc = hbl_en_port_open_locked(port);
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_open(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't open it\n");
> +		return -EBUSY;
> +	}
> +
> +	rc = hbl_en_port_open(port);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static void hbl_en_port_close_locked(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (!port->is_initialized)
> +		return;
> +
> +	port->is_initialized = false;
> +
> +	/* verify that the port is marked as closed before continuing */
> +	mb();
> +
> +	/* Print if not in hard reset flow e.g. from ip cmd */
> +	if (!hdev->in_reset && netif_carrier_ok(port->ndev))
> +		netdev_info(port->ndev, "port was closed\n");
> +
> +	/* disable the PHY here so no link changes will occur from this point forward */
> +	aux_ops->phy_fini(aux_dev, port_idx);
> +
> +	/* disable Tx SW flow */
> +	netif_carrier_off(port->ndev);
> +	netif_tx_disable(port->ndev);
> +
> +	/* stop Tx/Rx HW */
> +	aux_ops->port_hw_fini(aux_dev, port_idx);
> +
> +	/* disable Tx/Rx QPs */
> +	hdev->asic_funcs.eth_port_close(port);
> +
> +	/* stop Rx SW flow */
> +	if (hdev->poll_enable) {
> +		hbl_en_rx_poll_stop(port);
> +	} else {
> +		napi_disable(&port->napi);
> +		netif_napi_del(&port->napi);
> +	}
> +
> +	/* Explicitly count the port close operations as we don't get a link event for this.
> +	 * Upon port open we receive a link event, hence no additional action required.
> +	 */
> +	aux_ops->port_toggle_count(aux_dev, port_idx);
> +}
> +
> +static void hbl_en_port_close(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +	hbl_en_port_close_locked(port);
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static int __hbl_en_port_reset_locked(struct hbl_en_port *port)
> +{
> +	hbl_en_port_close_locked(port);
> +
> +	return hbl_en_port_open_locked(port);
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +int hbl_en_port_reset_locked(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return __hbl_en_port_reset_locked(port);
> +}
> +
> +int hbl_en_port_reset(struct hbl_en_port *port)
> +{
> +	hbl_en_port_close(port);
> +
> +	/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +	msleep(20);
> +
> +	return hbl_en_port_open(port);
> +}
> +
> +static int hbl_en_close(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	ktime_t timeout;
> +
> +	/* Looks like the return value of this function is not checked, so we can't just return
> +	 * EBUSY if the port is under reset. We need to wait until the reset is finished and then
> +	 * close the port. Otherwise the netdev will set the port as closed although port_close()
> +	 * wasn't called. Only if we waited long enough and the reset hasn't finished, we can return
> +	 * an error without actually closing the port as it is a fatal flow anyway.
> +	 */
> +	timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +	while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		/* If this is called from unregister_netdev() then the port was already closed and
> +		 * hence we can safely return.
> +		 * We could have just check the port_open boolean, but that might hide some future
> +		 * bugs. Hence it is better to use a dedicated flag for that.
> +		 */
> +		if (READ_ONCE(hdev->in_teardown))
> +			return 0;
> +
> +		usleep_range(50, 200);
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			netdev_crit(netdev,
> +				    "Timeout while waiting for port to finish reset, can't close it\n"
> +				    );
> +			return -EBUSY;
> +		}
> +	}
> +
> +	hbl_en_port_close(port);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return 0;
> +}
> +
> +/**
> + * hbl_en_ports_stop_prepare() - stop the Rx and Tx and synchronize with other reset flows.
> + * @aux_dev: habanalabs auxiliary device structure.
> + *
> + * This function makes sure that during the reset no packets will be processed and that
> + * ndo_open/ndo_close do not open/close the ports.
> + * A hard reset might occur right after the driver was loaded, which means before the ports
> + * initialization was finished. Therefore, even if the ports are not yet open, we mark it as in
> + * reset in order to avoid races. We clear the in reset flag later on when reopening the ports.
> + */
> +static void hbl_en_ports_stop_prepare(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	ktime_t timeout;
> +	int i;
> +
> +	/* Check if the ports where initialized. If not, we shouldn't mark them as in reset because
> +	 * they will fail to get opened.
> +	 */
> +	if (!hdev->is_initialized || hdev->in_reset)
> +		return;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* This function is competing with reset from ethtool/ip, so try to take the
> +		 * in_reset atomic and if we are already in a middle of reset, wait until reset
> +		 * function is finished.
> +		 * Reset function is designed to always finish (could take up to a few seconds in
> +		 * worst case).
> +		 * We mark also closed ports as in reset so they won't be able to get opened while
> +		 * the device in under reset.
> +		 */
> +
> +		timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +		while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +			usleep_range(50, 200);
> +			if (ktime_compare(ktime_get(), timeout) > 0) {
> +				netdev_crit(port->ndev,
> +					    "Timeout while waiting for port %d to finish reset\n",
> +					    port->idx);
> +				break;
> +			}
> +		}
> +	}
> +
> +	hdev->in_reset = true;
> +}
> +
> +static void hbl_en_ports_stop(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		if (netif_running(port->ndev))
> +			hbl_en_port_close(port);
> +	}
> +}
> +
> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't change MTU\n");
> +		return -EBUSY;
> +	}
> +
> +	if (netif_running(port->ndev)) {
> +		hbl_en_port_close(port);
> +
> +		/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +		msleep(20);
> +
> +		netdev->mtu = new_mtu;
> +
> +		rc = hbl_en_port_open(port);
> +		if (rc)
> +			netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
> +	} else {
> +		netdev->mtu = new_mtu;
> +	}
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +/* Swap source and destination MAC addresses */
> +static inline void swap_l2(char *buf)
> +{
> +	u16 *eth_hdr, tmp;
> +
> +	eth_hdr = (u16 *)buf;
> +	tmp = eth_hdr[0];
> +	eth_hdr[0] = eth_hdr[3];
> +	eth_hdr[3] = tmp;
> +	tmp = eth_hdr[1];
> +	eth_hdr[1] = eth_hdr[4];
> +	eth_hdr[4] = tmp;
> +	tmp = eth_hdr[2];
> +	eth_hdr[2] = eth_hdr[5];
> +	eth_hdr[5] = tmp;
> +}
> +
> +/* Swap source and destination IP addresses
> + */
> +static inline void swap_l3(char *buf)
> +{
> +	u32 tmp;
> +
> +	/* skip the Ethernet header and the IP header till source IP address */
> +	buf += ETH_HLEN + 12;
> +	tmp = ((u32 *)buf)[0];
> +	((u32 *)buf)[0] = ((u32 *)buf)[1];
> +	((u32 *)buf)[1] = tmp;
> +}
> +
> +static void do_tx_swap(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u16 *tmp_buff = (u16 *)skb->data;
> +	u32 port_idx = port->idx;
> +
> +	/* First, let's print the SKB we got */
> +	dev_dbg_ratelimited(hdev->dev,
> +			    "Send [P%d]: dst-mac:%04x%04x%04x, src-mac:%04x%04x%04x, eth-type:%04x, len:%u\n",
> +			    port_idx, swab16(tmp_buff[0]), swab16(tmp_buff[1]), swab16(tmp_buff[2]),
> +			    swab16(tmp_buff[3]), swab16(tmp_buff[4]), swab16(tmp_buff[5]),
> +			    swab16(tmp_buff[6]), skb->len);
> +
> +	/* Before submit it to HW, in case this is ipv4 pkt, swap eth/ip addresses.
> +	 * that way, we may send ECMP (ping) to ourselves in LB cases.
> +	 */
> +	swap_l2(skb->data);
> +	if (swab16(tmp_buff[6]) == ETH_P_IP)
> +		swap_l3(skb->data);
> +}
> +
> +static bool is_pkt_swap_enabled(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	return aux_ops->is_eth_lpbk(aux_dev);
> +}
> +
> +static bool is_tx_disabled(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	return aux_ops->get_mac_lpbk(aux_dev, port_idx) && !is_pkt_swap_enabled(hdev);
> +}
> +
> +static netdev_tx_t hbl_en_handle_tx(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	netdev_tx_t ret;
> +
> +	if (skb->len <= 0 || is_tx_disabled(port))
> +		goto free_skb;
> +
> +	if (skb->len > hdev->max_frm_len) {
> +		netdev_err(port->ndev, "Tx pkt size %uB exceeds maximum of %uB\n", skb->len,
> +			   hdev->max_frm_len);
> +		goto free_skb;
> +	}
> +
> +	if (is_pkt_swap_enabled(hdev))
> +		do_tx_swap(port, skb);
> +
> +	/* Pad the ethernet packets to the minimum frame size as the NIC hw doesn't do it.
> +	 * eth_skb_pad() frees the packet on failure, so just increment the dropped counter and
> +	 * return as success to avoid a retry.
> +	 */
> +	if (skb_put_padto(skb, hdev->pad_size)) {
> +		dev_err_ratelimited(hdev->dev, "Padding failed, the skb is dropped\n");
> +		atomic64_inc(&port->net_stats.tx_dropped);
> +		return NETDEV_TX_OK;
> +	}
> +
> +	ret = hdev->asic_funcs.write_pkt_to_hw(port, skb);
> +	if (ret == NETDEV_TX_OK) {
> +		port->net_stats.tx_packets++;
> +		port->net_stats.tx_bytes += skb->len;
> +	}
> +
> +	return ret;
> +
> +free_skb:
> +	dev_kfree_skb_any(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static netdev_tx_t hbl_en_start_xmit(struct sk_buff *skb, struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev;
> +
> +	hdev = port->hdev;
> +
> +	return hbl_en_handle_tx(port, skb);
> +}
> +
> +static int hbl_en_set_port_mac_loopback(struct hbl_en_port *port, bool enable)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct net_device *ndev = port->ndev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	rc = aux_ops->set_mac_lpbk(aux_dev, port_idx, enable);
> +	if (rc)
> +		return rc;
> +
> +	netdev_info(ndev, "port %u: mac loopback is %s\n", port_idx,
> +		    enable ? "enabled" : "disabled");
> +
> +	if (netif_running(ndev)) {
> +		rc = hbl_en_port_reset(port);
> +		if (rc) {
> +			netdev_err(ndev, "Failed to reset port %u, rc %d\n", port_idx, rc);
> +			return rc;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int hbl_en_set_features(struct net_device *netdev, netdev_features_t features)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	netdev_features_t changed;
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port %d is in reset, can't update settings", port->idx);
> +		return -EBUSY;
> +	}
> +
> +	changed = netdev->features ^ features;
> +
> +	if (changed & NETIF_F_LOOPBACK)
> +		rc = hbl_en_set_port_mac_loopback(port, !!(features & NETIF_F_LOOPBACK));
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static void hbl_en_handle_tx_timeout(struct net_device *netdev, unsigned int txqueue)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +
> +	port->net_stats.tx_errors++;
> +	atomic64_inc(&port->net_stats.tx_dropped);
> +}
> +
> +static void hbl_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(dev);
> +
> +	stats->rx_bytes = port->net_stats.rx_bytes;
> +	stats->tx_bytes = port->net_stats.tx_bytes;
> +	stats->rx_packets = port->net_stats.rx_packets;
> +	stats->tx_packets = port->net_stats.tx_packets;
> +	stats->tx_errors = port->net_stats.tx_errors;
> +	stats->tx_dropped = (u64)atomic64_read(&port->net_stats.tx_dropped);
> +	stats->rx_dropped = (u64)atomic64_read(&port->net_stats.rx_dropped);
> +}
> +
> +static const struct net_device_ops hbl_en_netdev_ops = {
> +	.ndo_open = hbl_en_open,
> +	.ndo_stop = hbl_en_close,
> +	.ndo_start_xmit = hbl_en_start_xmit,
> +	.ndo_validate_addr = eth_validate_addr,
> +	.ndo_change_mtu = hbl_en_change_mtu,
> +	.ndo_set_features = hbl_en_set_features,
> +	.ndo_get_stats64 = hbl_en_get_stats64,
> +	.ndo_tx_timeout = hbl_en_handle_tx_timeout,
> +};
> +
> +static void hbl_en_set_ops(struct net_device *ndev)
> +{
> +	ndev->netdev_ops = &hbl_en_netdev_ops;
> +	ndev->ethtool_ops = hbl_en_ethtool_get_ops(ndev);
> +#ifdef CONFIG_DCB
> +	ndev->dcbnl_ops = &hbl_en_dcbnl_ops;
> +#endif
> +}
> +
> +static int hbl_en_port_register(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	struct hbl_en_port **ptr;
> +	struct net_device *ndev;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	ndev = alloc_etherdev(sizeof(struct hbl_en_port *));
> +	if (!ndev) {
> +		dev_err(hdev->dev, "netdevice %d alloc failed\n", port_idx);
> +		return -ENOMEM;
> +	}
> +
> +	port->ndev = ndev;
> +	SET_NETDEV_DEV(ndev, &hdev->pdev->dev);
> +	ptr = netdev_priv(ndev);
> +	*ptr = port;
> +
> +	/* necessary for creating multiple interfaces */
> +	ndev->dev_port = port_idx;
> +
> +	hbl_en_set_ops(ndev);
> +
> +	ndev->watchdog_timeo = TX_TIMEOUT;
> +	ndev->min_mtu = hdev->min_raw_mtu;
> +	ndev->max_mtu = hdev->max_raw_mtu;
> +
> +	/* Add loopback capability to the device. */
> +	ndev->hw_features |= NETIF_F_LOOPBACK;
> +
> +	/* If this port was set to loopback, set it also to the ndev features */
> +	if (aux_ops->get_mac_lpbk(aux_dev, port_idx))
> +		ndev->features |= NETIF_F_LOOPBACK;
> +
> +	eth_hw_addr_set(ndev, port->mac_addr);
> +
> +	/* It's more an intelligent poll wherein, we enable the Rx completion EQE event and then
> +	 * start the poll from there.
> +	 * Inside the polling thread, we read packets from hardware and then reschedule the poll
> +	 * only if there are more packets to be processed. Else we re-enable the CQ Arm interrupt
> +	 * and exit the poll.
> +	 */
> +	if (hdev->poll_enable)
> +		hbl_en_rx_poll_trigger_init(port);
> +
> +	netif_carrier_off(ndev);
> +
> +	rc = register_netdev(ndev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Could not register netdevice %d\n", port_idx);
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	if (ndev) {
> +		free_netdev(ndev);
> +		port->ndev = NULL;
> +	}
> +
> +	return rc;
> +}
> +
> +static void dump_swap_pkt(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u16 *tmp_buff = (u16 *)skb->data;
> +	u32 port_idx = port->idx;
> +
> +	/* The SKB is ready now (before stripping-out the L2), print its content */
> +	dev_dbg_ratelimited(hdev->dev,
> +			    "Recv [P%d]: dst-mac:%04x%04x%04x, src-mac:%04x%04x%04x, eth-type:%04x, len:%u\n",
> +			    port_idx, swab16(tmp_buff[0]), swab16(tmp_buff[1]), swab16(tmp_buff[2]),
> +			    swab16(tmp_buff[3]), swab16(tmp_buff[4]), swab16(tmp_buff[5]),
> +			    swab16(tmp_buff[6]), skb->len);
> +}
> +
> +int hbl_en_handle_rx(struct hbl_en_port *port, int budget)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	enum hbl_en_eth_pkt_status pkt_status;
> +	struct net_device *ndev = port->ndev;
> +	int rc, pkt_count = 0;
> +	struct sk_buff *skb;
> +	void *pkt_addr;
> +	u32 pkt_size;
> +
> +	if (!netif_carrier_ok(ndev))
> +		return 0;
> +
> +	while (pkt_count < budget) {
> +		pkt_status = hdev->asic_funcs.read_pkt_from_hw(port, &pkt_addr, &pkt_size);
> +
> +		if (pkt_status == ETH_PKT_NONE)
> +			break;
> +
> +		pkt_count++;
> +
> +		if (pkt_status == ETH_PKT_DROP) {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +			continue;
> +		}
> +
> +		if (hdev->poll_enable)
> +			skb = __netdev_alloc_skb_ip_align(ndev, pkt_size, GFP_KERNEL);
> +		else
> +			skb = napi_alloc_skb(&port->napi, pkt_size);
> +
> +		if (!skb) {
> +			atomic64_inc(&port->net_stats.rx_dropped);

It seems like buffer exhaustion (!skb) would be rx_missed_errors?

The documentation in include/uapi/linux/if_link.h:

 * @rx_dropped: Number of packets received but not processed,
 *   e.g. due to lack of resources or unsupported protocol.
 *   For hardware interfaces this counter may include packets discarded
 *   due to L2 address filtering but should not include packets dropped
 *   by the device due to buffer exhaustion which are counted separately in
 *   @rx_missed_errors (since procfs folds those two counters together).

But, I don't know much about your hardware so I could be wrong.

> +			break;
> +		}
> +
> +		skb_copy_to_linear_data(skb, pkt_addr, pkt_size);
> +		skb_put(skb, pkt_size);
> +
> +		if (is_pkt_swap_enabled(hdev))
> +			dump_swap_pkt(port, skb);
> +
> +		skb->protocol = eth_type_trans(skb, ndev);
> +
> +		/* Zero the packet buffer memory to avoid leak in case of wrong
> +		 * size is used when next packet populates the same memory
> +		 */
> +		memset(pkt_addr, 0, pkt_size);
> +
> +		/* polling is done in thread context and hence BH should be disabled */
> +		if (hdev->poll_enable)
> +			local_bh_disable();
> +
> +		rc = netif_receive_skb(skb);

Is there any reason in particular to call netif_receive_skb instead of
napi_gro_receive ?

> +
> +		if (hdev->poll_enable)
> +			local_bh_enable();
> +
> +		if (rc == NET_RX_SUCCESS) {
> +			port->net_stats.rx_packets++;
> +			port->net_stats.rx_bytes += pkt_size;
> +		} else {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +		}
> +	}
> +
> +	return pkt_count;
> +}
> +
> +static bool __hbl_en_rx_poll_schedule(struct hbl_en_port *port, unsigned long delay)
> +{
> +	return queue_delayed_work(port->rx_wq, &port->rx_poll_work, delay);
> +}
> +
> +static void hbl_en_rx_poll_work(struct work_struct *work)
> +{
> +	struct hbl_en_port *port = container_of(work, struct hbl_en_port, rx_poll_work.work);
> +	struct hbl_en_device *hdev = port->hdev;
> +	int pkt_count;
> +
> +	pkt_count = hbl_en_handle_rx(port, NAPI_POLL_WEIGHT);
> +
> +	/* Reschedule the poll if we have consumed budget which means we still have packets to
> +	 * process. Else re-enable the Rx IRQs and exit the work.
> +	 */
> +	if (pkt_count < NAPI_POLL_WEIGHT)
> +		hdev->asic_funcs.reenable_rx_irq(port);
> +	else
> +		__hbl_en_rx_poll_schedule(port, 0);
> +}
> +
> +/* Rx poll init and trigger routines are used in event-driven setups where
> + * Rx polling is initialized once during init or open and started/triggered by the event handler.
> + */
> +void hbl_en_rx_poll_trigger_init(struct hbl_en_port *port)
> +{
> +	INIT_DELAYED_WORK(&port->rx_poll_work, hbl_en_rx_poll_work);
> +}
> +
> +bool hbl_en_rx_poll_start(struct hbl_en_port *port)
> +{
> +	return __hbl_en_rx_poll_schedule(port, msecs_to_jiffies(1));
> +}
> +
> +void hbl_en_rx_poll_stop(struct hbl_en_port *port)
> +{
> +	cancel_delayed_work_sync(&port->rx_poll_work);
> +}
> +
> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget)
> +{
> +	struct hbl_en_port *port = container_of(napi, struct hbl_en_port, napi);
> +	struct hbl_en_device *hdev = port->hdev;
> +	int pkt_count;
> +
> +	/* exit if we are called by netpoll as we free the Tx ring via EQ (if enabled) */
> +	if (!budget)
> +		return 0;
> +
> +	pkt_count = hbl_en_handle_rx(port, budget);
> +
> +	/* If budget not fully consumed, exit the polling mode */
> +	if (pkt_count < budget) {
> +		napi_complete_done(napi, pkt_count);

I believe this code might be incorrect and that it should be:

  if (napi_complete_done(napi, pkt_done))
     hdev->asic_funcs.reenable_rx_irq(port);

> +		hdev->asic_funcs.reenable_rx_irq(port);
> +	}
> +
> +	return pkt_count;
> +}
> +
> +static void hbl_en_port_unregister(struct hbl_en_port *port)
> +{
> +	struct net_device *ndev = port->ndev;
> +
> +	unregister_netdev(ndev);
> +	free_netdev(ndev);
> +	port->ndev = NULL;
> +}
> +
> +static int hbl_en_set_asic_funcs(struct hbl_en_device *hdev)
> +{
> +	switch (hdev->asic_type) {
> +	case HBL_ASIC_GAUDI2:
> +	default:
> +		dev_err(hdev->dev, "Unrecognized ASIC type %d\n", hdev->asic_type);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static void hbl_en_handle_eqe(struct hbl_aux_dev *aux_dev, u32 port, struct hbl_cn_eqe *eqe)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	hdev->asic_funcs.handle_eqe(aux_dev, port, eqe);
> +}
> +
> +static void hbl_en_set_aux_ops(struct hbl_en_device *hdev, bool enable)
> +{
> +	struct hbl_en_aux_ops *aux_ops = hdev->aux_dev->aux_ops;
> +
> +	if (enable) {
> +		aux_ops->ports_reopen = hbl_en_ports_reopen;
> +		aux_ops->ports_stop_prepare = hbl_en_ports_stop_prepare;
> +		aux_ops->ports_stop = hbl_en_ports_stop;
> +		aux_ops->set_port_status = hbl_en_set_port_status;
> +		aux_ops->is_port_open = hbl_en_is_port_open;
> +		aux_ops->get_src_ip = hbl_en_get_src_ip;
> +		aux_ops->reset_stats = hbl_en_reset_stats;
> +		aux_ops->get_mtu = hbl_en_get_mtu;
> +		aux_ops->get_pflags = hbl_en_get_pflags;
> +		aux_ops->set_dev_lpbk = hbl_en_set_dev_lpbk;
> +		aux_ops->handle_eqe = hbl_en_handle_eqe;
> +	} else {
> +		aux_ops->ports_reopen = NULL;
> +		aux_ops->ports_stop_prepare = NULL;
> +		aux_ops->ports_stop = NULL;
> +		aux_ops->set_port_status = NULL;
> +		aux_ops->is_port_open = NULL;
> +		aux_ops->get_src_ip = NULL;
> +		aux_ops->reset_stats = NULL;
> +		aux_ops->get_mtu = NULL;
> +		aux_ops->get_pflags = NULL;
> +		aux_ops->set_dev_lpbk = NULL;
> +		aux_ops->handle_eqe = NULL;
> +	}
> +}
> +
> +int hbl_en_dev_init(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_asic_funcs *asic_funcs = &hdev->asic_funcs;
> +	struct hbl_en_port *port;
> +	int rc, i, port_cnt = 0;
> +
> +	/* must be called before the call to dev_init() */
> +	rc = hbl_en_set_asic_funcs(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to set aux ops\n");
> +		return rc;
> +	}
> +
> +	rc = asic_funcs->dev_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "device init failed\n");
> +		return rc;
> +	}
> +
> +	/* init the function pointers here before calling hbl_en_port_register which sets up
> +	 * net_device_ops, and its ops might start getting called.
> +	 * If any failure is encountered, these will be made NULL and the core driver won't call
> +	 * them.
> +	 */
> +	hbl_en_set_aux_ops(hdev, true);
> +
> +	/* Port register depends on the above initialization so it must be called here and not
> +	 * before that.
> +	 */
> +	for (i = 0; i < hdev->max_num_of_ports; i++, port_cnt++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		rc = hbl_en_port_init(port);
> +		if (rc) {
> +			dev_err(hdev->dev, "port init failed\n");
> +			goto unregister_ports;
> +		}
> +
> +		rc = hbl_en_port_register(port);
> +		if (rc) {
> +			dev_err(hdev->dev, "port register failed\n");
> +
> +			hbl_en_port_fini(port);
> +			goto unregister_ports;
> +		}
> +	}
> +
> +	hdev->is_initialized = true;
> +
> +	return 0;
> +
> +unregister_ports:
> +	for (i = 0; i < port_cnt; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		hbl_en_port_unregister(port);
> +		hbl_en_port_fini(port);
> +	}
> +
> +	hbl_en_set_aux_ops(hdev, false);
> +
> +	asic_funcs->dev_fini(hdev);
> +
> +	return rc;
> +}
> +
> +void hbl_en_dev_fini(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_asic_funcs *asic_funcs = &hdev->asic_funcs;
> +	struct hbl_en_port *port;
> +	int i;
> +
> +	hdev->in_teardown = true;
> +
> +	if (!hdev->is_initialized)
> +		return;
> +
> +	hdev->is_initialized = false;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be this cleanup flow is called after a failed init flow.
> +		 * Hence we need to check that we indeed have a netdev to unregister.
> +		 */
> +		if (!port->ndev)
> +			continue;
> +
> +		hbl_en_port_unregister(port);
> +		hbl_en_port_fini(port);
> +	}
> +
> +	hbl_en_set_aux_ops(hdev, false);
> +
> +	asic_funcs->dev_fini(hdev);
> +}
> +
> +dma_addr_t hbl_en_dma_map(struct hbl_en_device *hdev, void *addr, int len)
> +{
> +	dma_addr_t dma_addr;
> +
> +	if (hdev->dma_map_support)
> +		dma_addr = dma_map_single(&hdev->pdev->dev, addr, len, DMA_TO_DEVICE);
> +	else
> +		dma_addr = virt_to_phys(addr);
> +
> +	return dma_addr;
> +}
> +
> +void hbl_en_dma_unmap(struct hbl_en_device *hdev, dma_addr_t dma_addr, int len)
> +{
> +	if (hdev->dma_map_support)
> +		dma_unmap_single(&hdev->pdev->dev, dma_addr, len, DMA_TO_DEVICE);
> +}
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
> new file mode 100644
> index 000000000000..15504c1f3cfb
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
> @@ -0,0 +1,206 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#ifndef HABANALABS_EN_H_
> +#define HABANALABS_EN_H_
> +
> +#include <linux/net/intel/cn.h>
> +
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
> +
> +#define HBL_EN_NAME			"habanalabs_en"
> +
> +#define HBL_EN_PORT(aux_dev, idx)	(&(((struct hbl_en_device *)(aux_dev)->priv)->ports[(idx)]))
> +
> +#define hbl_netdev_priv(ndev) \
> +({ \
> +	typecheck(struct net_device *, ndev); \
> +	*(struct hbl_en_port **)netdev_priv(ndev); \
> +})
> +
> +/**
> + * enum hbl_en_eth_pkt_status - status of Rx Ethernet packet.
> + * ETH_PKT_OK: packet was received successfully.
> + * ETH_PKT_DROP: packet should be dropped.
> + * ETH_PKT_NONE: no available packet.
> + */
> +enum hbl_en_eth_pkt_status {
> +	ETH_PKT_OK,
> +	ETH_PKT_DROP,
> +	ETH_PKT_NONE
> +};
> +
> +/**
> + * struct hbl_en_net_stats - stats of Ethernet interface.
> + * rx_packets: number of packets received.
> + * tx_packets: number of packets sent.
> + * rx_bytes: total bytes of data received.
> + * tx_bytes: total bytes of data sent.
> + * tx_errors: number of errors in the TX.
> + * rx_dropped: number of packets dropped by the RX.
> + * tx_dropped: number of packets dropped by the TX.
> + */
> +struct hbl_en_net_stats {
> +	u64 rx_packets;
> +	u64 tx_packets;
> +	u64 rx_bytes;
> +	u64 tx_bytes;
> +	u64 tx_errors;
> +	atomic64_t rx_dropped;
> +	atomic64_t tx_dropped;
> +};
> +
> +/**
> + * struct hbl_en_port - manage port common structure.
> + * @hdev: habanalabs Ethernet device structure.
> + * @ndev: network device.
> + * @rx_wq: WQ for Rx poll when we cannot schedule NAPI poll.
> + * @mac_addr: HW MAC addresses.
> + * @asic_specific: ASIC specific port structure.
> + * @napi: New API structure.
> + * @rx_poll_work: Rx work for polling mode.
> + * @net_stats: statistics of the ethernet interface.
> + * @in_reset: true if the NIC was marked as in reset, false otherwise. Used to avoid an additional
> + *            stopping of the NIC if a hard reset was re-initiated.
> + * @pflags: ethtool private flags bit mask.
> + * @idx: index of this specific port.
> + * @rx_max_coalesced_frames: Maximum number of packets to receive before an RX interrupt.
> + * @tx_max_coalesced_frames: Maximum number of packets to be sent before a TX interrupt.
> + * @rx_coalesce_usecs: How many usecs to delay an RX interrupt after a packet arrives.
> + * @is_initialized: true if the port H/W is initialized, false otherwise.
> + * @pfc_enable: true if this port supports Priority Flow Control, false otherwise.
> + * @auto_neg_enable: is autoneg enabled.
> + * @auto_neg_resolved: was autoneg phase finished successfully.
> + */
> +struct hbl_en_port {
> +	struct hbl_en_device *hdev;
> +	struct net_device *ndev;
> +	struct workqueue_struct *rx_wq;
> +	char *mac_addr;
> +	void *asic_specific;
> +	struct napi_struct napi;
> +	struct delayed_work rx_poll_work;
> +	struct hbl_en_net_stats net_stats;
> +	atomic_t in_reset;
> +	u32 pflags;
> +	u32 idx;
> +	u32 rx_max_coalesced_frames;
> +	u32 tx_max_coalesced_frames;
> +	u16 rx_coalesce_usecs;
> +	u8 is_initialized;
> +	u8 pfc_enable;
> +	u8 auto_neg_enable;
> +	u8 auto_neg_resolved;
> +};
> +
> +/**
> + * struct hbl_en_asic_funcs - ASIC specific Ethernet functions.
> + * @dev_init: device init.
> + * @dev_fini: device cleanup.
> + * @reenable_rx_irq: re-enable Rx interrupts.
> + * @eth_port_open: initialize and open the Ethernet port.
> + * @eth_port_close: close the Ethernet port.
> + * @write_pkt_to_hw: write skb to HW.
> + * @read_pkt_from_hw: read pkt from HW.
> + * @get_pfc_cnts: get PFC counters.
> + * @set_coalesce: set Tx/Rx coalesce config in HW.
> + * @get_rx_ring size: get max number of elements the Rx ring can contain.
> + * @handle_eqe: Handle a received event.
> + */
> +struct hbl_en_asic_funcs {
> +	int (*dev_init)(struct hbl_en_device *hdev);
> +	void (*dev_fini)(struct hbl_en_device *hdev);
> +	void (*reenable_rx_irq)(struct hbl_en_port *port);
> +	int (*eth_port_open)(struct hbl_en_port *port);
> +	void (*eth_port_close)(struct hbl_en_port *port);
> +	netdev_tx_t (*write_pkt_to_hw)(struct hbl_en_port *port, struct sk_buff *skb);
> +	int (*read_pkt_from_hw)(struct hbl_en_port *port, void **pkt_addr, u32 *pkt_size);
> +	void (*get_pfc_cnts)(struct hbl_en_port *port, void *ptr);
> +	int (*set_coalesce)(struct hbl_en_port *port);
> +	int (*get_rx_ring_size)(struct hbl_en_port *port);
> +	void (*handle_eqe)(struct hbl_aux_dev *aux_dev, u32 port_idx, struct hbl_cn_eqe *eqe);
> +};
> +
> +/**
> + * struct hbl_en_device - habanalabs Ethernet device structure.
> + * @pdev: pointer to PCI device.
> + * @dev: related kernel basic device structure.
> + * @ports: array of all ports manage common structures.
> + * @aux_dev: pointer to auxiliary device.
> + * @asic_specific: ASIC specific device structure.
> + * @fw_ver: FW version.
> + * @qsfp_eeprom: QSFPD EEPROM info.
> + * @mac_addr: array of all MAC addresses.
> + * @asic_funcs: ASIC specific Ethernet functions.
> + * @asic_type: ASIC specific type.
> + * @ports_mask: mask of available ports.
> + * @auto_neg_mask: mask of port with Autonegotiation enabled.
> + * @port_reset_timeout: max time in seconds for a port reset flow to finish.
> + * @pending_reset_long_timeout: long timeout for pending hard reset to finish in seconds.
> + * @max_frm_len: maximum allowed frame length.
> + * @raw_elem_size: size of element in raw buffers.
> + * @max_raw_mtu: maximum MTU size for raw packets.
> + * @min_raw_mtu: minimum MTU size for raw packets.
> + * @pad_size: the pad size in bytes for the skb to transmit.
> + * @core_dev_id: core device ID.
> + * @max_num_of_ports: max number of available ports;
> + * @in_reset: is the entire NIC currently under reset.
> + * @poll_enable: Enable Rx polling rather than IRQ + NAPI.
> + * @in_teardown: true if the NIC is in teardown (during device remove).
> + * @is_initialized: was the device initialized successfully.
> + * @has_eq: true if event queue is supported.
> + * @dma_map_support: HW supports DMA mapping.
> + */
> +struct hbl_en_device {
> +	struct pci_dev *pdev;
> +	struct device *dev;
> +	struct hbl_en_port *ports;
> +	struct hbl_aux_dev *aux_dev;
> +	void *asic_specific;
> +	char *fw_ver;
> +	char *qsfp_eeprom;
> +	char *mac_addr;
> +	struct hbl_en_asic_funcs asic_funcs;
> +	enum hbl_cn_asic_type asic_type;
> +	u64 ports_mask;
> +	u64 auto_neg_mask;
> +	u32 port_reset_timeout;
> +	u32 pending_reset_long_timeout;
> +	u32 max_frm_len;
> +	u32 raw_elem_size;
> +	u16 max_raw_mtu;
> +	u16 min_raw_mtu;
> +	u16 pad_size;
> +	u16 core_dev_id;
> +	u8 max_num_of_ports;
> +	u8 in_reset;
> +	u8 poll_enable;
> +	u8 in_teardown;
> +	u8 is_initialized;
> +	u8 has_eq;
> +	u8 dma_map_support;
> +};
> +
> +int hbl_en_dev_init(struct hbl_en_device *hdev);
> +void hbl_en_dev_fini(struct hbl_en_device *hdev);
> +
> +const struct ethtool_ops *hbl_en_ethtool_get_ops(struct net_device *ndev);
> +void hbl_en_ethtool_init_coalesce(struct hbl_en_port *port);
> +
> +extern const struct dcbnl_rtnl_ops hbl_en_dcbnl_ops;
> +
> +bool hbl_en_rx_poll_start(struct hbl_en_port *port);
> +void hbl_en_rx_poll_stop(struct hbl_en_port *port);
> +void hbl_en_rx_poll_trigger_init(struct hbl_en_port *port);
> +int hbl_en_port_reset(struct hbl_en_port *port);
> +int hbl_en_port_reset_locked(struct hbl_aux_dev *aux_dev, u32 port_idx);
> +int hbl_en_handle_rx(struct hbl_en_port *port, int budget);
> +dma_addr_t hbl_en_dma_map(struct hbl_en_device *hdev, void *addr, int len);
> +void hbl_en_dma_unmap(struct hbl_en_device *hdev, dma_addr_t dma_addr, int len);
> +
> +#endif /* HABANALABS_EN_H_ */
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
> new file mode 100644
> index 000000000000..5d718579a2b6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
> @@ -0,0 +1,101 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +
> +#define PFC_PRIO_MASK_ALL	GENMASK(HBL_EN_PFC_PRIO_NUM - 1, 0)
> +#define PFC_PRIO_MASK_NONE	0
> +
> +#ifdef CONFIG_DCB
> +static int hbl_en_dcbnl_ieee_getpfc(struct net_device *netdev, struct ieee_pfc *pfc)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev;
> +	u32 port_idx;
> +
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_dbg_ratelimited(hdev->dev, "port %d is in reset, can't get PFC", port_idx);
> +		return -EBUSY;
> +	}
> +
> +	pfc->pfc_en = port->pfc_enable ? PFC_PRIO_MASK_ALL : PFC_PRIO_MASK_NONE;
> +	pfc->pfc_cap = HBL_EN_PFC_PRIO_NUM;
> +
> +	hdev->asic_funcs.get_pfc_cnts(port, pfc);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return 0;
> +}
> +
> +static int hbl_en_dcbnl_ieee_setpfc(struct net_device *netdev, struct ieee_pfc *pfc)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	u8 curr_pfc_en;
> +	u32 port_idx;
> +	int rc = 0;
> +
> +	hdev = port->hdev;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	port_idx = port->idx;
> +
> +	if (pfc->pfc_en & ~PFC_PRIO_MASK_ALL) {
> +		dev_dbg_ratelimited(hdev->dev, "PFC supports %d priorities only, port %d\n",
> +				    HBL_EN_PFC_PRIO_NUM, port_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (pfc->pfc_en != PFC_PRIO_MASK_NONE && pfc->pfc_en != PFC_PRIO_MASK_ALL) {
> +		dev_dbg_ratelimited(hdev->dev,
> +				    "PFC should be enabled/disabled on all priorities, port %d\n",
> +				    port_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_dbg_ratelimited(hdev->dev, "port %d is in reset, can't set PFC", port_idx);
> +		return -EBUSY;
> +	}
> +
> +	curr_pfc_en = port->pfc_enable ? PFC_PRIO_MASK_ALL : PFC_PRIO_MASK_NONE;
> +
> +	if (pfc->pfc_en == curr_pfc_en)
> +		goto out;
> +
> +	port->pfc_enable = !port->pfc_enable;
> +
> +	rc = aux_ops->set_pfc(aux_dev, port_idx, port->pfc_enable);
> +
> +out:
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static u8 hbl_en_dcbnl_getdcbx(struct net_device *netdev)
> +{
> +	return DCB_CAP_DCBX_HOST | DCB_CAP_DCBX_VER_IEEE;
> +}
> +
> +static u8 hbl_en_dcbnl_setdcbx(struct net_device *netdev, u8 mode)
> +{
> +	return !(mode == (DCB_CAP_DCBX_HOST | DCB_CAP_DCBX_VER_IEEE));
> +}
> +
> +const struct dcbnl_rtnl_ops hbl_en_dcbnl_ops = {
> +	.ieee_getpfc	= hbl_en_dcbnl_ieee_getpfc,
> +	.ieee_setpfc	= hbl_en_dcbnl_ieee_setpfc,
> +	.getdcbx	= hbl_en_dcbnl_getdcbx,
> +	.setdcbx	= hbl_en_dcbnl_setdcbx
> +};
> +#endif
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
> new file mode 100644
> index 000000000000..23a87d36ded5
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#define pr_fmt(fmt)		"habanalabs_en: " fmt
> +
> +#include "hbl_en.h"
> +
> +#include <linux/module.h>
> +#include <linux/auxiliary_bus.h>
> +
> +#define HBL_DRIVER_AUTHOR	"HabanaLabs Kernel Driver Team"
> +
> +#define HBL_DRIVER_DESC		"HabanaLabs AI accelerators Ethernet driver"
> +
> +MODULE_AUTHOR(HBL_DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(HBL_DRIVER_DESC);
> +MODULE_LICENSE("GPL");
> +
> +static bool poll_enable;
> +
> +module_param(poll_enable, bool, 0444);
> +MODULE_PARM_DESC(poll_enable,
> +		 "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
> +
> +static int hdev_init(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_aux_data *aux_data = aux_dev->aux_data;
> +	struct hbl_en_port *ports, *port;
> +	struct hbl_en_device *hdev;
> +	int rc, i;
> +
> +	hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
> +	if (!hdev)
> +		return -ENOMEM;
> +
> +	ports = kcalloc(aux_data->max_num_of_ports, sizeof(*ports), GFP_KERNEL);
> +	if (!ports) {
> +		rc = -ENOMEM;
> +		goto ports_alloc_fail;
> +	}
> +
> +	aux_dev->priv = hdev;
> +	hdev->aux_dev = aux_dev;
> +	hdev->ports = ports;
> +	hdev->pdev = aux_data->pdev;
> +	hdev->dev = aux_data->dev;
> +	hdev->ports_mask = aux_data->ports_mask;
> +	hdev->auto_neg_mask = aux_data->auto_neg_mask;
> +	hdev->max_num_of_ports = aux_data->max_num_of_ports;
> +	hdev->core_dev_id = aux_data->id;
> +	hdev->fw_ver = aux_data->fw_ver;
> +	hdev->qsfp_eeprom = aux_data->qsfp_eeprom;
> +	hdev->asic_type = aux_data->asic_type;
> +	hdev->pending_reset_long_timeout = aux_data->pending_reset_long_timeout;
> +	hdev->max_frm_len = aux_data->max_frm_len;
> +	hdev->raw_elem_size = aux_data->raw_elem_size;
> +	hdev->max_raw_mtu = aux_data->max_raw_mtu;
> +	hdev->min_raw_mtu = aux_data->min_raw_mtu;
> +	hdev->pad_size = ETH_ZLEN;
> +	hdev->has_eq = aux_data->has_eq;
> +	hdev->dma_map_support = true;
> +	hdev->poll_enable = poll_enable;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +		port->hdev = hdev;
> +		port->idx = i;
> +		port->pfc_enable = true;
> +		port->pflags = PFLAGS_PCS_LINK_CHECK | PFLAGS_PHY_AUTO_NEG_LPBK;
> +		port->mac_addr = aux_data->mac_addr[i];
> +		port->auto_neg_enable = !!(aux_data->auto_neg_mask & BIT(i));
> +	}
> +
> +	return 0;
> +
> +ports_alloc_fail:
> +	kfree(hdev);
> +
> +	return rc;
> +}
> +
> +static void hdev_fini(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	kfree(hdev->ports);
> +	kfree(hdev);
> +	aux_dev->priv = NULL;
> +}
> +
> +static const struct auxiliary_device_id hbl_en_id_table[] = {
> +	{ .name = "habanalabs_cn.en", },
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, hbl_en_id_table);
> +
> +static int hbl_en_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
> +{
> +	struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
> +	struct hbl_en_aux_ops *aux_ops = aux_dev->aux_ops;
> +	struct hbl_en_device *hdev;
> +	ktime_t timeout;
> +	int rc;
> +
> +	rc = hdev_init(aux_dev);
> +	if (rc) {
> +		dev_err(&aux_dev->adev.dev, "Failed to init hdev\n");
> +		return -EIO;
> +	}
> +
> +	hdev = aux_dev->priv;
> +
> +	/* don't allow module unloading while it is attached */
> +	if (!try_module_get(THIS_MODULE)) {
> +		dev_err(hdev->dev, "Failed to increment %s module refcount\n", HBL_EN_NAME);
> +		rc = -EIO;
> +		goto module_get_err;
> +	}
> +
> +	timeout = ktime_add_ms(ktime_get(), hdev->pending_reset_long_timeout * MSEC_PER_SEC);
> +	while (1) {
> +		aux_ops->hw_access_lock(aux_dev);
> +
> +		/* if the device is operational, proceed to actual init while holding the lock in
> +		 * order to prevent concurrent hard reset
> +		 */
> +		if (aux_ops->device_operational(aux_dev))
> +			break;
> +
> +		aux_ops->hw_access_unlock(aux_dev);
> +
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			dev_err(hdev->dev, "Timeout while waiting for hard reset to finish\n");
> +			rc = -EBUSY;
> +			goto timeout_err;
> +		}
> +
> +		dev_notice_once(hdev->dev, "Waiting for hard reset to finish before probing en\n");
> +
> +		msleep_interruptible(MSEC_PER_SEC);
> +	}
> +
> +	rc = hbl_en_dev_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to init en device\n");
> +		goto dev_init_err;
> +	}
> +
> +	aux_ops->hw_access_unlock(aux_dev);
> +
> +	return 0;
> +
> +dev_init_err:
> +	aux_ops->hw_access_unlock(aux_dev);
> +timeout_err:
> +	module_put(THIS_MODULE);
> +module_get_err:
> +	hdev_fini(aux_dev);
> +
> +	return rc;
> +}
> +
> +/* This function can be called only from the CN driver when deleting the aux bus, because we
> + * incremented the module refcount on probing. Hence no need to protect here from hard reset.
> + */
> +static void hbl_en_remove(struct auxiliary_device *adev)
> +{
> +	struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	if (!hdev)
> +		return;
> +
> +	hbl_en_dev_fini(hdev);
> +
> +	/* allow module unloading as now it is detached */
> +	module_put(THIS_MODULE);
> +
> +	hdev_fini(aux_dev);
> +}
> +
> +static struct auxiliary_driver hbl_en_driver = {
> +	.name = "eth",
> +	.probe = hbl_en_probe,
> +	.remove = hbl_en_remove,
> +	.id_table = hbl_en_id_table,
> +};
> +
> +static int __init hbl_en_init(void)
> +{
> +	pr_info("loading driver\n");
> +
> +	return auxiliary_driver_register(&hbl_en_driver);
> +}
> +
> +static void __exit hbl_en_exit(void)
> +{
> +	auxiliary_driver_unregister(&hbl_en_driver);
> +
> +	pr_info("driver removed\n");
> +}
> +
> +module_init(hbl_en_init);
> +module_exit(hbl_en_exit);
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> new file mode 100644
> index 000000000000..1d14d283409b
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> @@ -0,0 +1,452 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +#include <linux/ethtool.h>
> +
> +#define RX_COALESCED_FRAMES_MIN		1
> +#define TX_COALESCED_FRAMES_MIN		1
> +#define TX_COALESCED_FRAMES_MAX		10
> +
> +static const char pflags_str[][ETH_GSTRING_LEN] = {
> +	"pcs-link-check",
> +	"phy-auto-neg-lpbk",
> +};
> +
> +#define NIC_STAT(m) {#m, offsetof(struct hbl_en_port, net_stats.m)}
> +
> +static struct hbl_cn_stat netdev_eth_stats[] = {
> +	NIC_STAT(rx_packets),
> +	NIC_STAT(tx_packets),
> +	NIC_STAT(rx_bytes),
> +	NIC_STAT(tx_bytes),
> +	NIC_STAT(tx_errors),
> +	NIC_STAT(rx_dropped),
> +	NIC_STAT(tx_dropped)
> +};
> +
> +static size_t pflags_str_len = ARRAY_SIZE(pflags_str);
> +static size_t netdev_eth_stats_len = ARRAY_SIZE(netdev_eth_stats);
> +
> +static void hbl_en_ethtool_get_drvinfo(struct net_device *ndev, struct ethtool_drvinfo *drvinfo)
> +{
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +
> +	strscpy(drvinfo->driver, HBL_EN_NAME, sizeof(drvinfo->driver));
> +	strscpy(drvinfo->fw_version, hdev->fw_ver, sizeof(drvinfo->fw_version));
> +	strscpy(drvinfo->bus_info, pci_name(hdev->pdev), sizeof(drvinfo->bus_info));
> +}
> +
> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
> +{
> +	modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
> +	modinfo->type = ETH_MODULE_SFF_8636;
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_get_module_eeprom(struct net_device *ndev, struct ethtool_eeprom *ee,
> +					    u8 *data)
> +{
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 first, last, len;
> +	u8 *qsfp_eeprom;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	qsfp_eeprom = hdev->qsfp_eeprom;
> +
> +	if (ee->len == 0)
> +		return -EINVAL;
> +
> +	first = ee->offset;
> +	last = ee->offset + ee->len;
> +
> +	if (first < ETH_MODULE_SFF_8636_LEN) {
> +		len = min_t(unsigned int, last, ETH_MODULE_SFF_8079_LEN);
> +		len -= first;
> +
> +		memcpy(data, qsfp_eeprom + first, len);
> +	}
> +
> +	return 0;
> +}
> +
> +static u32 hbl_en_ethtool_get_priv_flags(struct net_device *ndev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +
> +	return port->pflags;
> +}
> +
> +static int hbl_en_ethtool_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +
> +	port->pflags = priv_flags;
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_get_link_ksettings(struct net_device *ndev,
> +					     struct ethtool_link_ksettings *cmd)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 port_idx, speed;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	speed = aux_ops->get_speed(aux_dev, port_idx);
> +
> +	cmd->base.speed = speed;
> +	cmd->base.duplex = DUPLEX_FULL;
> +
> +	ethtool_link_ksettings_zero_link_mode(cmd, supported);
> +	ethtool_link_ksettings_zero_link_mode(cmd, advertising);
> +
> +	switch (speed) {
> +	case SPEED_100000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseLR4_ER4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseLR4_ER4_Full);
> +
> +		cmd->base.port = PORT_FIBRE;
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, Backplane);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Backplane);
> +		break;
> +	case SPEED_50000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseKR2_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseKR2_Full);
> +		break;
> +	case SPEED_25000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 25000baseCR_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 25000baseCR_Full);
> +		break;
> +	case SPEED_200000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseKR4_Full);
> +		break;
> +	case SPEED_400000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseKR4_Full);
> +		break;
> +	default:
> +		netdev_err(port->ndev, "unknown speed %d\n", speed);
> +		return -EFAULT;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
> +
> +	if (port->auto_neg_enable) {
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
> +		cmd->base.autoneg = AUTONEG_ENABLE;
> +		if (port->auto_neg_resolved)
> +			ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
> +	} else {
> +		cmd->base.autoneg = AUTONEG_DISABLE;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
> +
> +	if (port->pfc_enable)
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);
> +
> +	return 0;
> +}
> +
> +/* only autoneg is mutable */
> +static bool check_immutable_ksettings(const struct ethtool_link_ksettings *old_cmd,
> +				      const struct ethtool_link_ksettings *new_cmd)
> +{
> +	return (old_cmd->base.speed == new_cmd->base.speed) &&
> +	       (old_cmd->base.duplex == new_cmd->base.duplex) &&
> +	       (old_cmd->base.port == new_cmd->base.port) &&
> +	       (old_cmd->base.phy_address == new_cmd->base.phy_address) &&
> +	       (old_cmd->base.eth_tp_mdix_ctrl == new_cmd->base.eth_tp_mdix_ctrl) &&
> +	       bitmap_equal(old_cmd->link_modes.advertising, new_cmd->link_modes.advertising,
> +			    __ETHTOOL_LINK_MODE_MASK_NBITS);
> +}
> +
> +static int
> +hbl_en_ethtool_set_link_ksettings(struct net_device *ndev, const struct ethtool_link_ksettings *cmd)
> +{
> +	struct ethtool_link_ksettings curr_cmd;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	bool auto_neg;
> +	u32 port_idx;
> +	int rc;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +
> +	memset(&curr_cmd, 0, sizeof(struct ethtool_link_ksettings));
> +
> +	rc = hbl_en_ethtool_get_link_ksettings(ndev, &curr_cmd);
> +	if (rc)
> +		return rc;
> +
> +	if (!check_immutable_ksettings(&curr_cmd, cmd))
> +		return -EOPNOTSUPP;
> +
> +	auto_neg = cmd->base.autoneg == AUTONEG_ENABLE;
> +
> +	if (port->auto_neg_enable == auto_neg)
> +		return 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(port->ndev, "port is in reset, can't update settings\n");
> +		return -EBUSY;
> +	}
> +
> +	if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
> +		netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	port->auto_neg_enable = auto_neg;
> +
> +	if (netif_running(port->ndev)) {
> +		rc = hbl_en_port_reset(port);
> +		if (rc)
> +			netdev_err(port->ndev, "Failed to reset port for settings update, rc %d\n",
> +				   rc);
> +	}
> +
> +out:
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_ethtool_get_sset_count(struct net_device *ndev, int sset)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	switch (sset) {
> +	case ETH_SS_STATS:
> +		return netdev_eth_stats_len + aux_ops->get_cnts_num(aux_dev, port_idx);
> +	case ETH_SS_PRIV_FLAGS:
> +		return pflags_str_len;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +static void hbl_en_ethtool_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int i;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	switch (stringset) {
> +	case ETH_SS_STATS:
> +		for (i = 0; i < netdev_eth_stats_len; i++)
> +			ethtool_puts(&data, netdev_eth_stats[i].str);
> +
> +		aux_ops->get_cnts_names(aux_dev, port_idx, data);
> +		break;
> +	case ETH_SS_PRIV_FLAGS:
> +		for (i = 0; i < pflags_str_len; i++)
> +			ethtool_puts(&data, pflags_str[i]);
> +		break;
> +	}
> +}
> +
> +static void hbl_en_ethtool_get_ethtool_stats(struct net_device *ndev,
> +					     __always_unused struct ethtool_stats *stats, u64 *data)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	u32 port_idx;
> +	char *p;
> +	int i;
> +
> +	hdev = port->hdev;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	port_idx = port->idx;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_info_ratelimited(hdev->dev, "port %d is in reset, can't get ethtool stats",
> +				     port_idx);
> +		return;
> +	}
> +
> +	/* Even though the Ethernet Rx/Tx flow might update the stats in parallel, there is not an
> +	 * absolute need for synchronisation. This is because, missing few counts of these stats is
> +	 * much better than adding a lock to synchronize and increase the overhead of the Rx/Tx
> +	 * flows. In worst case scenario, reader will get stale stats. He will receive updated
> +	 * stats in next read.
> +	 */
> +	for (i = 0; i < netdev_eth_stats_len; i++) {
> +		p = (char *)port + netdev_eth_stats[i].lo_offset;
> +		data[i] = *(u32 *)p;
> +	}
> +
> +	data += i;
> +
> +	aux_ops->get_cnts_values(aux_dev, port_idx, data);
> +
> +	atomic_set(&port->in_reset, 0);
> +}
> +
> +static int hbl_en_ethtool_get_coalesce(struct net_device *ndev,
> +				       struct ethtool_coalesce *coal,
> +				       struct kernel_ethtool_coalesce *kernel_coal,
> +				       struct netlink_ext_ack *extack)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +
> +	coal->tx_max_coalesced_frames = port->tx_max_coalesced_frames;
> +	coal->rx_coalesce_usecs = port->rx_coalesce_usecs;
> +	coal->rx_max_coalesced_frames = port->rx_max_coalesced_frames;
> +
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_set_coalesce(struct net_device *ndev,
> +				       struct ethtool_coalesce *coal,
> +				       struct kernel_ethtool_coalesce *kernel_coal,
> +				       struct netlink_ext_ack *extack)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc, rx_ring_size;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(port->ndev, "port is in reset, can't update settings\n");
> +		return -EBUSY;
> +	}
> +
> +	if (coal->tx_max_coalesced_frames < TX_COALESCED_FRAMES_MIN ||
> +	    coal->tx_max_coalesced_frames > TX_COALESCED_FRAMES_MAX) {
> +		netdev_err(ndev, "tx max_coalesced_frames should be between %d and %d\n",
> +			   TX_COALESCED_FRAMES_MIN, TX_COALESCED_FRAMES_MAX);
> +		rc = -EINVAL;
> +		goto atomic_out;
> +	}
> +
> +	rx_ring_size = hdev->asic_funcs.get_rx_ring_size(port);
> +	if (coal->rx_max_coalesced_frames < RX_COALESCED_FRAMES_MIN ||
> +	    coal->rx_max_coalesced_frames >= rx_ring_size) {
> +		netdev_err(ndev, "rx max_coalesced_frames should be between %d and %d\n",
> +			   RX_COALESCED_FRAMES_MIN, rx_ring_size);
> +		rc = -EINVAL;
> +		goto atomic_out;
> +	}
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +
> +	port->tx_max_coalesced_frames = coal->tx_max_coalesced_frames;
> +	port->rx_coalesce_usecs = coal->rx_coalesce_usecs;
> +	port->rx_max_coalesced_frames = coal->rx_max_coalesced_frames;
> +
> +	rc = hdev->asic_funcs.set_coalesce(port);
> +
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +atomic_out:
> +	atomic_set(&port->in_reset, 0);
> +	return rc;
> +}
> +
> +void hbl_en_ethtool_init_coalesce(struct hbl_en_port *port)
> +{
> +	port->rx_coalesce_usecs = CQ_ARM_TIMEOUT_USEC;
> +	port->rx_max_coalesced_frames = 1;
> +	port->tx_max_coalesced_frames = 1;
> +}
> +
> +static const struct ethtool_ops hbl_en_ethtool_ops_coalesce = {
> +	.supported_coalesce_params = ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_RX_MAX_FRAMES |
> +				     ETHTOOL_COALESCE_TX_MAX_FRAMES,
> +	.get_drvinfo = hbl_en_ethtool_get_drvinfo,
> +	.get_link = ethtool_op_get_link,
> +	.get_module_info = hbl_en_ethtool_get_module_info,
> +	.get_module_eeprom = hbl_en_ethtool_get_module_eeprom,
> +	.get_priv_flags = hbl_en_ethtool_get_priv_flags,
> +	.set_priv_flags = hbl_en_ethtool_set_priv_flags,
> +	.get_link_ksettings = hbl_en_ethtool_get_link_ksettings,
> +	.set_link_ksettings = hbl_en_ethtool_set_link_ksettings,
> +	.get_sset_count = hbl_en_ethtool_get_sset_count,
> +	.get_strings = hbl_en_ethtool_get_strings,
> +	.get_ethtool_stats = hbl_en_ethtool_get_ethtool_stats,
> +	.get_coalesce = hbl_en_ethtool_get_coalesce,
> +	.set_coalesce = hbl_en_ethtool_set_coalesce,
> +};
> +
> +const struct ethtool_ops *hbl_en_ethtool_get_ops(struct net_device *ndev)
> +{
> +	return &hbl_en_ethtool_ops_coalesce;
> +}
> -- 
> 2.34.1
> 
>

Stephen Hemminger June 15, 2024, 12:10 a.m. UTC | #3

On Thu, 13 Jun 2024 11:22:02 +0300
Omer Shpigelman <oshpigelman@habana.ai> wrote:

> +static int hbl_en_ports_reopen(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int rc = 0, i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be that the port was shutdown by 'ip link set down' and there is no need
> +		 * in reopening it.
> +		 * Since we mark the ports as in reset even if they are disabled, we clear the flag
> +		 * here anyway.
> +		 * See hbl_en_ports_stop_prepare() for more info.
> +		 */
> +		if (!netif_running(port->ndev)) {
> +			atomic_set(&port->in_reset, 0);
> +			continue;
> +		}
> +

Rather than duplicating network device state in your own flags, it would be better to use
existing infrastructure. Read Documentation/networking/operstates.rst

Then you could also get rid of the kludge timer stuff in hbl_en_close().

Stephen Hemminger June 15, 2024, 12:16 a.m. UTC | #4

> +
> +/* get the src IP as it is done in devinet_ioctl() */
> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	struct in_device *in_dev;
> +	struct in_ifaddr *ifa;
> +	int rc = 0;
> +
> +	/* for the case where no src IP is configured */
> +	*src_ip = 0;
> +
> +	/* rtnl lock should be acquired in relevant flows before taking configuration lock */
> +	if (!rtnl_is_locked()) {
> +		netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	in_dev = __in_dev_get_rtnl(ndev);
> +	if (!in_dev) {
> +		netdev_err(port->ndev, "Failed to get IPv4 struct\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	ifa = rtnl_dereference(in_dev->ifa_list);
> +
> +	while (ifa) {
> +		if (!strcmp(ndev->name, ifa->ifa_label)) {
> +			/* convert the BE to native and later on it will be
> +			 * written to the HW as LE in QPC_SET
> +			 */
> +			*src_ip = be32_to_cpu(ifa->ifa_local);
> +			break;
> +		}
> +		ifa = rtnl_dereference(ifa->ifa_next);
> +	}
> +out:
> +	return rc;
> +}

Does this device require IPv4? What about users and infrastructures that use IPv6 only?
IPv4 is legacy at this point.

Zhu Yanjun June 15, 2024, 10:55 a.m. UTC | #5

在 2024/6/13 16:22, Omer Shpigelman 写道:
> This ethernet driver is initialized via auxiliary bus by the hbl_cn
> driver.
> It serves mainly for control operations that are needed for AI scaling.
> 
> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> Co-developed-by: Abhilash K V <kvabhilash@habana.ai>
> Signed-off-by: Abhilash K V <kvabhilash@habana.ai>
> Co-developed-by: Andrey Agranovich <aagranovich@habana.ai>
> Signed-off-by: Andrey Agranovich <aagranovich@habana.ai>
> Co-developed-by: Bharat Jauhari <bjauhari@habana.ai>
> Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
> Co-developed-by: David Meriin <dmeriin@habana.ai>
> Signed-off-by: David Meriin <dmeriin@habana.ai>
> Co-developed-by: Sagiv Ozeri <sozeri@habana.ai>
> Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
> Co-developed-by: Zvika Yehudai <zyehudai@habana.ai>
> Signed-off-by: Zvika Yehudai <zyehudai@habana.ai>
> ---
>   MAINTAINERS                                   |    9 +
>   drivers/net/ethernet/intel/Kconfig            |   18 +
>   drivers/net/ethernet/intel/Makefile           |    1 +
>   drivers/net/ethernet/intel/hbl_en/Makefile    |    9 +
>   .../net/ethernet/intel/hbl_en/common/Makefile |    3 +
>   .../net/ethernet/intel/hbl_en/common/hbl_en.c | 1168 +++++++++++++++++
>   .../net/ethernet/intel/hbl_en/common/hbl_en.h |  206 +++
>   .../intel/hbl_en/common/hbl_en_dcbnl.c        |  101 ++
>   .../ethernet/intel/hbl_en/common/hbl_en_drv.c |  211 +++
>   .../intel/hbl_en/common/hbl_en_ethtool.c      |  452 +++++++
>   10 files changed, 2178 insertions(+)
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/Makefile
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/Makefile
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 096439a62129..7301f38e9cfb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9617,6 +9617,15 @@ F:	include/linux/habanalabs/
>   F:	include/linux/net/intel/cn*
>   F:	include/linux/net/intel/gaudi2*
>   
> +HABANALABS ETHERNET DRIVER
> +M:	Omer Shpigelman <oshpigelman@habana.ai>
> +L:	netdev@vger.kernel.org
> +S:	Supported
> +W:	https://www.habana.ai
> +F:	Documentation/networking/device_drivers/ethernet/intel/hbl.rst
> +F:	drivers/net/ethernet/intel/hbl_en/
> +F:	include/linux/net/intel/cn*
> +
>   HACKRF MEDIA DRIVER
>   L:	linux-media@vger.kernel.org
>   S:	Orphan
> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> index 0d1b8a2bae99..5d07349348a0 100644
> --- a/drivers/net/ethernet/intel/Kconfig
> +++ b/drivers/net/ethernet/intel/Kconfig
> @@ -417,4 +417,22 @@ config HABANA_CN
>   	  To compile this driver as a module, choose M here. The module
>   	  will be called habanalabs_cn.
>   
> +config HABANA_EN
> +	tristate "HabanaLabs (an Intel Company) Ethernet driver"
> +	depends on NETDEVICES && ETHERNET && INET
> +	select HABANA_CN
> +	help
> +	  This driver enables Ethernet functionality for the network interfaces
> +	  that are part of the GAUDI ASIC family of AI Accelerators.
> +	  For more information on how to identify your adapter, go to the
> +	  Adapter & Driver ID Guide that can be located at:
> +
> +	  <http://support.intel.com>
> +
> +	  More specific information on configuring the driver is in
> +	  <file:Documentation/networking/device_drivers/ethernet/intel/hbl.rst>.
> +
> +	  To compile this driver as a module, choose M here. The module
> +	  will be called habanalabs_en.
> +
>   endif # NET_VENDOR_INTEL
> diff --git a/drivers/net/ethernet/intel/Makefile b/drivers/net/ethernet/intel/Makefile
> index 10049a28e336..ec62a0227897 100644
> --- a/drivers/net/ethernet/intel/Makefile
> +++ b/drivers/net/ethernet/intel/Makefile
> @@ -20,3 +20,4 @@ obj-$(CONFIG_FM10K) += fm10k/
>   obj-$(CONFIG_ICE) += ice/
>   obj-$(CONFIG_IDPF) += idpf/
>   obj-$(CONFIG_HABANA_CN) += hbl_cn/
> +obj-$(CONFIG_HABANA_EN) += hbl_en/
> diff --git a/drivers/net/ethernet/intel/hbl_en/Makefile b/drivers/net/ethernet/intel/hbl_en/Makefile
> new file mode 100644
> index 000000000000..695497ab93b6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/Makefile
> @@ -0,0 +1,9 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Makefile for HabanaLabs (an Intel Company) Ethernet network driver
> +#
> +
> +obj-$(CONFIG_HABANA_EN) := habanalabs_en.o
> +
> +include $(src)/common/Makefile
> +habanalabs_en-y += $(HBL_EN_COMMON_FILES)
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/Makefile b/drivers/net/ethernet/intel/hbl_en/common/Makefile
> new file mode 100644
> index 000000000000..a3ccb5dbf4a6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +HBL_EN_COMMON_FILES := common/hbl_en_drv.o common/hbl_en.o \
> +	common/hbl_en_ethtool.o common/hbl_en_dcbnl.o
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
> new file mode 100644
> index 000000000000..066be5ac2d84
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
> @@ -0,0 +1,1168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +#include <linux/inetdevice.h>
> +
> +#define TX_TIMEOUT			(5 * HZ)
> +#define PORT_RESET_TIMEOUT_MSEC		(60 * 1000ull) /* 60s */
> +
> +/**
> + * struct hbl_en_tx_pkt_work - used to schedule a work of a Tx packet.
> + * @tx_work: workqueue object to run when packet needs to be sent.
> + * @port: pointer to current port structure.
> + * @skb: copy of the packet to send.
> + */
> +struct hbl_en_tx_pkt_work {
> +	struct work_struct tx_work;
> +	struct hbl_en_port *port;
> +	struct sk_buff *skb;
> +};
> +
> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget);
> +static int hbl_en_port_open(struct hbl_en_port *port);
> +
> +static int hbl_en_ports_reopen(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int rc = 0, i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be that the port was shutdown by 'ip link set down' and there is no need
> +		 * in reopening it.
> +		 * Since we mark the ports as in reset even if they are disabled, we clear the flag
> +		 * here anyway.
> +		 * See hbl_en_ports_stop_prepare() for more info.
> +		 */
> +		if (!netif_running(port->ndev)) {
> +			atomic_set(&port->in_reset, 0);
> +			continue;
> +		}
> +
> +		rc = hbl_en_port_open(port);
> +
> +		atomic_set(&port->in_reset, 0);
> +
> +		if (rc)
> +			break;
> +	}
> +
> +	hdev->in_reset = false;
> +
> +	return rc;
> +}
> +
> +static void hbl_en_port_fini(struct hbl_en_port *port)
> +{
> +	if (port->rx_wq)
> +		destroy_workqueue(port->rx_wq);
> +}
> +
> +static int hbl_en_port_init(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u32 port_idx = port->idx;
> +	char wq_name[32];
> +	int rc;
> +
> +	if (hdev->poll_enable) {
> +		memset(wq_name, 0, sizeof(wq_name));
> +		snprintf(wq_name, sizeof(wq_name) - 1, "hbl%u-port%d-rx-wq", hdev->core_dev_id,
> +			 port_idx);
> +		port->rx_wq = alloc_ordered_workqueue(wq_name, 0);
> +		if (!port->rx_wq) {
> +			dev_err(hdev->dev, "Failed to allocate Rx WQ\n");
> +			rc = -ENOMEM;
> +			goto fail;
> +		}
> +	}
> +
> +	hbl_en_ethtool_init_coalesce(port);
> +
> +	return 0;
> +
> +fail:
> +	hbl_en_port_fini(port);
> +
> +	return rc;
> +}
> +
> +static void _hbl_en_set_port_status(struct hbl_en_port *port, bool up)
> +{
> +	struct net_device *ndev = port->ndev;
> +	u32 port_idx = port->idx;
> +
> +	if (up) {
> +		netif_carrier_on(ndev);
> +		netif_wake_queue(ndev);
> +	} else {
> +		netif_carrier_off(ndev);
> +		netif_stop_queue(ndev);
> +	}
> +
> +	/* Unless link events are getting through the EQ, no need to print about link down events
> +	 * during port reset
> +	 */
> +	if (port->hdev->has_eq || up || !atomic_read(&port->in_reset))
> +		netdev_info(port->ndev, "link %s, port %d\n", up ? "up" : "down", port_idx);
> +}
> +
> +static void hbl_en_set_port_status(struct hbl_aux_dev *aux_dev, u32 port_idx, bool up)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	_hbl_en_set_port_status(port, up);
> +}
> +
> +static bool hbl_en_is_port_open(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return port->is_initialized;
> +}
> +
> +/* get the src IP as it is done in devinet_ioctl() */
> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	struct in_device *in_dev;
> +	struct in_ifaddr *ifa;
> +	int rc = 0;
> +
> +	/* for the case where no src IP is configured */
> +	*src_ip = 0;
> +
> +	/* rtnl lock should be acquired in relevant flows before taking configuration lock */
> +	if (!rtnl_is_locked()) {
> +		netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	in_dev = __in_dev_get_rtnl(ndev);
> +	if (!in_dev) {
> +		netdev_err(port->ndev, "Failed to get IPv4 struct\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	ifa = rtnl_dereference(in_dev->ifa_list);
> +
> +	while (ifa) {
> +		if (!strcmp(ndev->name, ifa->ifa_label)) {
> +			/* convert the BE to native and later on it will be
> +			 * written to the HW as LE in QPC_SET
> +			 */
> +			*src_ip = be32_to_cpu(ifa->ifa_local);
> +			break;
> +		}
> +		ifa = rtnl_dereference(ifa->ifa_next);
> +	}
> +out:
> +	return rc;
> +}
> +
> +static void hbl_en_reset_stats(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	port->net_stats.rx_packets = 0;
> +	port->net_stats.tx_packets = 0;
> +	port->net_stats.rx_bytes = 0;
> +	port->net_stats.tx_bytes = 0;
> +	port->net_stats.tx_errors = 0;
> +	atomic64_set(&port->net_stats.rx_dropped, 0);
> +	atomic64_set(&port->net_stats.tx_dropped, 0);
> +}
> +
> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	u32 mtu;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(ndev, "port is in reset, can't get MTU\n");
> +		return 0;
> +	}
> +
> +	mtu = ndev->mtu;
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return mtu;
> +}
> +
> +static u32 hbl_en_get_pflags(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return port->pflags;
> +}
> +
> +static void hbl_en_set_dev_lpbk(struct hbl_aux_dev *aux_dev, u32 port_idx, bool enable)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +
> +	if (enable)
> +		ndev->features |= NETIF_F_LOOPBACK;
> +	else
> +		ndev->features &= ~NETIF_F_LOOPBACK;
> +}
> +
> +/* This function should be called after ctrl_lock was taken */

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/dev-tools/sparse.rst?h=v6.10-rc3#n64

"
__must_hold - The specified lock is held on function entry and exit.
"
Add "__must_hold" to confirm "The specified lock is held on function 
entry and exit." ?

Zhu Yanjun
> +static int hbl_en_port_open_locked(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct net_device *ndev = port->ndev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (port->is_initialized)
> +		return 0;
> +
> +	if (!hdev->poll_enable)
> +		netif_napi_add(ndev, &port->napi, hbl_en_napi_poll);
> +
> +	rc = aux_ops->port_hw_init(aux_dev, port_idx);
> +	if (rc) {
> +		netdev_err(ndev, "Failed to configure the HW, rc %d\n", rc);
> +		goto hw_init_fail;
> +	}
> +
> +	if (!hdev->poll_enable)
> +		napi_enable(&port->napi);
> +
> +	rc = hdev->asic_funcs.eth_port_open(port);
> +	if (rc) {
> +		netdev_err(ndev, "Failed to init H/W, rc %d\n", rc);
> +		goto port_open_fail;
> +	}
> +
> +	rc = aux_ops->update_mtu(aux_dev, port_idx, ndev->mtu);
> +	if (rc) {
> +		netdev_err(ndev, "MTU update failed, rc %d\n", rc);
> +		goto update_mtu_fail;
> +	}
> +
> +	rc = aux_ops->phy_init(aux_dev, port_idx);
> +	if (rc) {
> +		netdev_err(ndev, "PHY init failed, rc %d\n", rc);
> +		goto phy_init_fail;
> +	}
> +
> +	netif_start_queue(ndev);
> +
> +	port->is_initialized = true;
> +
> +	return 0;
> +
> +phy_init_fail:
> +	/* no need to revert the MTU change, it will be updated on next port open */
> +update_mtu_fail:
> +	hdev->asic_funcs.eth_port_close(port);
> +port_open_fail:
> +	if (!hdev->poll_enable)
> +		napi_disable(&port->napi);
> +
> +	aux_ops->port_hw_fini(aux_dev, port_idx);
> +hw_init_fail:
> +	if (!hdev->poll_enable)
> +		netif_napi_del(&port->napi);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_port_open(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +	rc = hbl_en_port_open_locked(port);
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_open(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't open it\n");
> +		return -EBUSY;
> +	}
> +
> +	rc = hbl_en_port_open(port);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static void hbl_en_port_close_locked(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (!port->is_initialized)
> +		return;
> +
> +	port->is_initialized = false;
> +
> +	/* verify that the port is marked as closed before continuing */
> +	mb();
> +
> +	/* Print if not in hard reset flow e.g. from ip cmd */
> +	if (!hdev->in_reset && netif_carrier_ok(port->ndev))
> +		netdev_info(port->ndev, "port was closed\n");
> +
> +	/* disable the PHY here so no link changes will occur from this point forward */
> +	aux_ops->phy_fini(aux_dev, port_idx);
> +
> +	/* disable Tx SW flow */
> +	netif_carrier_off(port->ndev);
> +	netif_tx_disable(port->ndev);
> +
> +	/* stop Tx/Rx HW */
> +	aux_ops->port_hw_fini(aux_dev, port_idx);
> +
> +	/* disable Tx/Rx QPs */
> +	hdev->asic_funcs.eth_port_close(port);
> +
> +	/* stop Rx SW flow */
> +	if (hdev->poll_enable) {
> +		hbl_en_rx_poll_stop(port);
> +	} else {
> +		napi_disable(&port->napi);
> +		netif_napi_del(&port->napi);
> +	}
> +
> +	/* Explicitly count the port close operations as we don't get a link event for this.
> +	 * Upon port open we receive a link event, hence no additional action required.
> +	 */
> +	aux_ops->port_toggle_count(aux_dev, port_idx);
> +}
> +
> +static void hbl_en_port_close(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +	hbl_en_port_close_locked(port);
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static int __hbl_en_port_reset_locked(struct hbl_en_port *port)
> +{
> +	hbl_en_port_close_locked(port);
> +
> +	return hbl_en_port_open_locked(port);
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +int hbl_en_port_reset_locked(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return __hbl_en_port_reset_locked(port);
> +}
> +
> +int hbl_en_port_reset(struct hbl_en_port *port)
> +{
> +	hbl_en_port_close(port);
> +
> +	/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +	msleep(20);
> +
> +	return hbl_en_port_open(port);
> +}
> +
> +static int hbl_en_close(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	ktime_t timeout;
> +
> +	/* Looks like the return value of this function is not checked, so we can't just return
> +	 * EBUSY if the port is under reset. We need to wait until the reset is finished and then
> +	 * close the port. Otherwise the netdev will set the port as closed although port_close()
> +	 * wasn't called. Only if we waited long enough and the reset hasn't finished, we can return
> +	 * an error without actually closing the port as it is a fatal flow anyway.
> +	 */
> +	timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +	while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		/* If this is called from unregister_netdev() then the port was already closed and
> +		 * hence we can safely return.
> +		 * We could have just check the port_open boolean, but that might hide some future
> +		 * bugs. Hence it is better to use a dedicated flag for that.
> +		 */
> +		if (READ_ONCE(hdev->in_teardown))
> +			return 0;
> +
> +		usleep_range(50, 200);
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			netdev_crit(netdev,
> +				    "Timeout while waiting for port to finish reset, can't close it\n"
> +				    );
> +			return -EBUSY;
> +		}
> +	}
> +
> +	hbl_en_port_close(port);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return 0;
> +}
> +
> +/**
> + * hbl_en_ports_stop_prepare() - stop the Rx and Tx and synchronize with other reset flows.
> + * @aux_dev: habanalabs auxiliary device structure.
> + *
> + * This function makes sure that during the reset no packets will be processed and that
> + * ndo_open/ndo_close do not open/close the ports.
> + * A hard reset might occur right after the driver was loaded, which means before the ports
> + * initialization was finished. Therefore, even if the ports are not yet open, we mark it as in
> + * reset in order to avoid races. We clear the in reset flag later on when reopening the ports.
> + */
> +static void hbl_en_ports_stop_prepare(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	ktime_t timeout;
> +	int i;
> +
> +	/* Check if the ports where initialized. If not, we shouldn't mark them as in reset because
> +	 * they will fail to get opened.
> +	 */
> +	if (!hdev->is_initialized || hdev->in_reset)
> +		return;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* This function is competing with reset from ethtool/ip, so try to take the
> +		 * in_reset atomic and if we are already in a middle of reset, wait until reset
> +		 * function is finished.
> +		 * Reset function is designed to always finish (could take up to a few seconds in
> +		 * worst case).
> +		 * We mark also closed ports as in reset so they won't be able to get opened while
> +		 * the device in under reset.
> +		 */
> +
> +		timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +		while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +			usleep_range(50, 200);
> +			if (ktime_compare(ktime_get(), timeout) > 0) {
> +				netdev_crit(port->ndev,
> +					    "Timeout while waiting for port %d to finish reset\n",
> +					    port->idx);
> +				break;
> +			}
> +		}
> +	}
> +
> +	hdev->in_reset = true;
> +}
> +
> +static void hbl_en_ports_stop(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		if (netif_running(port->ndev))
> +			hbl_en_port_close(port);
> +	}
> +}
> +
> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't change MTU\n");
> +		return -EBUSY;
> +	}
> +
> +	if (netif_running(port->ndev)) {
> +		hbl_en_port_close(port);
> +
> +		/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +		msleep(20);
> +
> +		netdev->mtu = new_mtu;
> +
> +		rc = hbl_en_port_open(port);
> +		if (rc)
> +			netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
> +	} else {
> +		netdev->mtu = new_mtu;
> +	}
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +/* Swap source and destination MAC addresses */
> +static inline void swap_l2(char *buf)
> +{
> +	u16 *eth_hdr, tmp;
> +
> +	eth_hdr = (u16 *)buf;
> +	tmp = eth_hdr[0];
> +	eth_hdr[0] = eth_hdr[3];
> +	eth_hdr[3] = tmp;
> +	tmp = eth_hdr[1];
> +	eth_hdr[1] = eth_hdr[4];
> +	eth_hdr[4] = tmp;
> +	tmp = eth_hdr[2];
> +	eth_hdr[2] = eth_hdr[5];
> +	eth_hdr[5] = tmp;
> +}
> +
> +/* Swap source and destination IP addresses
> + */
> +static inline void swap_l3(char *buf)
> +{
> +	u32 tmp;
> +
> +	/* skip the Ethernet header and the IP header till source IP address */
> +	buf += ETH_HLEN + 12;
> +	tmp = ((u32 *)buf)[0];
> +	((u32 *)buf)[0] = ((u32 *)buf)[1];
> +	((u32 *)buf)[1] = tmp;
> +}
> +
> +static void do_tx_swap(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u16 *tmp_buff = (u16 *)skb->data;
> +	u32 port_idx = port->idx;
> +
> +	/* First, let's print the SKB we got */
> +	dev_dbg_ratelimited(hdev->dev,
> +			    "Send [P%d]: dst-mac:%04x%04x%04x, src-mac:%04x%04x%04x, eth-type:%04x, len:%u\n",
> +			    port_idx, swab16(tmp_buff[0]), swab16(tmp_buff[1]), swab16(tmp_buff[2]),
> +			    swab16(tmp_buff[3]), swab16(tmp_buff[4]), swab16(tmp_buff[5]),
> +			    swab16(tmp_buff[6]), skb->len);
> +
> +	/* Before submit it to HW, in case this is ipv4 pkt, swap eth/ip addresses.
> +	 * that way, we may send ECMP (ping) to ourselves in LB cases.
> +	 */
> +	swap_l2(skb->data);
> +	if (swab16(tmp_buff[6]) == ETH_P_IP)
> +		swap_l3(skb->data);
> +}
> +
> +static bool is_pkt_swap_enabled(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	return aux_ops->is_eth_lpbk(aux_dev);
> +}
> +
> +static bool is_tx_disabled(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	return aux_ops->get_mac_lpbk(aux_dev, port_idx) && !is_pkt_swap_enabled(hdev);
> +}
> +
> +static netdev_tx_t hbl_en_handle_tx(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	netdev_tx_t ret;
> +
> +	if (skb->len <= 0 || is_tx_disabled(port))
> +		goto free_skb;
> +
> +	if (skb->len > hdev->max_frm_len) {
> +		netdev_err(port->ndev, "Tx pkt size %uB exceeds maximum of %uB\n", skb->len,
> +			   hdev->max_frm_len);
> +		goto free_skb;
> +	}
> +
> +	if (is_pkt_swap_enabled(hdev))
> +		do_tx_swap(port, skb);
> +
> +	/* Pad the ethernet packets to the minimum frame size as the NIC hw doesn't do it.
> +	 * eth_skb_pad() frees the packet on failure, so just increment the dropped counter and
> +	 * return as success to avoid a retry.
> +	 */
> +	if (skb_put_padto(skb, hdev->pad_size)) {
> +		dev_err_ratelimited(hdev->dev, "Padding failed, the skb is dropped\n");
> +		atomic64_inc(&port->net_stats.tx_dropped);
> +		return NETDEV_TX_OK;
> +	}
> +
> +	ret = hdev->asic_funcs.write_pkt_to_hw(port, skb);
> +	if (ret == NETDEV_TX_OK) {
> +		port->net_stats.tx_packets++;
> +		port->net_stats.tx_bytes += skb->len;
> +	}
> +
> +	return ret;
> +
> +free_skb:
> +	dev_kfree_skb_any(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static netdev_tx_t hbl_en_start_xmit(struct sk_buff *skb, struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev;
> +
> +	hdev = port->hdev;
> +
> +	return hbl_en_handle_tx(port, skb);
> +}
> +
> +static int hbl_en_set_port_mac_loopback(struct hbl_en_port *port, bool enable)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct net_device *ndev = port->ndev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	rc = aux_ops->set_mac_lpbk(aux_dev, port_idx, enable);
> +	if (rc)
> +		return rc;
> +
> +	netdev_info(ndev, "port %u: mac loopback is %s\n", port_idx,
> +		    enable ? "enabled" : "disabled");
> +
> +	if (netif_running(ndev)) {
> +		rc = hbl_en_port_reset(port);
> +		if (rc) {
> +			netdev_err(ndev, "Failed to reset port %u, rc %d\n", port_idx, rc);
> +			return rc;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int hbl_en_set_features(struct net_device *netdev, netdev_features_t features)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	netdev_features_t changed;
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port %d is in reset, can't update settings", port->idx);
> +		return -EBUSY;
> +	}
> +
> +	changed = netdev->features ^ features;
> +
> +	if (changed & NETIF_F_LOOPBACK)
> +		rc = hbl_en_set_port_mac_loopback(port, !!(features & NETIF_F_LOOPBACK));
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static void hbl_en_handle_tx_timeout(struct net_device *netdev, unsigned int txqueue)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +
> +	port->net_stats.tx_errors++;
> +	atomic64_inc(&port->net_stats.tx_dropped);
> +}
> +
> +static void hbl_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(dev);
> +
> +	stats->rx_bytes = port->net_stats.rx_bytes;
> +	stats->tx_bytes = port->net_stats.tx_bytes;
> +	stats->rx_packets = port->net_stats.rx_packets;
> +	stats->tx_packets = port->net_stats.tx_packets;
> +	stats->tx_errors = port->net_stats.tx_errors;
> +	stats->tx_dropped = (u64)atomic64_read(&port->net_stats.tx_dropped);
> +	stats->rx_dropped = (u64)atomic64_read(&port->net_stats.rx_dropped);
> +}
> +
> +static const struct net_device_ops hbl_en_netdev_ops = {
> +	.ndo_open = hbl_en_open,
> +	.ndo_stop = hbl_en_close,
> +	.ndo_start_xmit = hbl_en_start_xmit,
> +	.ndo_validate_addr = eth_validate_addr,
> +	.ndo_change_mtu = hbl_en_change_mtu,
> +	.ndo_set_features = hbl_en_set_features,
> +	.ndo_get_stats64 = hbl_en_get_stats64,
> +	.ndo_tx_timeout = hbl_en_handle_tx_timeout,
> +};
> +
> +static void hbl_en_set_ops(struct net_device *ndev)
> +{
> +	ndev->netdev_ops = &hbl_en_netdev_ops;
> +	ndev->ethtool_ops = hbl_en_ethtool_get_ops(ndev);
> +#ifdef CONFIG_DCB
> +	ndev->dcbnl_ops = &hbl_en_dcbnl_ops;
> +#endif
> +}
> +
> +static int hbl_en_port_register(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	struct hbl_en_port **ptr;
> +	struct net_device *ndev;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	ndev = alloc_etherdev(sizeof(struct hbl_en_port *));
> +	if (!ndev) {
> +		dev_err(hdev->dev, "netdevice %d alloc failed\n", port_idx);
> +		return -ENOMEM;
> +	}
> +
> +	port->ndev = ndev;
> +	SET_NETDEV_DEV(ndev, &hdev->pdev->dev);
> +	ptr = netdev_priv(ndev);
> +	*ptr = port;
> +
> +	/* necessary for creating multiple interfaces */
> +	ndev->dev_port = port_idx;
> +
> +	hbl_en_set_ops(ndev);
> +
> +	ndev->watchdog_timeo = TX_TIMEOUT;
> +	ndev->min_mtu = hdev->min_raw_mtu;
> +	ndev->max_mtu = hdev->max_raw_mtu;
> +
> +	/* Add loopback capability to the device. */
> +	ndev->hw_features |= NETIF_F_LOOPBACK;
> +
> +	/* If this port was set to loopback, set it also to the ndev features */
> +	if (aux_ops->get_mac_lpbk(aux_dev, port_idx))
> +		ndev->features |= NETIF_F_LOOPBACK;
> +
> +	eth_hw_addr_set(ndev, port->mac_addr);
> +
> +	/* It's more an intelligent poll wherein, we enable the Rx completion EQE event and then
> +	 * start the poll from there.
> +	 * Inside the polling thread, we read packets from hardware and then reschedule the poll
> +	 * only if there are more packets to be processed. Else we re-enable the CQ Arm interrupt
> +	 * and exit the poll.
> +	 */
> +	if (hdev->poll_enable)
> +		hbl_en_rx_poll_trigger_init(port);
> +
> +	netif_carrier_off(ndev);
> +
> +	rc = register_netdev(ndev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Could not register netdevice %d\n", port_idx);
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	if (ndev) {
> +		free_netdev(ndev);
> +		port->ndev = NULL;
> +	}
> +
> +	return rc;
> +}
> +
> +static void dump_swap_pkt(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u16 *tmp_buff = (u16 *)skb->data;
> +	u32 port_idx = port->idx;
> +
> +	/* The SKB is ready now (before stripping-out the L2), print its content */
> +	dev_dbg_ratelimited(hdev->dev,
> +			    "Recv [P%d]: dst-mac:%04x%04x%04x, src-mac:%04x%04x%04x, eth-type:%04x, len:%u\n",
> +			    port_idx, swab16(tmp_buff[0]), swab16(tmp_buff[1]), swab16(tmp_buff[2]),
> +			    swab16(tmp_buff[3]), swab16(tmp_buff[4]), swab16(tmp_buff[5]),
> +			    swab16(tmp_buff[6]), skb->len);
> +}
> +
> +int hbl_en_handle_rx(struct hbl_en_port *port, int budget)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	enum hbl_en_eth_pkt_status pkt_status;
> +	struct net_device *ndev = port->ndev;
> +	int rc, pkt_count = 0;
> +	struct sk_buff *skb;
> +	void *pkt_addr;
> +	u32 pkt_size;
> +
> +	if (!netif_carrier_ok(ndev))
> +		return 0;
> +
> +	while (pkt_count < budget) {
> +		pkt_status = hdev->asic_funcs.read_pkt_from_hw(port, &pkt_addr, &pkt_size);
> +
> +		if (pkt_status == ETH_PKT_NONE)
> +			break;
> +
> +		pkt_count++;
> +
> +		if (pkt_status == ETH_PKT_DROP) {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +			continue;
> +		}
> +
> +		if (hdev->poll_enable)
> +			skb = __netdev_alloc_skb_ip_align(ndev, pkt_size, GFP_KERNEL);
> +		else
> +			skb = napi_alloc_skb(&port->napi, pkt_size);
> +
> +		if (!skb) {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +			break;
> +		}
> +
> +		skb_copy_to_linear_data(skb, pkt_addr, pkt_size);
> +		skb_put(skb, pkt_size);
> +
> +		if (is_pkt_swap_enabled(hdev))
> +			dump_swap_pkt(port, skb);
> +
> +		skb->protocol = eth_type_trans(skb, ndev);
> +
> +		/* Zero the packet buffer memory to avoid leak in case of wrong
> +		 * size is used when next packet populates the same memory
> +		 */
> +		memset(pkt_addr, 0, pkt_size);
> +
> +		/* polling is done in thread context and hence BH should be disabled */
> +		if (hdev->poll_enable)
> +			local_bh_disable();
> +
> +		rc = netif_receive_skb(skb);
> +
> +		if (hdev->poll_enable)
> +			local_bh_enable();
> +
> +		if (rc == NET_RX_SUCCESS) {
> +			port->net_stats.rx_packets++;
> +			port->net_stats.rx_bytes += pkt_size;
> +		} else {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +		}
> +	}
> +
> +	return pkt_count;
> +}
> +
> +static bool __hbl_en_rx_poll_schedule(struct hbl_en_port *port, unsigned long delay)
> +{
> +	return queue_delayed_work(port->rx_wq, &port->rx_poll_work, delay);
> +}
> +
> +static void hbl_en_rx_poll_work(struct work_struct *work)
> +{
> +	struct hbl_en_port *port = container_of(work, struct hbl_en_port, rx_poll_work.work);
> +	struct hbl_en_device *hdev = port->hdev;
> +	int pkt_count;
> +
> +	pkt_count = hbl_en_handle_rx(port, NAPI_POLL_WEIGHT);
> +
> +	/* Reschedule the poll if we have consumed budget which means we still have packets to
> +	 * process. Else re-enable the Rx IRQs and exit the work.
> +	 */
> +	if (pkt_count < NAPI_POLL_WEIGHT)
> +		hdev->asic_funcs.reenable_rx_irq(port);
> +	else
> +		__hbl_en_rx_poll_schedule(port, 0);
> +}
> +
> +/* Rx poll init and trigger routines are used in event-driven setups where
> + * Rx polling is initialized once during init or open and started/triggered by the event handler.
> + */
> +void hbl_en_rx_poll_trigger_init(struct hbl_en_port *port)
> +{
> +	INIT_DELAYED_WORK(&port->rx_poll_work, hbl_en_rx_poll_work);
> +}
> +
> +bool hbl_en_rx_poll_start(struct hbl_en_port *port)
> +{
> +	return __hbl_en_rx_poll_schedule(port, msecs_to_jiffies(1));
> +}
> +
> +void hbl_en_rx_poll_stop(struct hbl_en_port *port)
> +{
> +	cancel_delayed_work_sync(&port->rx_poll_work);
> +}
> +
> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget)
> +{
> +	struct hbl_en_port *port = container_of(napi, struct hbl_en_port, napi);
> +	struct hbl_en_device *hdev = port->hdev;
> +	int pkt_count;
> +
> +	/* exit if we are called by netpoll as we free the Tx ring via EQ (if enabled) */
> +	if (!budget)
> +		return 0;
> +
> +	pkt_count = hbl_en_handle_rx(port, budget);
> +
> +	/* If budget not fully consumed, exit the polling mode */
> +	if (pkt_count < budget) {
> +		napi_complete_done(napi, pkt_count);
> +		hdev->asic_funcs.reenable_rx_irq(port);
> +	}
> +
> +	return pkt_count;
> +}
> +
> +static void hbl_en_port_unregister(struct hbl_en_port *port)
> +{
> +	struct net_device *ndev = port->ndev;
> +
> +	unregister_netdev(ndev);
> +	free_netdev(ndev);
> +	port->ndev = NULL;
> +}
> +
> +static int hbl_en_set_asic_funcs(struct hbl_en_device *hdev)
> +{
> +	switch (hdev->asic_type) {
> +	case HBL_ASIC_GAUDI2:
> +	default:
> +		dev_err(hdev->dev, "Unrecognized ASIC type %d\n", hdev->asic_type);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static void hbl_en_handle_eqe(struct hbl_aux_dev *aux_dev, u32 port, struct hbl_cn_eqe *eqe)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	hdev->asic_funcs.handle_eqe(aux_dev, port, eqe);
> +}
> +
> +static void hbl_en_set_aux_ops(struct hbl_en_device *hdev, bool enable)
> +{
> +	struct hbl_en_aux_ops *aux_ops = hdev->aux_dev->aux_ops;
> +
> +	if (enable) {
> +		aux_ops->ports_reopen = hbl_en_ports_reopen;
> +		aux_ops->ports_stop_prepare = hbl_en_ports_stop_prepare;
> +		aux_ops->ports_stop = hbl_en_ports_stop;
> +		aux_ops->set_port_status = hbl_en_set_port_status;
> +		aux_ops->is_port_open = hbl_en_is_port_open;
> +		aux_ops->get_src_ip = hbl_en_get_src_ip;
> +		aux_ops->reset_stats = hbl_en_reset_stats;
> +		aux_ops->get_mtu = hbl_en_get_mtu;
> +		aux_ops->get_pflags = hbl_en_get_pflags;
> +		aux_ops->set_dev_lpbk = hbl_en_set_dev_lpbk;
> +		aux_ops->handle_eqe = hbl_en_handle_eqe;
> +	} else {
> +		aux_ops->ports_reopen = NULL;
> +		aux_ops->ports_stop_prepare = NULL;
> +		aux_ops->ports_stop = NULL;
> +		aux_ops->set_port_status = NULL;
> +		aux_ops->is_port_open = NULL;
> +		aux_ops->get_src_ip = NULL;
> +		aux_ops->reset_stats = NULL;
> +		aux_ops->get_mtu = NULL;
> +		aux_ops->get_pflags = NULL;
> +		aux_ops->set_dev_lpbk = NULL;
> +		aux_ops->handle_eqe = NULL;
> +	}
> +}
> +
> +int hbl_en_dev_init(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_asic_funcs *asic_funcs = &hdev->asic_funcs;
> +	struct hbl_en_port *port;
> +	int rc, i, port_cnt = 0;
> +
> +	/* must be called before the call to dev_init() */
> +	rc = hbl_en_set_asic_funcs(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to set aux ops\n");
> +		return rc;
> +	}
> +
> +	rc = asic_funcs->dev_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "device init failed\n");
> +		return rc;
> +	}
> +
> +	/* init the function pointers here before calling hbl_en_port_register which sets up
> +	 * net_device_ops, and its ops might start getting called.
> +	 * If any failure is encountered, these will be made NULL and the core driver won't call
> +	 * them.
> +	 */
> +	hbl_en_set_aux_ops(hdev, true);
> +
> +	/* Port register depends on the above initialization so it must be called here and not
> +	 * before that.
> +	 */
> +	for (i = 0; i < hdev->max_num_of_ports; i++, port_cnt++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		rc = hbl_en_port_init(port);
> +		if (rc) {
> +			dev_err(hdev->dev, "port init failed\n");
> +			goto unregister_ports;
> +		}
> +
> +		rc = hbl_en_port_register(port);
> +		if (rc) {
> +			dev_err(hdev->dev, "port register failed\n");
> +
> +			hbl_en_port_fini(port);
> +			goto unregister_ports;
> +		}
> +	}
> +
> +	hdev->is_initialized = true;
> +
> +	return 0;
> +
> +unregister_ports:
> +	for (i = 0; i < port_cnt; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		hbl_en_port_unregister(port);
> +		hbl_en_port_fini(port);
> +	}
> +
> +	hbl_en_set_aux_ops(hdev, false);
> +
> +	asic_funcs->dev_fini(hdev);
> +
> +	return rc;
> +}
> +
> +void hbl_en_dev_fini(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_asic_funcs *asic_funcs = &hdev->asic_funcs;
> +	struct hbl_en_port *port;
> +	int i;
> +
> +	hdev->in_teardown = true;
> +
> +	if (!hdev->is_initialized)
> +		return;
> +
> +	hdev->is_initialized = false;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be this cleanup flow is called after a failed init flow.
> +		 * Hence we need to check that we indeed have a netdev to unregister.
> +		 */
> +		if (!port->ndev)
> +			continue;
> +
> +		hbl_en_port_unregister(port);
> +		hbl_en_port_fini(port);
> +	}
> +
> +	hbl_en_set_aux_ops(hdev, false);
> +
> +	asic_funcs->dev_fini(hdev);
> +}
> +
> +dma_addr_t hbl_en_dma_map(struct hbl_en_device *hdev, void *addr, int len)
> +{
> +	dma_addr_t dma_addr;
> +
> +	if (hdev->dma_map_support)
> +		dma_addr = dma_map_single(&hdev->pdev->dev, addr, len, DMA_TO_DEVICE);
> +	else
> +		dma_addr = virt_to_phys(addr);
> +
> +	return dma_addr;
> +}
> +
> +void hbl_en_dma_unmap(struct hbl_en_device *hdev, dma_addr_t dma_addr, int len)
> +{
> +	if (hdev->dma_map_support)
> +		dma_unmap_single(&hdev->pdev->dev, dma_addr, len, DMA_TO_DEVICE);
> +}
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
> new file mode 100644
> index 000000000000..15504c1f3cfb
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
> @@ -0,0 +1,206 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#ifndef HABANALABS_EN_H_
> +#define HABANALABS_EN_H_
> +
> +#include <linux/net/intel/cn.h>
> +
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
> +
> +#define HBL_EN_NAME			"habanalabs_en"
> +
> +#define HBL_EN_PORT(aux_dev, idx)	(&(((struct hbl_en_device *)(aux_dev)->priv)->ports[(idx)]))
> +
> +#define hbl_netdev_priv(ndev) \
> +({ \
> +	typecheck(struct net_device *, ndev); \
> +	*(struct hbl_en_port **)netdev_priv(ndev); \
> +})
> +
> +/**
> + * enum hbl_en_eth_pkt_status - status of Rx Ethernet packet.
> + * ETH_PKT_OK: packet was received successfully.
> + * ETH_PKT_DROP: packet should be dropped.
> + * ETH_PKT_NONE: no available packet.
> + */
> +enum hbl_en_eth_pkt_status {
> +	ETH_PKT_OK,
> +	ETH_PKT_DROP,
> +	ETH_PKT_NONE
> +};
> +
> +/**
> + * struct hbl_en_net_stats - stats of Ethernet interface.
> + * rx_packets: number of packets received.
> + * tx_packets: number of packets sent.
> + * rx_bytes: total bytes of data received.
> + * tx_bytes: total bytes of data sent.
> + * tx_errors: number of errors in the TX.
> + * rx_dropped: number of packets dropped by the RX.
> + * tx_dropped: number of packets dropped by the TX.
> + */
> +struct hbl_en_net_stats {
> +	u64 rx_packets;
> +	u64 tx_packets;
> +	u64 rx_bytes;
> +	u64 tx_bytes;
> +	u64 tx_errors;
> +	atomic64_t rx_dropped;
> +	atomic64_t tx_dropped;
> +};
> +
> +/**
> + * struct hbl_en_port - manage port common structure.
> + * @hdev: habanalabs Ethernet device structure.
> + * @ndev: network device.
> + * @rx_wq: WQ for Rx poll when we cannot schedule NAPI poll.
> + * @mac_addr: HW MAC addresses.
> + * @asic_specific: ASIC specific port structure.
> + * @napi: New API structure.
> + * @rx_poll_work: Rx work for polling mode.
> + * @net_stats: statistics of the ethernet interface.
> + * @in_reset: true if the NIC was marked as in reset, false otherwise. Used to avoid an additional
> + *            stopping of the NIC if a hard reset was re-initiated.
> + * @pflags: ethtool private flags bit mask.
> + * @idx: index of this specific port.
> + * @rx_max_coalesced_frames: Maximum number of packets to receive before an RX interrupt.
> + * @tx_max_coalesced_frames: Maximum number of packets to be sent before a TX interrupt.
> + * @rx_coalesce_usecs: How many usecs to delay an RX interrupt after a packet arrives.
> + * @is_initialized: true if the port H/W is initialized, false otherwise.
> + * @pfc_enable: true if this port supports Priority Flow Control, false otherwise.
> + * @auto_neg_enable: is autoneg enabled.
> + * @auto_neg_resolved: was autoneg phase finished successfully.
> + */
> +struct hbl_en_port {
> +	struct hbl_en_device *hdev;
> +	struct net_device *ndev;
> +	struct workqueue_struct *rx_wq;
> +	char *mac_addr;
> +	void *asic_specific;
> +	struct napi_struct napi;
> +	struct delayed_work rx_poll_work;
> +	struct hbl_en_net_stats net_stats;
> +	atomic_t in_reset;
> +	u32 pflags;
> +	u32 idx;
> +	u32 rx_max_coalesced_frames;
> +	u32 tx_max_coalesced_frames;
> +	u16 rx_coalesce_usecs;
> +	u8 is_initialized;
> +	u8 pfc_enable;
> +	u8 auto_neg_enable;
> +	u8 auto_neg_resolved;
> +};
> +
> +/**
> + * struct hbl_en_asic_funcs - ASIC specific Ethernet functions.
> + * @dev_init: device init.
> + * @dev_fini: device cleanup.
> + * @reenable_rx_irq: re-enable Rx interrupts.
> + * @eth_port_open: initialize and open the Ethernet port.
> + * @eth_port_close: close the Ethernet port.
> + * @write_pkt_to_hw: write skb to HW.
> + * @read_pkt_from_hw: read pkt from HW.
> + * @get_pfc_cnts: get PFC counters.
> + * @set_coalesce: set Tx/Rx coalesce config in HW.
> + * @get_rx_ring size: get max number of elements the Rx ring can contain.
> + * @handle_eqe: Handle a received event.
> + */
> +struct hbl_en_asic_funcs {
> +	int (*dev_init)(struct hbl_en_device *hdev);
> +	void (*dev_fini)(struct hbl_en_device *hdev);
> +	void (*reenable_rx_irq)(struct hbl_en_port *port);
> +	int (*eth_port_open)(struct hbl_en_port *port);
> +	void (*eth_port_close)(struct hbl_en_port *port);
> +	netdev_tx_t (*write_pkt_to_hw)(struct hbl_en_port *port, struct sk_buff *skb);
> +	int (*read_pkt_from_hw)(struct hbl_en_port *port, void **pkt_addr, u32 *pkt_size);
> +	void (*get_pfc_cnts)(struct hbl_en_port *port, void *ptr);
> +	int (*set_coalesce)(struct hbl_en_port *port);
> +	int (*get_rx_ring_size)(struct hbl_en_port *port);
> +	void (*handle_eqe)(struct hbl_aux_dev *aux_dev, u32 port_idx, struct hbl_cn_eqe *eqe);
> +};
> +
> +/**
> + * struct hbl_en_device - habanalabs Ethernet device structure.
> + * @pdev: pointer to PCI device.
> + * @dev: related kernel basic device structure.
> + * @ports: array of all ports manage common structures.
> + * @aux_dev: pointer to auxiliary device.
> + * @asic_specific: ASIC specific device structure.
> + * @fw_ver: FW version.
> + * @qsfp_eeprom: QSFPD EEPROM info.
> + * @mac_addr: array of all MAC addresses.
> + * @asic_funcs: ASIC specific Ethernet functions.
> + * @asic_type: ASIC specific type.
> + * @ports_mask: mask of available ports.
> + * @auto_neg_mask: mask of port with Autonegotiation enabled.
> + * @port_reset_timeout: max time in seconds for a port reset flow to finish.
> + * @pending_reset_long_timeout: long timeout for pending hard reset to finish in seconds.
> + * @max_frm_len: maximum allowed frame length.
> + * @raw_elem_size: size of element in raw buffers.
> + * @max_raw_mtu: maximum MTU size for raw packets.
> + * @min_raw_mtu: minimum MTU size for raw packets.
> + * @pad_size: the pad size in bytes for the skb to transmit.
> + * @core_dev_id: core device ID.
> + * @max_num_of_ports: max number of available ports;
> + * @in_reset: is the entire NIC currently under reset.
> + * @poll_enable: Enable Rx polling rather than IRQ + NAPI.
> + * @in_teardown: true if the NIC is in teardown (during device remove).
> + * @is_initialized: was the device initialized successfully.
> + * @has_eq: true if event queue is supported.
> + * @dma_map_support: HW supports DMA mapping.
> + */
> +struct hbl_en_device {
> +	struct pci_dev *pdev;
> +	struct device *dev;
> +	struct hbl_en_port *ports;
> +	struct hbl_aux_dev *aux_dev;
> +	void *asic_specific;
> +	char *fw_ver;
> +	char *qsfp_eeprom;
> +	char *mac_addr;
> +	struct hbl_en_asic_funcs asic_funcs;
> +	enum hbl_cn_asic_type asic_type;
> +	u64 ports_mask;
> +	u64 auto_neg_mask;
> +	u32 port_reset_timeout;
> +	u32 pending_reset_long_timeout;
> +	u32 max_frm_len;
> +	u32 raw_elem_size;
> +	u16 max_raw_mtu;
> +	u16 min_raw_mtu;
> +	u16 pad_size;
> +	u16 core_dev_id;
> +	u8 max_num_of_ports;
> +	u8 in_reset;
> +	u8 poll_enable;
> +	u8 in_teardown;
> +	u8 is_initialized;
> +	u8 has_eq;
> +	u8 dma_map_support;
> +};
> +
> +int hbl_en_dev_init(struct hbl_en_device *hdev);
> +void hbl_en_dev_fini(struct hbl_en_device *hdev);
> +
> +const struct ethtool_ops *hbl_en_ethtool_get_ops(struct net_device *ndev);
> +void hbl_en_ethtool_init_coalesce(struct hbl_en_port *port);
> +
> +extern const struct dcbnl_rtnl_ops hbl_en_dcbnl_ops;
> +
> +bool hbl_en_rx_poll_start(struct hbl_en_port *port);
> +void hbl_en_rx_poll_stop(struct hbl_en_port *port);
> +void hbl_en_rx_poll_trigger_init(struct hbl_en_port *port);
> +int hbl_en_port_reset(struct hbl_en_port *port);
> +int hbl_en_port_reset_locked(struct hbl_aux_dev *aux_dev, u32 port_idx);
> +int hbl_en_handle_rx(struct hbl_en_port *port, int budget);
> +dma_addr_t hbl_en_dma_map(struct hbl_en_device *hdev, void *addr, int len);
> +void hbl_en_dma_unmap(struct hbl_en_device *hdev, dma_addr_t dma_addr, int len);
> +
> +#endif /* HABANALABS_EN_H_ */
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
> new file mode 100644
> index 000000000000..5d718579a2b6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
> @@ -0,0 +1,101 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +
> +#define PFC_PRIO_MASK_ALL	GENMASK(HBL_EN_PFC_PRIO_NUM - 1, 0)
> +#define PFC_PRIO_MASK_NONE	0
> +
> +#ifdef CONFIG_DCB
> +static int hbl_en_dcbnl_ieee_getpfc(struct net_device *netdev, struct ieee_pfc *pfc)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev;
> +	u32 port_idx;
> +
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_dbg_ratelimited(hdev->dev, "port %d is in reset, can't get PFC", port_idx);
> +		return -EBUSY;
> +	}
> +
> +	pfc->pfc_en = port->pfc_enable ? PFC_PRIO_MASK_ALL : PFC_PRIO_MASK_NONE;
> +	pfc->pfc_cap = HBL_EN_PFC_PRIO_NUM;
> +
> +	hdev->asic_funcs.get_pfc_cnts(port, pfc);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return 0;
> +}
> +
> +static int hbl_en_dcbnl_ieee_setpfc(struct net_device *netdev, struct ieee_pfc *pfc)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	u8 curr_pfc_en;
> +	u32 port_idx;
> +	int rc = 0;
> +
> +	hdev = port->hdev;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	port_idx = port->idx;
> +
> +	if (pfc->pfc_en & ~PFC_PRIO_MASK_ALL) {
> +		dev_dbg_ratelimited(hdev->dev, "PFC supports %d priorities only, port %d\n",
> +				    HBL_EN_PFC_PRIO_NUM, port_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (pfc->pfc_en != PFC_PRIO_MASK_NONE && pfc->pfc_en != PFC_PRIO_MASK_ALL) {
> +		dev_dbg_ratelimited(hdev->dev,
> +				    "PFC should be enabled/disabled on all priorities, port %d\n",
> +				    port_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_dbg_ratelimited(hdev->dev, "port %d is in reset, can't set PFC", port_idx);
> +		return -EBUSY;
> +	}
> +
> +	curr_pfc_en = port->pfc_enable ? PFC_PRIO_MASK_ALL : PFC_PRIO_MASK_NONE;
> +
> +	if (pfc->pfc_en == curr_pfc_en)
> +		goto out;
> +
> +	port->pfc_enable = !port->pfc_enable;
> +
> +	rc = aux_ops->set_pfc(aux_dev, port_idx, port->pfc_enable);
> +
> +out:
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static u8 hbl_en_dcbnl_getdcbx(struct net_device *netdev)
> +{
> +	return DCB_CAP_DCBX_HOST | DCB_CAP_DCBX_VER_IEEE;
> +}
> +
> +static u8 hbl_en_dcbnl_setdcbx(struct net_device *netdev, u8 mode)
> +{
> +	return !(mode == (DCB_CAP_DCBX_HOST | DCB_CAP_DCBX_VER_IEEE));
> +}
> +
> +const struct dcbnl_rtnl_ops hbl_en_dcbnl_ops = {
> +	.ieee_getpfc	= hbl_en_dcbnl_ieee_getpfc,
> +	.ieee_setpfc	= hbl_en_dcbnl_ieee_setpfc,
> +	.getdcbx	= hbl_en_dcbnl_getdcbx,
> +	.setdcbx	= hbl_en_dcbnl_setdcbx
> +};
> +#endif
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
> new file mode 100644
> index 000000000000..23a87d36ded5
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#define pr_fmt(fmt)		"habanalabs_en: " fmt
> +
> +#include "hbl_en.h"
> +
> +#include <linux/module.h>
> +#include <linux/auxiliary_bus.h>
> +
> +#define HBL_DRIVER_AUTHOR	"HabanaLabs Kernel Driver Team"
> +
> +#define HBL_DRIVER_DESC		"HabanaLabs AI accelerators Ethernet driver"
> +
> +MODULE_AUTHOR(HBL_DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(HBL_DRIVER_DESC);
> +MODULE_LICENSE("GPL");
> +
> +static bool poll_enable;
> +
> +module_param(poll_enable, bool, 0444);
> +MODULE_PARM_DESC(poll_enable,
> +		 "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
> +
> +static int hdev_init(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_aux_data *aux_data = aux_dev->aux_data;
> +	struct hbl_en_port *ports, *port;
> +	struct hbl_en_device *hdev;
> +	int rc, i;
> +
> +	hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
> +	if (!hdev)
> +		return -ENOMEM;
> +
> +	ports = kcalloc(aux_data->max_num_of_ports, sizeof(*ports), GFP_KERNEL);
> +	if (!ports) {
> +		rc = -ENOMEM;
> +		goto ports_alloc_fail;
> +	}
> +
> +	aux_dev->priv = hdev;
> +	hdev->aux_dev = aux_dev;
> +	hdev->ports = ports;
> +	hdev->pdev = aux_data->pdev;
> +	hdev->dev = aux_data->dev;
> +	hdev->ports_mask = aux_data->ports_mask;
> +	hdev->auto_neg_mask = aux_data->auto_neg_mask;
> +	hdev->max_num_of_ports = aux_data->max_num_of_ports;
> +	hdev->core_dev_id = aux_data->id;
> +	hdev->fw_ver = aux_data->fw_ver;
> +	hdev->qsfp_eeprom = aux_data->qsfp_eeprom;
> +	hdev->asic_type = aux_data->asic_type;
> +	hdev->pending_reset_long_timeout = aux_data->pending_reset_long_timeout;
> +	hdev->max_frm_len = aux_data->max_frm_len;
> +	hdev->raw_elem_size = aux_data->raw_elem_size;
> +	hdev->max_raw_mtu = aux_data->max_raw_mtu;
> +	hdev->min_raw_mtu = aux_data->min_raw_mtu;
> +	hdev->pad_size = ETH_ZLEN;
> +	hdev->has_eq = aux_data->has_eq;
> +	hdev->dma_map_support = true;
> +	hdev->poll_enable = poll_enable;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +		port->hdev = hdev;
> +		port->idx = i;
> +		port->pfc_enable = true;
> +		port->pflags = PFLAGS_PCS_LINK_CHECK | PFLAGS_PHY_AUTO_NEG_LPBK;
> +		port->mac_addr = aux_data->mac_addr[i];
> +		port->auto_neg_enable = !!(aux_data->auto_neg_mask & BIT(i));
> +	}
> +
> +	return 0;
> +
> +ports_alloc_fail:
> +	kfree(hdev);
> +
> +	return rc;
> +}
> +
> +static void hdev_fini(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	kfree(hdev->ports);
> +	kfree(hdev);
> +	aux_dev->priv = NULL;
> +}
> +
> +static const struct auxiliary_device_id hbl_en_id_table[] = {
> +	{ .name = "habanalabs_cn.en", },
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, hbl_en_id_table);
> +
> +static int hbl_en_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
> +{
> +	struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
> +	struct hbl_en_aux_ops *aux_ops = aux_dev->aux_ops;
> +	struct hbl_en_device *hdev;
> +	ktime_t timeout;
> +	int rc;
> +
> +	rc = hdev_init(aux_dev);
> +	if (rc) {
> +		dev_err(&aux_dev->adev.dev, "Failed to init hdev\n");
> +		return -EIO;
> +	}
> +
> +	hdev = aux_dev->priv;
> +
> +	/* don't allow module unloading while it is attached */
> +	if (!try_module_get(THIS_MODULE)) {
> +		dev_err(hdev->dev, "Failed to increment %s module refcount\n", HBL_EN_NAME);
> +		rc = -EIO;
> +		goto module_get_err;
> +	}
> +
> +	timeout = ktime_add_ms(ktime_get(), hdev->pending_reset_long_timeout * MSEC_PER_SEC);
> +	while (1) {
> +		aux_ops->hw_access_lock(aux_dev);
> +
> +		/* if the device is operational, proceed to actual init while holding the lock in
> +		 * order to prevent concurrent hard reset
> +		 */
> +		if (aux_ops->device_operational(aux_dev))
> +			break;
> +
> +		aux_ops->hw_access_unlock(aux_dev);
> +
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			dev_err(hdev->dev, "Timeout while waiting for hard reset to finish\n");
> +			rc = -EBUSY;
> +			goto timeout_err;
> +		}
> +
> +		dev_notice_once(hdev->dev, "Waiting for hard reset to finish before probing en\n");
> +
> +		msleep_interruptible(MSEC_PER_SEC);
> +	}
> +
> +	rc = hbl_en_dev_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to init en device\n");
> +		goto dev_init_err;
> +	}
> +
> +	aux_ops->hw_access_unlock(aux_dev);
> +
> +	return 0;
> +
> +dev_init_err:
> +	aux_ops->hw_access_unlock(aux_dev);
> +timeout_err:
> +	module_put(THIS_MODULE);
> +module_get_err:
> +	hdev_fini(aux_dev);
> +
> +	return rc;
> +}
> +
> +/* This function can be called only from the CN driver when deleting the aux bus, because we
> + * incremented the module refcount on probing. Hence no need to protect here from hard reset.
> + */
> +static void hbl_en_remove(struct auxiliary_device *adev)
> +{
> +	struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	if (!hdev)
> +		return;
> +
> +	hbl_en_dev_fini(hdev);
> +
> +	/* allow module unloading as now it is detached */
> +	module_put(THIS_MODULE);
> +
> +	hdev_fini(aux_dev);
> +}
> +
> +static struct auxiliary_driver hbl_en_driver = {
> +	.name = "eth",
> +	.probe = hbl_en_probe,
> +	.remove = hbl_en_remove,
> +	.id_table = hbl_en_id_table,
> +};
> +
> +static int __init hbl_en_init(void)
> +{
> +	pr_info("loading driver\n");
> +
> +	return auxiliary_driver_register(&hbl_en_driver);
> +}
> +
> +static void __exit hbl_en_exit(void)
> +{
> +	auxiliary_driver_unregister(&hbl_en_driver);
> +
> +	pr_info("driver removed\n");
> +}
> +
> +module_init(hbl_en_init);
> +module_exit(hbl_en_exit);
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> new file mode 100644
> index 000000000000..1d14d283409b
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> @@ -0,0 +1,452 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +#include <linux/ethtool.h>
> +
> +#define RX_COALESCED_FRAMES_MIN		1
> +#define TX_COALESCED_FRAMES_MIN		1
> +#define TX_COALESCED_FRAMES_MAX		10
> +
> +static const char pflags_str[][ETH_GSTRING_LEN] = {
> +	"pcs-link-check",
> +	"phy-auto-neg-lpbk",
> +};
> +
> +#define NIC_STAT(m) {#m, offsetof(struct hbl_en_port, net_stats.m)}
> +
> +static struct hbl_cn_stat netdev_eth_stats[] = {
> +	NIC_STAT(rx_packets),
> +	NIC_STAT(tx_packets),
> +	NIC_STAT(rx_bytes),
> +	NIC_STAT(tx_bytes),
> +	NIC_STAT(tx_errors),
> +	NIC_STAT(rx_dropped),
> +	NIC_STAT(tx_dropped)
> +};
> +
> +static size_t pflags_str_len = ARRAY_SIZE(pflags_str);
> +static size_t netdev_eth_stats_len = ARRAY_SIZE(netdev_eth_stats);
> +
> +static void hbl_en_ethtool_get_drvinfo(struct net_device *ndev, struct ethtool_drvinfo *drvinfo)
> +{
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +
> +	strscpy(drvinfo->driver, HBL_EN_NAME, sizeof(drvinfo->driver));
> +	strscpy(drvinfo->fw_version, hdev->fw_ver, sizeof(drvinfo->fw_version));
> +	strscpy(drvinfo->bus_info, pci_name(hdev->pdev), sizeof(drvinfo->bus_info));
> +}
> +
> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
> +{
> +	modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
> +	modinfo->type = ETH_MODULE_SFF_8636;
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_get_module_eeprom(struct net_device *ndev, struct ethtool_eeprom *ee,
> +					    u8 *data)
> +{
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 first, last, len;
> +	u8 *qsfp_eeprom;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	qsfp_eeprom = hdev->qsfp_eeprom;
> +
> +	if (ee->len == 0)
> +		return -EINVAL;
> +
> +	first = ee->offset;
> +	last = ee->offset + ee->len;
> +
> +	if (first < ETH_MODULE_SFF_8636_LEN) {
> +		len = min_t(unsigned int, last, ETH_MODULE_SFF_8079_LEN);
> +		len -= first;
> +
> +		memcpy(data, qsfp_eeprom + first, len);
> +	}
> +
> +	return 0;
> +}
> +
> +static u32 hbl_en_ethtool_get_priv_flags(struct net_device *ndev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +
> +	return port->pflags;
> +}
> +
> +static int hbl_en_ethtool_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +
> +	port->pflags = priv_flags;
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_get_link_ksettings(struct net_device *ndev,
> +					     struct ethtool_link_ksettings *cmd)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 port_idx, speed;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	speed = aux_ops->get_speed(aux_dev, port_idx);
> +
> +	cmd->base.speed = speed;
> +	cmd->base.duplex = DUPLEX_FULL;
> +
> +	ethtool_link_ksettings_zero_link_mode(cmd, supported);
> +	ethtool_link_ksettings_zero_link_mode(cmd, advertising);
> +
> +	switch (speed) {
> +	case SPEED_100000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseLR4_ER4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseLR4_ER4_Full);
> +
> +		cmd->base.port = PORT_FIBRE;
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, Backplane);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Backplane);
> +		break;
> +	case SPEED_50000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseKR2_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseKR2_Full);
> +		break;
> +	case SPEED_25000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 25000baseCR_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 25000baseCR_Full);
> +		break;
> +	case SPEED_200000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseKR4_Full);
> +		break;
> +	case SPEED_400000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseKR4_Full);
> +		break;
> +	default:
> +		netdev_err(port->ndev, "unknown speed %d\n", speed);
> +		return -EFAULT;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
> +
> +	if (port->auto_neg_enable) {
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
> +		cmd->base.autoneg = AUTONEG_ENABLE;
> +		if (port->auto_neg_resolved)
> +			ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
> +	} else {
> +		cmd->base.autoneg = AUTONEG_DISABLE;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
> +
> +	if (port->pfc_enable)
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);
> +
> +	return 0;
> +}
> +
> +/* only autoneg is mutable */
> +static bool check_immutable_ksettings(const struct ethtool_link_ksettings *old_cmd,
> +				      const struct ethtool_link_ksettings *new_cmd)
> +{
> +	return (old_cmd->base.speed == new_cmd->base.speed) &&
> +	       (old_cmd->base.duplex == new_cmd->base.duplex) &&
> +	       (old_cmd->base.port == new_cmd->base.port) &&
> +	       (old_cmd->base.phy_address == new_cmd->base.phy_address) &&
> +	       (old_cmd->base.eth_tp_mdix_ctrl == new_cmd->base.eth_tp_mdix_ctrl) &&
> +	       bitmap_equal(old_cmd->link_modes.advertising, new_cmd->link_modes.advertising,
> +			    __ETHTOOL_LINK_MODE_MASK_NBITS);
> +}
> +
> +static int
> +hbl_en_ethtool_set_link_ksettings(struct net_device *ndev, const struct ethtool_link_ksettings *cmd)
> +{
> +	struct ethtool_link_ksettings curr_cmd;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	bool auto_neg;
> +	u32 port_idx;
> +	int rc;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +
> +	memset(&curr_cmd, 0, sizeof(struct ethtool_link_ksettings));
> +
> +	rc = hbl_en_ethtool_get_link_ksettings(ndev, &curr_cmd);
> +	if (rc)
> +		return rc;
> +
> +	if (!check_immutable_ksettings(&curr_cmd, cmd))
> +		return -EOPNOTSUPP;
> +
> +	auto_neg = cmd->base.autoneg == AUTONEG_ENABLE;
> +
> +	if (port->auto_neg_enable == auto_neg)
> +		return 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(port->ndev, "port is in reset, can't update settings\n");
> +		return -EBUSY;
> +	}
> +
> +	if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
> +		netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	port->auto_neg_enable = auto_neg;
> +
> +	if (netif_running(port->ndev)) {
> +		rc = hbl_en_port_reset(port);
> +		if (rc)
> +			netdev_err(port->ndev, "Failed to reset port for settings update, rc %d\n",
> +				   rc);
> +	}
> +
> +out:
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_ethtool_get_sset_count(struct net_device *ndev, int sset)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	switch (sset) {
> +	case ETH_SS_STATS:
> +		return netdev_eth_stats_len + aux_ops->get_cnts_num(aux_dev, port_idx);
> +	case ETH_SS_PRIV_FLAGS:
> +		return pflags_str_len;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +static void hbl_en_ethtool_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int i;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	switch (stringset) {
> +	case ETH_SS_STATS:
> +		for (i = 0; i < netdev_eth_stats_len; i++)
> +			ethtool_puts(&data, netdev_eth_stats[i].str);
> +
> +		aux_ops->get_cnts_names(aux_dev, port_idx, data);
> +		break;
> +	case ETH_SS_PRIV_FLAGS:
> +		for (i = 0; i < pflags_str_len; i++)
> +			ethtool_puts(&data, pflags_str[i]);
> +		break;
> +	}
> +}
> +
> +static void hbl_en_ethtool_get_ethtool_stats(struct net_device *ndev,
> +					     __always_unused struct ethtool_stats *stats, u64 *data)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	u32 port_idx;
> +	char *p;
> +	int i;
> +
> +	hdev = port->hdev;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	port_idx = port->idx;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_info_ratelimited(hdev->dev, "port %d is in reset, can't get ethtool stats",
> +				     port_idx);
> +		return;
> +	}
> +
> +	/* Even though the Ethernet Rx/Tx flow might update the stats in parallel, there is not an
> +	 * absolute need for synchronisation. This is because, missing few counts of these stats is
> +	 * much better than adding a lock to synchronize and increase the overhead of the Rx/Tx
> +	 * flows. In worst case scenario, reader will get stale stats. He will receive updated
> +	 * stats in next read.
> +	 */
> +	for (i = 0; i < netdev_eth_stats_len; i++) {
> +		p = (char *)port + netdev_eth_stats[i].lo_offset;
> +		data[i] = *(u32 *)p;
> +	}
> +
> +	data += i;
> +
> +	aux_ops->get_cnts_values(aux_dev, port_idx, data);
> +
> +	atomic_set(&port->in_reset, 0);
> +}
> +
> +static int hbl_en_ethtool_get_coalesce(struct net_device *ndev,
> +				       struct ethtool_coalesce *coal,
> +				       struct kernel_ethtool_coalesce *kernel_coal,
> +				       struct netlink_ext_ack *extack)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +
> +	coal->tx_max_coalesced_frames = port->tx_max_coalesced_frames;
> +	coal->rx_coalesce_usecs = port->rx_coalesce_usecs;
> +	coal->rx_max_coalesced_frames = port->rx_max_coalesced_frames;
> +
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_set_coalesce(struct net_device *ndev,
> +				       struct ethtool_coalesce *coal,
> +				       struct kernel_ethtool_coalesce *kernel_coal,
> +				       struct netlink_ext_ack *extack)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc, rx_ring_size;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(port->ndev, "port is in reset, can't update settings\n");
> +		return -EBUSY;
> +	}
> +
> +	if (coal->tx_max_coalesced_frames < TX_COALESCED_FRAMES_MIN ||
> +	    coal->tx_max_coalesced_frames > TX_COALESCED_FRAMES_MAX) {
> +		netdev_err(ndev, "tx max_coalesced_frames should be between %d and %d\n",
> +			   TX_COALESCED_FRAMES_MIN, TX_COALESCED_FRAMES_MAX);
> +		rc = -EINVAL;
> +		goto atomic_out;
> +	}
> +
> +	rx_ring_size = hdev->asic_funcs.get_rx_ring_size(port);
> +	if (coal->rx_max_coalesced_frames < RX_COALESCED_FRAMES_MIN ||
> +	    coal->rx_max_coalesced_frames >= rx_ring_size) {
> +		netdev_err(ndev, "rx max_coalesced_frames should be between %d and %d\n",
> +			   RX_COALESCED_FRAMES_MIN, rx_ring_size);
> +		rc = -EINVAL;
> +		goto atomic_out;
> +	}
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +
> +	port->tx_max_coalesced_frames = coal->tx_max_coalesced_frames;
> +	port->rx_coalesce_usecs = coal->rx_coalesce_usecs;
> +	port->rx_max_coalesced_frames = coal->rx_max_coalesced_frames;
> +
> +	rc = hdev->asic_funcs.set_coalesce(port);
> +
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +atomic_out:
> +	atomic_set(&port->in_reset, 0);
> +	return rc;
> +}
> +
> +void hbl_en_ethtool_init_coalesce(struct hbl_en_port *port)
> +{
> +	port->rx_coalesce_usecs = CQ_ARM_TIMEOUT_USEC;
> +	port->rx_max_coalesced_frames = 1;
> +	port->tx_max_coalesced_frames = 1;
> +}
> +
> +static const struct ethtool_ops hbl_en_ethtool_ops_coalesce = {
> +	.supported_coalesce_params = ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_RX_MAX_FRAMES |
> +				     ETHTOOL_COALESCE_TX_MAX_FRAMES,
> +	.get_drvinfo = hbl_en_ethtool_get_drvinfo,
> +	.get_link = ethtool_op_get_link,
> +	.get_module_info = hbl_en_ethtool_get_module_info,
> +	.get_module_eeprom = hbl_en_ethtool_get_module_eeprom,
> +	.get_priv_flags = hbl_en_ethtool_get_priv_flags,
> +	.set_priv_flags = hbl_en_ethtool_set_priv_flags,
> +	.get_link_ksettings = hbl_en_ethtool_get_link_ksettings,
> +	.set_link_ksettings = hbl_en_ethtool_set_link_ksettings,
> +	.get_sset_count = hbl_en_ethtool_get_sset_count,
> +	.get_strings = hbl_en_ethtool_get_strings,
> +	.get_ethtool_stats = hbl_en_ethtool_get_ethtool_stats,
> +	.get_coalesce = hbl_en_ethtool_get_coalesce,
> +	.set_coalesce = hbl_en_ethtool_set_coalesce,
> +};
> +
> +const struct ethtool_ops *hbl_en_ethtool_get_ops(struct net_device *ndev)
> +{
> +	return &hbl_en_ethtool_ops_coalesce;
> +}

Zhu Yanjun June 15, 2024, 5:13 p.m. UTC | #6

在 2024/6/13 16:22, Omer Shpigelman 写道:
> This ethernet driver is initialized via auxiliary bus by the hbl_cn
> driver.
> It serves mainly for control operations that are needed for AI scaling.
> 
> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> Co-developed-by: Abhilash K V <kvabhilash@habana.ai>
> Signed-off-by: Abhilash K V <kvabhilash@habana.ai>
> Co-developed-by: Andrey Agranovich <aagranovich@habana.ai>
> Signed-off-by: Andrey Agranovich <aagranovich@habana.ai>
> Co-developed-by: Bharat Jauhari <bjauhari@habana.ai>
> Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
> Co-developed-by: David Meriin <dmeriin@habana.ai>
> Signed-off-by: David Meriin <dmeriin@habana.ai>
> Co-developed-by: Sagiv Ozeri <sozeri@habana.ai>
> Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
> Co-developed-by: Zvika Yehudai <zyehudai@habana.ai>
> Signed-off-by: Zvika Yehudai <zyehudai@habana.ai>
> ---
>   MAINTAINERS                                   |    9 +
>   drivers/net/ethernet/intel/Kconfig            |   18 +
>   drivers/net/ethernet/intel/Makefile           |    1 +
>   drivers/net/ethernet/intel/hbl_en/Makefile    |    9 +
>   .../net/ethernet/intel/hbl_en/common/Makefile |    3 +
>   .../net/ethernet/intel/hbl_en/common/hbl_en.c | 1168 +++++++++++++++++
>   .../net/ethernet/intel/hbl_en/common/hbl_en.h |  206 +++
>   .../intel/hbl_en/common/hbl_en_dcbnl.c        |  101 ++
>   .../ethernet/intel/hbl_en/common/hbl_en_drv.c |  211 +++
>   .../intel/hbl_en/common/hbl_en_ethtool.c      |  452 +++++++
>   10 files changed, 2178 insertions(+)
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/Makefile
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/Makefile
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
>   create mode 100644 drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 096439a62129..7301f38e9cfb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9617,6 +9617,15 @@ F:	include/linux/habanalabs/
>   F:	include/linux/net/intel/cn*
>   F:	include/linux/net/intel/gaudi2*
>   
> +HABANALABS ETHERNET DRIVER
> +M:	Omer Shpigelman <oshpigelman@habana.ai>
> +L:	netdev@vger.kernel.org
> +S:	Supported
> +W:	https://www.habana.ai
> +F:	Documentation/networking/device_drivers/ethernet/intel/hbl.rst
> +F:	drivers/net/ethernet/intel/hbl_en/
> +F:	include/linux/net/intel/cn*
> +
>   HACKRF MEDIA DRIVER
>   L:	linux-media@vger.kernel.org
>   S:	Orphan
> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> index 0d1b8a2bae99..5d07349348a0 100644
> --- a/drivers/net/ethernet/intel/Kconfig
> +++ b/drivers/net/ethernet/intel/Kconfig
> @@ -417,4 +417,22 @@ config HABANA_CN
>   	  To compile this driver as a module, choose M here. The module
>   	  will be called habanalabs_cn.
>   
> +config HABANA_EN
> +	tristate "HabanaLabs (an Intel Company) Ethernet driver"
> +	depends on NETDEVICES && ETHERNET && INET
> +	select HABANA_CN
> +	help
> +	  This driver enables Ethernet functionality for the network interfaces
> +	  that are part of the GAUDI ASIC family of AI Accelerators.
> +	  For more information on how to identify your adapter, go to the
> +	  Adapter & Driver ID Guide that can be located at:
> +
> +	  <http://support.intel.com>
> +
> +	  More specific information on configuring the driver is in
> +	  <file:Documentation/networking/device_drivers/ethernet/intel/hbl.rst>.
> +
> +	  To compile this driver as a module, choose M here. The module
> +	  will be called habanalabs_en.
> +
>   endif # NET_VENDOR_INTEL
> diff --git a/drivers/net/ethernet/intel/Makefile b/drivers/net/ethernet/intel/Makefile
> index 10049a28e336..ec62a0227897 100644
> --- a/drivers/net/ethernet/intel/Makefile
> +++ b/drivers/net/ethernet/intel/Makefile
> @@ -20,3 +20,4 @@ obj-$(CONFIG_FM10K) += fm10k/
>   obj-$(CONFIG_ICE) += ice/
>   obj-$(CONFIG_IDPF) += idpf/
>   obj-$(CONFIG_HABANA_CN) += hbl_cn/
> +obj-$(CONFIG_HABANA_EN) += hbl_en/
> diff --git a/drivers/net/ethernet/intel/hbl_en/Makefile b/drivers/net/ethernet/intel/hbl_en/Makefile
> new file mode 100644
> index 000000000000..695497ab93b6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/Makefile
> @@ -0,0 +1,9 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Makefile for HabanaLabs (an Intel Company) Ethernet network driver
> +#
> +
> +obj-$(CONFIG_HABANA_EN) := habanalabs_en.o
> +
> +include $(src)/common/Makefile
> +habanalabs_en-y += $(HBL_EN_COMMON_FILES)
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/Makefile b/drivers/net/ethernet/intel/hbl_en/common/Makefile
> new file mode 100644
> index 000000000000..a3ccb5dbf4a6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +HBL_EN_COMMON_FILES := common/hbl_en_drv.o common/hbl_en.o \
> +	common/hbl_en_ethtool.o common/hbl_en_dcbnl.o
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
> new file mode 100644
> index 000000000000..066be5ac2d84
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.c
> @@ -0,0 +1,1168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +#include <linux/inetdevice.h>
> +
> +#define TX_TIMEOUT			(5 * HZ)
> +#define PORT_RESET_TIMEOUT_MSEC		(60 * 1000ull) /* 60s */
> +
> +/**
> + * struct hbl_en_tx_pkt_work - used to schedule a work of a Tx packet.
> + * @tx_work: workqueue object to run when packet needs to be sent.
> + * @port: pointer to current port structure.
> + * @skb: copy of the packet to send.
> + */
> +struct hbl_en_tx_pkt_work {
> +	struct work_struct tx_work;
> +	struct hbl_en_port *port;
> +	struct sk_buff *skb;
> +};
> +
> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget);
> +static int hbl_en_port_open(struct hbl_en_port *port);
> +
> +static int hbl_en_ports_reopen(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int rc = 0, i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be that the port was shutdown by 'ip link set down' and there is no need
> +		 * in reopening it.
> +		 * Since we mark the ports as in reset even if they are disabled, we clear the flag
> +		 * here anyway.
> +		 * See hbl_en_ports_stop_prepare() for more info.
> +		 */
> +		if (!netif_running(port->ndev)) {
> +			atomic_set(&port->in_reset, 0);
> +			continue;
> +		}
> +
> +		rc = hbl_en_port_open(port);
> +
> +		atomic_set(&port->in_reset, 0);
> +
> +		if (rc)
> +			break;
> +	}
> +
> +	hdev->in_reset = false;
> +
> +	return rc;
> +}
> +
> +static void hbl_en_port_fini(struct hbl_en_port *port)
> +{
> +	if (port->rx_wq)
> +		destroy_workqueue(port->rx_wq);
> +}
> +
> +static int hbl_en_port_init(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u32 port_idx = port->idx;
> +	char wq_name[32];
> +	int rc;
> +
> +	if (hdev->poll_enable) {
> +		memset(wq_name, 0, sizeof(wq_name));
> +		snprintf(wq_name, sizeof(wq_name) - 1, "hbl%u-port%d-rx-wq", hdev->core_dev_id,
> +			 port_idx);
> +		port->rx_wq = alloc_ordered_workqueue(wq_name, 0);
> +		if (!port->rx_wq) {
> +			dev_err(hdev->dev, "Failed to allocate Rx WQ\n");
> +			rc = -ENOMEM;
> +			goto fail;
> +		}
> +	}
> +
> +	hbl_en_ethtool_init_coalesce(port);
> +
> +	return 0;
> +
> +fail:
> +	hbl_en_port_fini(port);
> +
> +	return rc;
> +}
> +
> +static void _hbl_en_set_port_status(struct hbl_en_port *port, bool up)
> +{
> +	struct net_device *ndev = port->ndev;
> +	u32 port_idx = port->idx;
> +
> +	if (up) {
> +		netif_carrier_on(ndev);
> +		netif_wake_queue(ndev);
> +	} else {
> +		netif_carrier_off(ndev);
> +		netif_stop_queue(ndev);
> +	}
> +
> +	/* Unless link events are getting through the EQ, no need to print about link down events
> +	 * during port reset
> +	 */
> +	if (port->hdev->has_eq || up || !atomic_read(&port->in_reset))
> +		netdev_info(port->ndev, "link %s, port %d\n", up ? "up" : "down", port_idx);
> +}
> +
> +static void hbl_en_set_port_status(struct hbl_aux_dev *aux_dev, u32 port_idx, bool up)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	_hbl_en_set_port_status(port, up);
> +}
> +
> +static bool hbl_en_is_port_open(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return port->is_initialized;
> +}
> +
> +/* get the src IP as it is done in devinet_ioctl() */
> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	struct in_device *in_dev;
> +	struct in_ifaddr *ifa;
> +	int rc = 0;
> +
> +	/* for the case where no src IP is configured */
> +	*src_ip = 0;
> +
> +	/* rtnl lock should be acquired in relevant flows before taking configuration lock */
> +	if (!rtnl_is_locked()) {
> +		netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	in_dev = __in_dev_get_rtnl(ndev);
> +	if (!in_dev) {
> +		netdev_err(port->ndev, "Failed to get IPv4 struct\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	ifa = rtnl_dereference(in_dev->ifa_list);
> +
> +	while (ifa) {
> +		if (!strcmp(ndev->name, ifa->ifa_label)) {
> +			/* convert the BE to native and later on it will be
> +			 * written to the HW as LE in QPC_SET
> +			 */
> +			*src_ip = be32_to_cpu(ifa->ifa_local);
> +			break;
> +		}
> +		ifa = rtnl_dereference(ifa->ifa_next);
> +	}
> +out:
> +	return rc;
> +}
> +
> +static void hbl_en_reset_stats(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	port->net_stats.rx_packets = 0;
> +	port->net_stats.tx_packets = 0;
> +	port->net_stats.rx_bytes = 0;
> +	port->net_stats.tx_bytes = 0;
> +	port->net_stats.tx_errors = 0;
> +	atomic64_set(&port->net_stats.rx_dropped, 0);
> +	atomic64_set(&port->net_stats.tx_dropped, 0);

per-cpu variable is better?

Zhu Yanjun

> +}
> +
> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +	u32 mtu;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(ndev, "port is in reset, can't get MTU\n");
> +		return 0;
> +	}
> +
> +	mtu = ndev->mtu;
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return mtu;
> +}
> +
> +static u32 hbl_en_get_pflags(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return port->pflags;
> +}
> +
> +static void hbl_en_set_dev_lpbk(struct hbl_aux_dev *aux_dev, u32 port_idx, bool enable)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +	struct net_device *ndev = port->ndev;
> +
> +	if (enable)
> +		ndev->features |= NETIF_F_LOOPBACK;
> +	else
> +		ndev->features &= ~NETIF_F_LOOPBACK;
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static int hbl_en_port_open_locked(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct net_device *ndev = port->ndev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (port->is_initialized)
> +		return 0;
> +
> +	if (!hdev->poll_enable)
> +		netif_napi_add(ndev, &port->napi, hbl_en_napi_poll);
> +
> +	rc = aux_ops->port_hw_init(aux_dev, port_idx);
> +	if (rc) {
> +		netdev_err(ndev, "Failed to configure the HW, rc %d\n", rc);
> +		goto hw_init_fail;
> +	}
> +
> +	if (!hdev->poll_enable)
> +		napi_enable(&port->napi);
> +
> +	rc = hdev->asic_funcs.eth_port_open(port);
> +	if (rc) {
> +		netdev_err(ndev, "Failed to init H/W, rc %d\n", rc);
> +		goto port_open_fail;
> +	}
> +
> +	rc = aux_ops->update_mtu(aux_dev, port_idx, ndev->mtu);
> +	if (rc) {
> +		netdev_err(ndev, "MTU update failed, rc %d\n", rc);
> +		goto update_mtu_fail;
> +	}
> +
> +	rc = aux_ops->phy_init(aux_dev, port_idx);
> +	if (rc) {
> +		netdev_err(ndev, "PHY init failed, rc %d\n", rc);
> +		goto phy_init_fail;
> +	}
> +
> +	netif_start_queue(ndev);
> +
> +	port->is_initialized = true;
> +
> +	return 0;
> +
> +phy_init_fail:
> +	/* no need to revert the MTU change, it will be updated on next port open */
> +update_mtu_fail:
> +	hdev->asic_funcs.eth_port_close(port);
> +port_open_fail:
> +	if (!hdev->poll_enable)
> +		napi_disable(&port->napi);
> +
> +	aux_ops->port_hw_fini(aux_dev, port_idx);
> +hw_init_fail:
> +	if (!hdev->poll_enable)
> +		netif_napi_del(&port->napi);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_port_open(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +	rc = hbl_en_port_open_locked(port);
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_open(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't open it\n");
> +		return -EBUSY;
> +	}
> +
> +	rc = hbl_en_port_open(port);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static void hbl_en_port_close_locked(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (!port->is_initialized)
> +		return;
> +
> +	port->is_initialized = false;
> +
> +	/* verify that the port is marked as closed before continuing */
> +	mb();
> +
> +	/* Print if not in hard reset flow e.g. from ip cmd */
> +	if (!hdev->in_reset && netif_carrier_ok(port->ndev))
> +		netdev_info(port->ndev, "port was closed\n");
> +
> +	/* disable the PHY here so no link changes will occur from this point forward */
> +	aux_ops->phy_fini(aux_dev, port_idx);
> +
> +	/* disable Tx SW flow */
> +	netif_carrier_off(port->ndev);
> +	netif_tx_disable(port->ndev);
> +
> +	/* stop Tx/Rx HW */
> +	aux_ops->port_hw_fini(aux_dev, port_idx);
> +
> +	/* disable Tx/Rx QPs */
> +	hdev->asic_funcs.eth_port_close(port);
> +
> +	/* stop Rx SW flow */
> +	if (hdev->poll_enable) {
> +		hbl_en_rx_poll_stop(port);
> +	} else {
> +		napi_disable(&port->napi);
> +		netif_napi_del(&port->napi);
> +	}
> +
> +	/* Explicitly count the port close operations as we don't get a link event for this.
> +	 * Upon port open we receive a link event, hence no additional action required.
> +	 */
> +	aux_ops->port_toggle_count(aux_dev, port_idx);
> +}
> +
> +static void hbl_en_port_close(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +	hbl_en_port_close_locked(port);
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +static int __hbl_en_port_reset_locked(struct hbl_en_port *port)
> +{
> +	hbl_en_port_close_locked(port);
> +
> +	return hbl_en_port_open_locked(port);
> +}
> +
> +/* This function should be called after ctrl_lock was taken */
> +int hbl_en_port_reset_locked(struct hbl_aux_dev *aux_dev, u32 port_idx)
> +{
> +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> +
> +	return __hbl_en_port_reset_locked(port);
> +}
> +
> +int hbl_en_port_reset(struct hbl_en_port *port)
> +{
> +	hbl_en_port_close(port);
> +
> +	/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +	msleep(20);
> +
> +	return hbl_en_port_open(port);
> +}
> +
> +static int hbl_en_close(struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	ktime_t timeout;
> +
> +	/* Looks like the return value of this function is not checked, so we can't just return
> +	 * EBUSY if the port is under reset. We need to wait until the reset is finished and then
> +	 * close the port. Otherwise the netdev will set the port as closed although port_close()
> +	 * wasn't called. Only if we waited long enough and the reset hasn't finished, we can return
> +	 * an error without actually closing the port as it is a fatal flow anyway.
> +	 */
> +	timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +	while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		/* If this is called from unregister_netdev() then the port was already closed and
> +		 * hence we can safely return.
> +		 * We could have just check the port_open boolean, but that might hide some future
> +		 * bugs. Hence it is better to use a dedicated flag for that.
> +		 */
> +		if (READ_ONCE(hdev->in_teardown))
> +			return 0;
> +
> +		usleep_range(50, 200);
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			netdev_crit(netdev,
> +				    "Timeout while waiting for port to finish reset, can't close it\n"
> +				    );
> +			return -EBUSY;
> +		}
> +	}
> +
> +	hbl_en_port_close(port);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return 0;
> +}
> +
> +/**
> + * hbl_en_ports_stop_prepare() - stop the Rx and Tx and synchronize with other reset flows.
> + * @aux_dev: habanalabs auxiliary device structure.
> + *
> + * This function makes sure that during the reset no packets will be processed and that
> + * ndo_open/ndo_close do not open/close the ports.
> + * A hard reset might occur right after the driver was loaded, which means before the ports
> + * initialization was finished. Therefore, even if the ports are not yet open, we mark it as in
> + * reset in order to avoid races. We clear the in reset flag later on when reopening the ports.
> + */
> +static void hbl_en_ports_stop_prepare(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	ktime_t timeout;
> +	int i;
> +
> +	/* Check if the ports where initialized. If not, we shouldn't mark them as in reset because
> +	 * they will fail to get opened.
> +	 */
> +	if (!hdev->is_initialized || hdev->in_reset)
> +		return;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* This function is competing with reset from ethtool/ip, so try to take the
> +		 * in_reset atomic and if we are already in a middle of reset, wait until reset
> +		 * function is finished.
> +		 * Reset function is designed to always finish (could take up to a few seconds in
> +		 * worst case).
> +		 * We mark also closed ports as in reset so they won't be able to get opened while
> +		 * the device in under reset.
> +		 */
> +
> +		timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
> +		while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +			usleep_range(50, 200);
> +			if (ktime_compare(ktime_get(), timeout) > 0) {
> +				netdev_crit(port->ndev,
> +					    "Timeout while waiting for port %d to finish reset\n",
> +					    port->idx);
> +				break;
> +			}
> +		}
> +	}
> +
> +	hdev->in_reset = true;
> +}
> +
> +static void hbl_en_ports_stop(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +	struct hbl_en_port *port;
> +	int i;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		if (netif_running(port->ndev))
> +			hbl_en_port_close(port);
> +	}
> +}
> +
> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port is in reset, can't change MTU\n");
> +		return -EBUSY;
> +	}
> +
> +	if (netif_running(port->ndev)) {
> +		hbl_en_port_close(port);
> +
> +		/* Sleep in order to let obsolete events to be dropped before re-opening the port */
> +		msleep(20);
> +
> +		netdev->mtu = new_mtu;
> +
> +		rc = hbl_en_port_open(port);
> +		if (rc)
> +			netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
> +	} else {
> +		netdev->mtu = new_mtu;
> +	}
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +/* Swap source and destination MAC addresses */
> +static inline void swap_l2(char *buf)
> +{
> +	u16 *eth_hdr, tmp;
> +
> +	eth_hdr = (u16 *)buf;
> +	tmp = eth_hdr[0];
> +	eth_hdr[0] = eth_hdr[3];
> +	eth_hdr[3] = tmp;
> +	tmp = eth_hdr[1];
> +	eth_hdr[1] = eth_hdr[4];
> +	eth_hdr[4] = tmp;
> +	tmp = eth_hdr[2];
> +	eth_hdr[2] = eth_hdr[5];
> +	eth_hdr[5] = tmp;
> +}
> +
> +/* Swap source and destination IP addresses
> + */
> +static inline void swap_l3(char *buf)
> +{
> +	u32 tmp;
> +
> +	/* skip the Ethernet header and the IP header till source IP address */
> +	buf += ETH_HLEN + 12;
> +	tmp = ((u32 *)buf)[0];
> +	((u32 *)buf)[0] = ((u32 *)buf)[1];
> +	((u32 *)buf)[1] = tmp;
> +}
> +
> +static void do_tx_swap(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u16 *tmp_buff = (u16 *)skb->data;
> +	u32 port_idx = port->idx;
> +
> +	/* First, let's print the SKB we got */
> +	dev_dbg_ratelimited(hdev->dev,
> +			    "Send [P%d]: dst-mac:%04x%04x%04x, src-mac:%04x%04x%04x, eth-type:%04x, len:%u\n",
> +			    port_idx, swab16(tmp_buff[0]), swab16(tmp_buff[1]), swab16(tmp_buff[2]),
> +			    swab16(tmp_buff[3]), swab16(tmp_buff[4]), swab16(tmp_buff[5]),
> +			    swab16(tmp_buff[6]), skb->len);
> +
> +	/* Before submit it to HW, in case this is ipv4 pkt, swap eth/ip addresses.
> +	 * that way, we may send ECMP (ping) to ourselves in LB cases.
> +	 */
> +	swap_l2(skb->data);
> +	if (swab16(tmp_buff[6]) == ETH_P_IP)
> +		swap_l3(skb->data);
> +}
> +
> +static bool is_pkt_swap_enabled(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	return aux_ops->is_eth_lpbk(aux_dev);
> +}
> +
> +static bool is_tx_disabled(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	return aux_ops->get_mac_lpbk(aux_dev, port_idx) && !is_pkt_swap_enabled(hdev);
> +}
> +
> +static netdev_tx_t hbl_en_handle_tx(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	netdev_tx_t ret;
> +
> +	if (skb->len <= 0 || is_tx_disabled(port))
> +		goto free_skb;
> +
> +	if (skb->len > hdev->max_frm_len) {
> +		netdev_err(port->ndev, "Tx pkt size %uB exceeds maximum of %uB\n", skb->len,
> +			   hdev->max_frm_len);
> +		goto free_skb;
> +	}
> +
> +	if (is_pkt_swap_enabled(hdev))
> +		do_tx_swap(port, skb);
> +
> +	/* Pad the ethernet packets to the minimum frame size as the NIC hw doesn't do it.
> +	 * eth_skb_pad() frees the packet on failure, so just increment the dropped counter and
> +	 * return as success to avoid a retry.
> +	 */
> +	if (skb_put_padto(skb, hdev->pad_size)) {
> +		dev_err_ratelimited(hdev->dev, "Padding failed, the skb is dropped\n");
> +		atomic64_inc(&port->net_stats.tx_dropped);
> +		return NETDEV_TX_OK;
> +	}
> +
> +	ret = hdev->asic_funcs.write_pkt_to_hw(port, skb);
> +	if (ret == NETDEV_TX_OK) {
> +		port->net_stats.tx_packets++;
> +		port->net_stats.tx_bytes += skb->len;
> +	}
> +
> +	return ret;
> +
> +free_skb:
> +	dev_kfree_skb_any(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static netdev_tx_t hbl_en_start_xmit(struct sk_buff *skb, struct net_device *netdev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev;
> +
> +	hdev = port->hdev;
> +
> +	return hbl_en_handle_tx(port, skb);
> +}
> +
> +static int hbl_en_set_port_mac_loopback(struct hbl_en_port *port, bool enable)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct net_device *ndev = port->ndev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	rc = aux_ops->set_mac_lpbk(aux_dev, port_idx, enable);
> +	if (rc)
> +		return rc;
> +
> +	netdev_info(ndev, "port %u: mac loopback is %s\n", port_idx,
> +		    enable ? "enabled" : "disabled");
> +
> +	if (netif_running(ndev)) {
> +		rc = hbl_en_port_reset(port);
> +		if (rc) {
> +			netdev_err(ndev, "Failed to reset port %u, rc %d\n", port_idx, rc);
> +			return rc;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int hbl_en_set_features(struct net_device *netdev, netdev_features_t features)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	netdev_features_t changed;
> +	int rc = 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(netdev, "port %d is in reset, can't update settings", port->idx);
> +		return -EBUSY;
> +	}
> +
> +	changed = netdev->features ^ features;
> +
> +	if (changed & NETIF_F_LOOPBACK)
> +		rc = hbl_en_set_port_mac_loopback(port, !!(features & NETIF_F_LOOPBACK));
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static void hbl_en_handle_tx_timeout(struct net_device *netdev, unsigned int txqueue)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +
> +	port->net_stats.tx_errors++;
> +	atomic64_inc(&port->net_stats.tx_dropped);
> +}
> +
> +static void hbl_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(dev);
> +
> +	stats->rx_bytes = port->net_stats.rx_bytes;
> +	stats->tx_bytes = port->net_stats.tx_bytes;
> +	stats->rx_packets = port->net_stats.rx_packets;
> +	stats->tx_packets = port->net_stats.tx_packets;
> +	stats->tx_errors = port->net_stats.tx_errors;
> +	stats->tx_dropped = (u64)atomic64_read(&port->net_stats.tx_dropped);
> +	stats->rx_dropped = (u64)atomic64_read(&port->net_stats.rx_dropped);
> +}
> +
> +static const struct net_device_ops hbl_en_netdev_ops = {
> +	.ndo_open = hbl_en_open,
> +	.ndo_stop = hbl_en_close,
> +	.ndo_start_xmit = hbl_en_start_xmit,
> +	.ndo_validate_addr = eth_validate_addr,
> +	.ndo_change_mtu = hbl_en_change_mtu,
> +	.ndo_set_features = hbl_en_set_features,
> +	.ndo_get_stats64 = hbl_en_get_stats64,
> +	.ndo_tx_timeout = hbl_en_handle_tx_timeout,
> +};
> +
> +static void hbl_en_set_ops(struct net_device *ndev)
> +{
> +	ndev->netdev_ops = &hbl_en_netdev_ops;
> +	ndev->ethtool_ops = hbl_en_ethtool_get_ops(ndev);
> +#ifdef CONFIG_DCB
> +	ndev->dcbnl_ops = &hbl_en_dcbnl_ops;
> +#endif
> +}
> +
> +static int hbl_en_port_register(struct hbl_en_port *port)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	struct hbl_en_port **ptr;
> +	struct net_device *ndev;
> +	int rc;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	ndev = alloc_etherdev(sizeof(struct hbl_en_port *));
> +	if (!ndev) {
> +		dev_err(hdev->dev, "netdevice %d alloc failed\n", port_idx);
> +		return -ENOMEM;
> +	}
> +
> +	port->ndev = ndev;
> +	SET_NETDEV_DEV(ndev, &hdev->pdev->dev);
> +	ptr = netdev_priv(ndev);
> +	*ptr = port;
> +
> +	/* necessary for creating multiple interfaces */
> +	ndev->dev_port = port_idx;
> +
> +	hbl_en_set_ops(ndev);
> +
> +	ndev->watchdog_timeo = TX_TIMEOUT;
> +	ndev->min_mtu = hdev->min_raw_mtu;
> +	ndev->max_mtu = hdev->max_raw_mtu;
> +
> +	/* Add loopback capability to the device. */
> +	ndev->hw_features |= NETIF_F_LOOPBACK;
> +
> +	/* If this port was set to loopback, set it also to the ndev features */
> +	if (aux_ops->get_mac_lpbk(aux_dev, port_idx))
> +		ndev->features |= NETIF_F_LOOPBACK;
> +
> +	eth_hw_addr_set(ndev, port->mac_addr);
> +
> +	/* It's more an intelligent poll wherein, we enable the Rx completion EQE event and then
> +	 * start the poll from there.
> +	 * Inside the polling thread, we read packets from hardware and then reschedule the poll
> +	 * only if there are more packets to be processed. Else we re-enable the CQ Arm interrupt
> +	 * and exit the poll.
> +	 */
> +	if (hdev->poll_enable)
> +		hbl_en_rx_poll_trigger_init(port);
> +
> +	netif_carrier_off(ndev);
> +
> +	rc = register_netdev(ndev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Could not register netdevice %d\n", port_idx);
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	if (ndev) {
> +		free_netdev(ndev);
> +		port->ndev = NULL;
> +	}
> +
> +	return rc;
> +}
> +
> +static void dump_swap_pkt(struct hbl_en_port *port, struct sk_buff *skb)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	u16 *tmp_buff = (u16 *)skb->data;
> +	u32 port_idx = port->idx;
> +
> +	/* The SKB is ready now (before stripping-out the L2), print its content */
> +	dev_dbg_ratelimited(hdev->dev,
> +			    "Recv [P%d]: dst-mac:%04x%04x%04x, src-mac:%04x%04x%04x, eth-type:%04x, len:%u\n",
> +			    port_idx, swab16(tmp_buff[0]), swab16(tmp_buff[1]), swab16(tmp_buff[2]),
> +			    swab16(tmp_buff[3]), swab16(tmp_buff[4]), swab16(tmp_buff[5]),
> +			    swab16(tmp_buff[6]), skb->len);
> +}
> +
> +int hbl_en_handle_rx(struct hbl_en_port *port, int budget)
> +{
> +	struct hbl_en_device *hdev = port->hdev;
> +	enum hbl_en_eth_pkt_status pkt_status;
> +	struct net_device *ndev = port->ndev;
> +	int rc, pkt_count = 0;
> +	struct sk_buff *skb;
> +	void *pkt_addr;
> +	u32 pkt_size;
> +
> +	if (!netif_carrier_ok(ndev))
> +		return 0;
> +
> +	while (pkt_count < budget) {
> +		pkt_status = hdev->asic_funcs.read_pkt_from_hw(port, &pkt_addr, &pkt_size);
> +
> +		if (pkt_status == ETH_PKT_NONE)
> +			break;
> +
> +		pkt_count++;
> +
> +		if (pkt_status == ETH_PKT_DROP) {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +			continue;
> +		}
> +
> +		if (hdev->poll_enable)
> +			skb = __netdev_alloc_skb_ip_align(ndev, pkt_size, GFP_KERNEL);
> +		else
> +			skb = napi_alloc_skb(&port->napi, pkt_size);
> +
> +		if (!skb) {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +			break;
> +		}
> +
> +		skb_copy_to_linear_data(skb, pkt_addr, pkt_size);
> +		skb_put(skb, pkt_size);
> +
> +		if (is_pkt_swap_enabled(hdev))
> +			dump_swap_pkt(port, skb);
> +
> +		skb->protocol = eth_type_trans(skb, ndev);
> +
> +		/* Zero the packet buffer memory to avoid leak in case of wrong
> +		 * size is used when next packet populates the same memory
> +		 */
> +		memset(pkt_addr, 0, pkt_size);
> +
> +		/* polling is done in thread context and hence BH should be disabled */
> +		if (hdev->poll_enable)
> +			local_bh_disable();
> +
> +		rc = netif_receive_skb(skb);
> +
> +		if (hdev->poll_enable)
> +			local_bh_enable();
> +
> +		if (rc == NET_RX_SUCCESS) {
> +			port->net_stats.rx_packets++;
> +			port->net_stats.rx_bytes += pkt_size;
> +		} else {
> +			atomic64_inc(&port->net_stats.rx_dropped);
> +		}
> +	}
> +
> +	return pkt_count;
> +}
> +
> +static bool __hbl_en_rx_poll_schedule(struct hbl_en_port *port, unsigned long delay)
> +{
> +	return queue_delayed_work(port->rx_wq, &port->rx_poll_work, delay);
> +}
> +
> +static void hbl_en_rx_poll_work(struct work_struct *work)
> +{
> +	struct hbl_en_port *port = container_of(work, struct hbl_en_port, rx_poll_work.work);
> +	struct hbl_en_device *hdev = port->hdev;
> +	int pkt_count;
> +
> +	pkt_count = hbl_en_handle_rx(port, NAPI_POLL_WEIGHT);
> +
> +	/* Reschedule the poll if we have consumed budget which means we still have packets to
> +	 * process. Else re-enable the Rx IRQs and exit the work.
> +	 */
> +	if (pkt_count < NAPI_POLL_WEIGHT)
> +		hdev->asic_funcs.reenable_rx_irq(port);
> +	else
> +		__hbl_en_rx_poll_schedule(port, 0);
> +}
> +
> +/* Rx poll init and trigger routines are used in event-driven setups where
> + * Rx polling is initialized once during init or open and started/triggered by the event handler.
> + */
> +void hbl_en_rx_poll_trigger_init(struct hbl_en_port *port)
> +{
> +	INIT_DELAYED_WORK(&port->rx_poll_work, hbl_en_rx_poll_work);
> +}
> +
> +bool hbl_en_rx_poll_start(struct hbl_en_port *port)
> +{
> +	return __hbl_en_rx_poll_schedule(port, msecs_to_jiffies(1));
> +}
> +
> +void hbl_en_rx_poll_stop(struct hbl_en_port *port)
> +{
> +	cancel_delayed_work_sync(&port->rx_poll_work);
> +}
> +
> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget)
> +{
> +	struct hbl_en_port *port = container_of(napi, struct hbl_en_port, napi);
> +	struct hbl_en_device *hdev = port->hdev;
> +	int pkt_count;
> +
> +	/* exit if we are called by netpoll as we free the Tx ring via EQ (if enabled) */
> +	if (!budget)
> +		return 0;
> +
> +	pkt_count = hbl_en_handle_rx(port, budget);
> +
> +	/* If budget not fully consumed, exit the polling mode */
> +	if (pkt_count < budget) {
> +		napi_complete_done(napi, pkt_count);
> +		hdev->asic_funcs.reenable_rx_irq(port);
> +	}
> +
> +	return pkt_count;
> +}
> +
> +static void hbl_en_port_unregister(struct hbl_en_port *port)
> +{
> +	struct net_device *ndev = port->ndev;
> +
> +	unregister_netdev(ndev);
> +	free_netdev(ndev);
> +	port->ndev = NULL;
> +}
> +
> +static int hbl_en_set_asic_funcs(struct hbl_en_device *hdev)
> +{
> +	switch (hdev->asic_type) {
> +	case HBL_ASIC_GAUDI2:
> +	default:
> +		dev_err(hdev->dev, "Unrecognized ASIC type %d\n", hdev->asic_type);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static void hbl_en_handle_eqe(struct hbl_aux_dev *aux_dev, u32 port, struct hbl_cn_eqe *eqe)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	hdev->asic_funcs.handle_eqe(aux_dev, port, eqe);
> +}
> +
> +static void hbl_en_set_aux_ops(struct hbl_en_device *hdev, bool enable)
> +{
> +	struct hbl_en_aux_ops *aux_ops = hdev->aux_dev->aux_ops;
> +
> +	if (enable) {
> +		aux_ops->ports_reopen = hbl_en_ports_reopen;
> +		aux_ops->ports_stop_prepare = hbl_en_ports_stop_prepare;
> +		aux_ops->ports_stop = hbl_en_ports_stop;
> +		aux_ops->set_port_status = hbl_en_set_port_status;
> +		aux_ops->is_port_open = hbl_en_is_port_open;
> +		aux_ops->get_src_ip = hbl_en_get_src_ip;
> +		aux_ops->reset_stats = hbl_en_reset_stats;
> +		aux_ops->get_mtu = hbl_en_get_mtu;
> +		aux_ops->get_pflags = hbl_en_get_pflags;
> +		aux_ops->set_dev_lpbk = hbl_en_set_dev_lpbk;
> +		aux_ops->handle_eqe = hbl_en_handle_eqe;
> +	} else {
> +		aux_ops->ports_reopen = NULL;
> +		aux_ops->ports_stop_prepare = NULL;
> +		aux_ops->ports_stop = NULL;
> +		aux_ops->set_port_status = NULL;
> +		aux_ops->is_port_open = NULL;
> +		aux_ops->get_src_ip = NULL;
> +		aux_ops->reset_stats = NULL;
> +		aux_ops->get_mtu = NULL;
> +		aux_ops->get_pflags = NULL;
> +		aux_ops->set_dev_lpbk = NULL;
> +		aux_ops->handle_eqe = NULL;
> +	}
> +}
> +
> +int hbl_en_dev_init(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_asic_funcs *asic_funcs = &hdev->asic_funcs;
> +	struct hbl_en_port *port;
> +	int rc, i, port_cnt = 0;
> +
> +	/* must be called before the call to dev_init() */
> +	rc = hbl_en_set_asic_funcs(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to set aux ops\n");
> +		return rc;
> +	}
> +
> +	rc = asic_funcs->dev_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "device init failed\n");
> +		return rc;
> +	}
> +
> +	/* init the function pointers here before calling hbl_en_port_register which sets up
> +	 * net_device_ops, and its ops might start getting called.
> +	 * If any failure is encountered, these will be made NULL and the core driver won't call
> +	 * them.
> +	 */
> +	hbl_en_set_aux_ops(hdev, true);
> +
> +	/* Port register depends on the above initialization so it must be called here and not
> +	 * before that.
> +	 */
> +	for (i = 0; i < hdev->max_num_of_ports; i++, port_cnt++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		rc = hbl_en_port_init(port);
> +		if (rc) {
> +			dev_err(hdev->dev, "port init failed\n");
> +			goto unregister_ports;
> +		}
> +
> +		rc = hbl_en_port_register(port);
> +		if (rc) {
> +			dev_err(hdev->dev, "port register failed\n");
> +
> +			hbl_en_port_fini(port);
> +			goto unregister_ports;
> +		}
> +	}
> +
> +	hdev->is_initialized = true;
> +
> +	return 0;
> +
> +unregister_ports:
> +	for (i = 0; i < port_cnt; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		hbl_en_port_unregister(port);
> +		hbl_en_port_fini(port);
> +	}
> +
> +	hbl_en_set_aux_ops(hdev, false);
> +
> +	asic_funcs->dev_fini(hdev);
> +
> +	return rc;
> +}
> +
> +void hbl_en_dev_fini(struct hbl_en_device *hdev)
> +{
> +	struct hbl_en_asic_funcs *asic_funcs = &hdev->asic_funcs;
> +	struct hbl_en_port *port;
> +	int i;
> +
> +	hdev->in_teardown = true;
> +
> +	if (!hdev->is_initialized)
> +		return;
> +
> +	hdev->is_initialized = false;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +
> +		/* It could be this cleanup flow is called after a failed init flow.
> +		 * Hence we need to check that we indeed have a netdev to unregister.
> +		 */
> +		if (!port->ndev)
> +			continue;
> +
> +		hbl_en_port_unregister(port);
> +		hbl_en_port_fini(port);
> +	}
> +
> +	hbl_en_set_aux_ops(hdev, false);
> +
> +	asic_funcs->dev_fini(hdev);
> +}
> +
> +dma_addr_t hbl_en_dma_map(struct hbl_en_device *hdev, void *addr, int len)
> +{
> +	dma_addr_t dma_addr;
> +
> +	if (hdev->dma_map_support)
> +		dma_addr = dma_map_single(&hdev->pdev->dev, addr, len, DMA_TO_DEVICE);
> +	else
> +		dma_addr = virt_to_phys(addr);
> +
> +	return dma_addr;
> +}
> +
> +void hbl_en_dma_unmap(struct hbl_en_device *hdev, dma_addr_t dma_addr, int len)
> +{
> +	if (hdev->dma_map_support)
> +		dma_unmap_single(&hdev->pdev->dev, dma_addr, len, DMA_TO_DEVICE);
> +}
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
> new file mode 100644
> index 000000000000..15504c1f3cfb
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en.h
> @@ -0,0 +1,206 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#ifndef HABANALABS_EN_H_
> +#define HABANALABS_EN_H_
> +
> +#include <linux/net/intel/cn.h>
> +
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
> +
> +#define HBL_EN_NAME			"habanalabs_en"
> +
> +#define HBL_EN_PORT(aux_dev, idx)	(&(((struct hbl_en_device *)(aux_dev)->priv)->ports[(idx)]))
> +
> +#define hbl_netdev_priv(ndev) \
> +({ \
> +	typecheck(struct net_device *, ndev); \
> +	*(struct hbl_en_port **)netdev_priv(ndev); \
> +})
> +
> +/**
> + * enum hbl_en_eth_pkt_status - status of Rx Ethernet packet.
> + * ETH_PKT_OK: packet was received successfully.
> + * ETH_PKT_DROP: packet should be dropped.
> + * ETH_PKT_NONE: no available packet.
> + */
> +enum hbl_en_eth_pkt_status {
> +	ETH_PKT_OK,
> +	ETH_PKT_DROP,
> +	ETH_PKT_NONE
> +};
> +
> +/**
> + * struct hbl_en_net_stats - stats of Ethernet interface.
> + * rx_packets: number of packets received.
> + * tx_packets: number of packets sent.
> + * rx_bytes: total bytes of data received.
> + * tx_bytes: total bytes of data sent.
> + * tx_errors: number of errors in the TX.
> + * rx_dropped: number of packets dropped by the RX.
> + * tx_dropped: number of packets dropped by the TX.
> + */
> +struct hbl_en_net_stats {
> +	u64 rx_packets;
> +	u64 tx_packets;
> +	u64 rx_bytes;
> +	u64 tx_bytes;
> +	u64 tx_errors;
> +	atomic64_t rx_dropped;
> +	atomic64_t tx_dropped;
> +};
> +
> +/**
> + * struct hbl_en_port - manage port common structure.
> + * @hdev: habanalabs Ethernet device structure.
> + * @ndev: network device.
> + * @rx_wq: WQ for Rx poll when we cannot schedule NAPI poll.
> + * @mac_addr: HW MAC addresses.
> + * @asic_specific: ASIC specific port structure.
> + * @napi: New API structure.
> + * @rx_poll_work: Rx work for polling mode.
> + * @net_stats: statistics of the ethernet interface.
> + * @in_reset: true if the NIC was marked as in reset, false otherwise. Used to avoid an additional
> + *            stopping of the NIC if a hard reset was re-initiated.
> + * @pflags: ethtool private flags bit mask.
> + * @idx: index of this specific port.
> + * @rx_max_coalesced_frames: Maximum number of packets to receive before an RX interrupt.
> + * @tx_max_coalesced_frames: Maximum number of packets to be sent before a TX interrupt.
> + * @rx_coalesce_usecs: How many usecs to delay an RX interrupt after a packet arrives.
> + * @is_initialized: true if the port H/W is initialized, false otherwise.
> + * @pfc_enable: true if this port supports Priority Flow Control, false otherwise.
> + * @auto_neg_enable: is autoneg enabled.
> + * @auto_neg_resolved: was autoneg phase finished successfully.
> + */
> +struct hbl_en_port {
> +	struct hbl_en_device *hdev;
> +	struct net_device *ndev;
> +	struct workqueue_struct *rx_wq;
> +	char *mac_addr;
> +	void *asic_specific;
> +	struct napi_struct napi;
> +	struct delayed_work rx_poll_work;
> +	struct hbl_en_net_stats net_stats;
> +	atomic_t in_reset;
> +	u32 pflags;
> +	u32 idx;
> +	u32 rx_max_coalesced_frames;
> +	u32 tx_max_coalesced_frames;
> +	u16 rx_coalesce_usecs;
> +	u8 is_initialized;
> +	u8 pfc_enable;
> +	u8 auto_neg_enable;
> +	u8 auto_neg_resolved;
> +};
> +
> +/**
> + * struct hbl_en_asic_funcs - ASIC specific Ethernet functions.
> + * @dev_init: device init.
> + * @dev_fini: device cleanup.
> + * @reenable_rx_irq: re-enable Rx interrupts.
> + * @eth_port_open: initialize and open the Ethernet port.
> + * @eth_port_close: close the Ethernet port.
> + * @write_pkt_to_hw: write skb to HW.
> + * @read_pkt_from_hw: read pkt from HW.
> + * @get_pfc_cnts: get PFC counters.
> + * @set_coalesce: set Tx/Rx coalesce config in HW.
> + * @get_rx_ring size: get max number of elements the Rx ring can contain.
> + * @handle_eqe: Handle a received event.
> + */
> +struct hbl_en_asic_funcs {
> +	int (*dev_init)(struct hbl_en_device *hdev);
> +	void (*dev_fini)(struct hbl_en_device *hdev);
> +	void (*reenable_rx_irq)(struct hbl_en_port *port);
> +	int (*eth_port_open)(struct hbl_en_port *port);
> +	void (*eth_port_close)(struct hbl_en_port *port);
> +	netdev_tx_t (*write_pkt_to_hw)(struct hbl_en_port *port, struct sk_buff *skb);
> +	int (*read_pkt_from_hw)(struct hbl_en_port *port, void **pkt_addr, u32 *pkt_size);
> +	void (*get_pfc_cnts)(struct hbl_en_port *port, void *ptr);
> +	int (*set_coalesce)(struct hbl_en_port *port);
> +	int (*get_rx_ring_size)(struct hbl_en_port *port);
> +	void (*handle_eqe)(struct hbl_aux_dev *aux_dev, u32 port_idx, struct hbl_cn_eqe *eqe);
> +};
> +
> +/**
> + * struct hbl_en_device - habanalabs Ethernet device structure.
> + * @pdev: pointer to PCI device.
> + * @dev: related kernel basic device structure.
> + * @ports: array of all ports manage common structures.
> + * @aux_dev: pointer to auxiliary device.
> + * @asic_specific: ASIC specific device structure.
> + * @fw_ver: FW version.
> + * @qsfp_eeprom: QSFPD EEPROM info.
> + * @mac_addr: array of all MAC addresses.
> + * @asic_funcs: ASIC specific Ethernet functions.
> + * @asic_type: ASIC specific type.
> + * @ports_mask: mask of available ports.
> + * @auto_neg_mask: mask of port with Autonegotiation enabled.
> + * @port_reset_timeout: max time in seconds for a port reset flow to finish.
> + * @pending_reset_long_timeout: long timeout for pending hard reset to finish in seconds.
> + * @max_frm_len: maximum allowed frame length.
> + * @raw_elem_size: size of element in raw buffers.
> + * @max_raw_mtu: maximum MTU size for raw packets.
> + * @min_raw_mtu: minimum MTU size for raw packets.
> + * @pad_size: the pad size in bytes for the skb to transmit.
> + * @core_dev_id: core device ID.
> + * @max_num_of_ports: max number of available ports;
> + * @in_reset: is the entire NIC currently under reset.
> + * @poll_enable: Enable Rx polling rather than IRQ + NAPI.
> + * @in_teardown: true if the NIC is in teardown (during device remove).
> + * @is_initialized: was the device initialized successfully.
> + * @has_eq: true if event queue is supported.
> + * @dma_map_support: HW supports DMA mapping.
> + */
> +struct hbl_en_device {
> +	struct pci_dev *pdev;
> +	struct device *dev;
> +	struct hbl_en_port *ports;
> +	struct hbl_aux_dev *aux_dev;
> +	void *asic_specific;
> +	char *fw_ver;
> +	char *qsfp_eeprom;
> +	char *mac_addr;
> +	struct hbl_en_asic_funcs asic_funcs;
> +	enum hbl_cn_asic_type asic_type;
> +	u64 ports_mask;
> +	u64 auto_neg_mask;
> +	u32 port_reset_timeout;
> +	u32 pending_reset_long_timeout;
> +	u32 max_frm_len;
> +	u32 raw_elem_size;
> +	u16 max_raw_mtu;
> +	u16 min_raw_mtu;
> +	u16 pad_size;
> +	u16 core_dev_id;
> +	u8 max_num_of_ports;
> +	u8 in_reset;
> +	u8 poll_enable;
> +	u8 in_teardown;
> +	u8 is_initialized;
> +	u8 has_eq;
> +	u8 dma_map_support;
> +};
> +
> +int hbl_en_dev_init(struct hbl_en_device *hdev);
> +void hbl_en_dev_fini(struct hbl_en_device *hdev);
> +
> +const struct ethtool_ops *hbl_en_ethtool_get_ops(struct net_device *ndev);
> +void hbl_en_ethtool_init_coalesce(struct hbl_en_port *port);
> +
> +extern const struct dcbnl_rtnl_ops hbl_en_dcbnl_ops;
> +
> +bool hbl_en_rx_poll_start(struct hbl_en_port *port);
> +void hbl_en_rx_poll_stop(struct hbl_en_port *port);
> +void hbl_en_rx_poll_trigger_init(struct hbl_en_port *port);
> +int hbl_en_port_reset(struct hbl_en_port *port);
> +int hbl_en_port_reset_locked(struct hbl_aux_dev *aux_dev, u32 port_idx);
> +int hbl_en_handle_rx(struct hbl_en_port *port, int budget);
> +dma_addr_t hbl_en_dma_map(struct hbl_en_device *hdev, void *addr, int len);
> +void hbl_en_dma_unmap(struct hbl_en_device *hdev, dma_addr_t dma_addr, int len);
> +
> +#endif /* HABANALABS_EN_H_ */
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
> new file mode 100644
> index 000000000000..5d718579a2b6
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_dcbnl.c
> @@ -0,0 +1,101 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +
> +#define PFC_PRIO_MASK_ALL	GENMASK(HBL_EN_PFC_PRIO_NUM - 1, 0)
> +#define PFC_PRIO_MASK_NONE	0
> +
> +#ifdef CONFIG_DCB
> +static int hbl_en_dcbnl_ieee_getpfc(struct net_device *netdev, struct ieee_pfc *pfc)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_device *hdev;
> +	u32 port_idx;
> +
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_dbg_ratelimited(hdev->dev, "port %d is in reset, can't get PFC", port_idx);
> +		return -EBUSY;
> +	}
> +
> +	pfc->pfc_en = port->pfc_enable ? PFC_PRIO_MASK_ALL : PFC_PRIO_MASK_NONE;
> +	pfc->pfc_cap = HBL_EN_PFC_PRIO_NUM;
> +
> +	hdev->asic_funcs.get_pfc_cnts(port, pfc);
> +
> +	atomic_set(&port->in_reset, 0);
> +
> +	return 0;
> +}
> +
> +static int hbl_en_dcbnl_ieee_setpfc(struct net_device *netdev, struct ieee_pfc *pfc)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(netdev);
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	u8 curr_pfc_en;
> +	u32 port_idx;
> +	int rc = 0;
> +
> +	hdev = port->hdev;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	port_idx = port->idx;
> +
> +	if (pfc->pfc_en & ~PFC_PRIO_MASK_ALL) {
> +		dev_dbg_ratelimited(hdev->dev, "PFC supports %d priorities only, port %d\n",
> +				    HBL_EN_PFC_PRIO_NUM, port_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (pfc->pfc_en != PFC_PRIO_MASK_NONE && pfc->pfc_en != PFC_PRIO_MASK_ALL) {
> +		dev_dbg_ratelimited(hdev->dev,
> +				    "PFC should be enabled/disabled on all priorities, port %d\n",
> +				    port_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_dbg_ratelimited(hdev->dev, "port %d is in reset, can't set PFC", port_idx);
> +		return -EBUSY;
> +	}
> +
> +	curr_pfc_en = port->pfc_enable ? PFC_PRIO_MASK_ALL : PFC_PRIO_MASK_NONE;
> +
> +	if (pfc->pfc_en == curr_pfc_en)
> +		goto out;
> +
> +	port->pfc_enable = !port->pfc_enable;
> +
> +	rc = aux_ops->set_pfc(aux_dev, port_idx, port->pfc_enable);
> +
> +out:
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static u8 hbl_en_dcbnl_getdcbx(struct net_device *netdev)
> +{
> +	return DCB_CAP_DCBX_HOST | DCB_CAP_DCBX_VER_IEEE;
> +}
> +
> +static u8 hbl_en_dcbnl_setdcbx(struct net_device *netdev, u8 mode)
> +{
> +	return !(mode == (DCB_CAP_DCBX_HOST | DCB_CAP_DCBX_VER_IEEE));
> +}
> +
> +const struct dcbnl_rtnl_ops hbl_en_dcbnl_ops = {
> +	.ieee_getpfc	= hbl_en_dcbnl_ieee_getpfc,
> +	.ieee_setpfc	= hbl_en_dcbnl_ieee_setpfc,
> +	.getdcbx	= hbl_en_dcbnl_getdcbx,
> +	.setdcbx	= hbl_en_dcbnl_setdcbx
> +};
> +#endif
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
> new file mode 100644
> index 000000000000..23a87d36ded5
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_drv.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#define pr_fmt(fmt)		"habanalabs_en: " fmt
> +
> +#include "hbl_en.h"
> +
> +#include <linux/module.h>
> +#include <linux/auxiliary_bus.h>
> +
> +#define HBL_DRIVER_AUTHOR	"HabanaLabs Kernel Driver Team"
> +
> +#define HBL_DRIVER_DESC		"HabanaLabs AI accelerators Ethernet driver"
> +
> +MODULE_AUTHOR(HBL_DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(HBL_DRIVER_DESC);
> +MODULE_LICENSE("GPL");
> +
> +static bool poll_enable;
> +
> +module_param(poll_enable, bool, 0444);
> +MODULE_PARM_DESC(poll_enable,
> +		 "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
> +
> +static int hdev_init(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_aux_data *aux_data = aux_dev->aux_data;
> +	struct hbl_en_port *ports, *port;
> +	struct hbl_en_device *hdev;
> +	int rc, i;
> +
> +	hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
> +	if (!hdev)
> +		return -ENOMEM;
> +
> +	ports = kcalloc(aux_data->max_num_of_ports, sizeof(*ports), GFP_KERNEL);
> +	if (!ports) {
> +		rc = -ENOMEM;
> +		goto ports_alloc_fail;
> +	}
> +
> +	aux_dev->priv = hdev;
> +	hdev->aux_dev = aux_dev;
> +	hdev->ports = ports;
> +	hdev->pdev = aux_data->pdev;
> +	hdev->dev = aux_data->dev;
> +	hdev->ports_mask = aux_data->ports_mask;
> +	hdev->auto_neg_mask = aux_data->auto_neg_mask;
> +	hdev->max_num_of_ports = aux_data->max_num_of_ports;
> +	hdev->core_dev_id = aux_data->id;
> +	hdev->fw_ver = aux_data->fw_ver;
> +	hdev->qsfp_eeprom = aux_data->qsfp_eeprom;
> +	hdev->asic_type = aux_data->asic_type;
> +	hdev->pending_reset_long_timeout = aux_data->pending_reset_long_timeout;
> +	hdev->max_frm_len = aux_data->max_frm_len;
> +	hdev->raw_elem_size = aux_data->raw_elem_size;
> +	hdev->max_raw_mtu = aux_data->max_raw_mtu;
> +	hdev->min_raw_mtu = aux_data->min_raw_mtu;
> +	hdev->pad_size = ETH_ZLEN;
> +	hdev->has_eq = aux_data->has_eq;
> +	hdev->dma_map_support = true;
> +	hdev->poll_enable = poll_enable;
> +
> +	for (i = 0; i < hdev->max_num_of_ports; i++) {
> +		if (!(hdev->ports_mask & BIT(i)))
> +			continue;
> +
> +		port = &hdev->ports[i];
> +		port->hdev = hdev;
> +		port->idx = i;
> +		port->pfc_enable = true;
> +		port->pflags = PFLAGS_PCS_LINK_CHECK | PFLAGS_PHY_AUTO_NEG_LPBK;
> +		port->mac_addr = aux_data->mac_addr[i];
> +		port->auto_neg_enable = !!(aux_data->auto_neg_mask & BIT(i));
> +	}
> +
> +	return 0;
> +
> +ports_alloc_fail:
> +	kfree(hdev);
> +
> +	return rc;
> +}
> +
> +static void hdev_fini(struct hbl_aux_dev *aux_dev)
> +{
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	kfree(hdev->ports);
> +	kfree(hdev);
> +	aux_dev->priv = NULL;
> +}
> +
> +static const struct auxiliary_device_id hbl_en_id_table[] = {
> +	{ .name = "habanalabs_cn.en", },
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, hbl_en_id_table);
> +
> +static int hbl_en_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
> +{
> +	struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
> +	struct hbl_en_aux_ops *aux_ops = aux_dev->aux_ops;
> +	struct hbl_en_device *hdev;
> +	ktime_t timeout;
> +	int rc;
> +
> +	rc = hdev_init(aux_dev);
> +	if (rc) {
> +		dev_err(&aux_dev->adev.dev, "Failed to init hdev\n");
> +		return -EIO;
> +	}
> +
> +	hdev = aux_dev->priv;
> +
> +	/* don't allow module unloading while it is attached */
> +	if (!try_module_get(THIS_MODULE)) {
> +		dev_err(hdev->dev, "Failed to increment %s module refcount\n", HBL_EN_NAME);
> +		rc = -EIO;
> +		goto module_get_err;
> +	}
> +
> +	timeout = ktime_add_ms(ktime_get(), hdev->pending_reset_long_timeout * MSEC_PER_SEC);
> +	while (1) {
> +		aux_ops->hw_access_lock(aux_dev);
> +
> +		/* if the device is operational, proceed to actual init while holding the lock in
> +		 * order to prevent concurrent hard reset
> +		 */
> +		if (aux_ops->device_operational(aux_dev))
> +			break;
> +
> +		aux_ops->hw_access_unlock(aux_dev);
> +
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			dev_err(hdev->dev, "Timeout while waiting for hard reset to finish\n");
> +			rc = -EBUSY;
> +			goto timeout_err;
> +		}
> +
> +		dev_notice_once(hdev->dev, "Waiting for hard reset to finish before probing en\n");
> +
> +		msleep_interruptible(MSEC_PER_SEC);
> +	}
> +
> +	rc = hbl_en_dev_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to init en device\n");
> +		goto dev_init_err;
> +	}
> +
> +	aux_ops->hw_access_unlock(aux_dev);
> +
> +	return 0;
> +
> +dev_init_err:
> +	aux_ops->hw_access_unlock(aux_dev);
> +timeout_err:
> +	module_put(THIS_MODULE);
> +module_get_err:
> +	hdev_fini(aux_dev);
> +
> +	return rc;
> +}
> +
> +/* This function can be called only from the CN driver when deleting the aux bus, because we
> + * incremented the module refcount on probing. Hence no need to protect here from hard reset.
> + */
> +static void hbl_en_remove(struct auxiliary_device *adev)
> +{
> +	struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
> +	struct hbl_en_device *hdev = aux_dev->priv;
> +
> +	if (!hdev)
> +		return;
> +
> +	hbl_en_dev_fini(hdev);
> +
> +	/* allow module unloading as now it is detached */
> +	module_put(THIS_MODULE);
> +
> +	hdev_fini(aux_dev);
> +}
> +
> +static struct auxiliary_driver hbl_en_driver = {
> +	.name = "eth",
> +	.probe = hbl_en_probe,
> +	.remove = hbl_en_remove,
> +	.id_table = hbl_en_id_table,
> +};
> +
> +static int __init hbl_en_init(void)
> +{
> +	pr_info("loading driver\n");
> +
> +	return auxiliary_driver_register(&hbl_en_driver);
> +}
> +
> +static void __exit hbl_en_exit(void)
> +{
> +	auxiliary_driver_unregister(&hbl_en_driver);
> +
> +	pr_info("driver removed\n");
> +}
> +
> +module_init(hbl_en_init);
> +module_exit(hbl_en_exit);
> diff --git a/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> new file mode 100644
> index 000000000000..1d14d283409b
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/hbl_en/common/hbl_en_ethtool.c
> @@ -0,0 +1,452 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2020-2024 HabanaLabs, Ltd.
> + * Copyright (C) 2023-2024, Intel Corporation.
> + * All Rights Reserved.
> + */
> +
> +#include "hbl_en.h"
> +#include <linux/ethtool.h>
> +
> +#define RX_COALESCED_FRAMES_MIN		1
> +#define TX_COALESCED_FRAMES_MIN		1
> +#define TX_COALESCED_FRAMES_MAX		10
> +
> +static const char pflags_str[][ETH_GSTRING_LEN] = {
> +	"pcs-link-check",
> +	"phy-auto-neg-lpbk",
> +};
> +
> +#define NIC_STAT(m) {#m, offsetof(struct hbl_en_port, net_stats.m)}
> +
> +static struct hbl_cn_stat netdev_eth_stats[] = {
> +	NIC_STAT(rx_packets),
> +	NIC_STAT(tx_packets),
> +	NIC_STAT(rx_bytes),
> +	NIC_STAT(tx_bytes),
> +	NIC_STAT(tx_errors),
> +	NIC_STAT(rx_dropped),
> +	NIC_STAT(tx_dropped)
> +};
> +
> +static size_t pflags_str_len = ARRAY_SIZE(pflags_str);
> +static size_t netdev_eth_stats_len = ARRAY_SIZE(netdev_eth_stats);
> +
> +static void hbl_en_ethtool_get_drvinfo(struct net_device *ndev, struct ethtool_drvinfo *drvinfo)
> +{
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +
> +	strscpy(drvinfo->driver, HBL_EN_NAME, sizeof(drvinfo->driver));
> +	strscpy(drvinfo->fw_version, hdev->fw_ver, sizeof(drvinfo->fw_version));
> +	strscpy(drvinfo->bus_info, pci_name(hdev->pdev), sizeof(drvinfo->bus_info));
> +}
> +
> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
> +{
> +	modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
> +	modinfo->type = ETH_MODULE_SFF_8636;
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_get_module_eeprom(struct net_device *ndev, struct ethtool_eeprom *ee,
> +					    u8 *data)
> +{
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 first, last, len;
> +	u8 *qsfp_eeprom;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	qsfp_eeprom = hdev->qsfp_eeprom;
> +
> +	if (ee->len == 0)
> +		return -EINVAL;
> +
> +	first = ee->offset;
> +	last = ee->offset + ee->len;
> +
> +	if (first < ETH_MODULE_SFF_8636_LEN) {
> +		len = min_t(unsigned int, last, ETH_MODULE_SFF_8079_LEN);
> +		len -= first;
> +
> +		memcpy(data, qsfp_eeprom + first, len);
> +	}
> +
> +	return 0;
> +}
> +
> +static u32 hbl_en_ethtool_get_priv_flags(struct net_device *ndev)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +
> +	return port->pflags;
> +}
> +
> +static int hbl_en_ethtool_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +
> +	port->pflags = priv_flags;
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_get_link_ksettings(struct net_device *ndev,
> +					     struct ethtool_link_ksettings *cmd)
> +{
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	u32 port_idx, speed;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	speed = aux_ops->get_speed(aux_dev, port_idx);
> +
> +	cmd->base.speed = speed;
> +	cmd->base.duplex = DUPLEX_FULL;
> +
> +	ethtool_link_ksettings_zero_link_mode(cmd, supported);
> +	ethtool_link_ksettings_zero_link_mode(cmd, advertising);
> +
> +	switch (speed) {
> +	case SPEED_100000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseLR4_ER4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseSR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseKR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseLR4_ER4_Full);
> +
> +		cmd->base.port = PORT_FIBRE;
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, Backplane);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Backplane);
> +		break;
> +	case SPEED_50000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseKR2_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseSR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseCR2_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseKR2_Full);
> +		break;
> +	case SPEED_25000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 25000baseCR_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 25000baseCR_Full);
> +		break;
> +	case SPEED_200000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseKR4_Full);
> +		break;
> +	case SPEED_400000:
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseKR4_Full);
> +
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseCR4_Full);
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseKR4_Full);
> +		break;
> +	default:
> +		netdev_err(port->ndev, "unknown speed %d\n", speed);
> +		return -EFAULT;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
> +
> +	if (port->auto_neg_enable) {
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
> +		cmd->base.autoneg = AUTONEG_ENABLE;
> +		if (port->auto_neg_resolved)
> +			ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
> +	} else {
> +		cmd->base.autoneg = AUTONEG_DISABLE;
> +	}
> +
> +	ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
> +
> +	if (port->pfc_enable)
> +		ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);
> +
> +	return 0;
> +}
> +
> +/* only autoneg is mutable */
> +static bool check_immutable_ksettings(const struct ethtool_link_ksettings *old_cmd,
> +				      const struct ethtool_link_ksettings *new_cmd)
> +{
> +	return (old_cmd->base.speed == new_cmd->base.speed) &&
> +	       (old_cmd->base.duplex == new_cmd->base.duplex) &&
> +	       (old_cmd->base.port == new_cmd->base.port) &&
> +	       (old_cmd->base.phy_address == new_cmd->base.phy_address) &&
> +	       (old_cmd->base.eth_tp_mdix_ctrl == new_cmd->base.eth_tp_mdix_ctrl) &&
> +	       bitmap_equal(old_cmd->link_modes.advertising, new_cmd->link_modes.advertising,
> +			    __ETHTOOL_LINK_MODE_MASK_NBITS);
> +}
> +
> +static int
> +hbl_en_ethtool_set_link_ksettings(struct net_device *ndev, const struct ethtool_link_ksettings *cmd)
> +{
> +	struct ethtool_link_ksettings curr_cmd;
> +	struct hbl_en_device *hdev;
> +	struct hbl_en_port *port;
> +	bool auto_neg;
> +	u32 port_idx;
> +	int rc;
> +
> +	port = hbl_netdev_priv(ndev);
> +	hdev = port->hdev;
> +	port_idx = port->idx;
> +
> +	memset(&curr_cmd, 0, sizeof(struct ethtool_link_ksettings));
> +
> +	rc = hbl_en_ethtool_get_link_ksettings(ndev, &curr_cmd);
> +	if (rc)
> +		return rc;
> +
> +	if (!check_immutable_ksettings(&curr_cmd, cmd))
> +		return -EOPNOTSUPP;
> +
> +	auto_neg = cmd->base.autoneg == AUTONEG_ENABLE;
> +
> +	if (port->auto_neg_enable == auto_neg)
> +		return 0;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(port->ndev, "port is in reset, can't update settings\n");
> +		return -EBUSY;
> +	}
> +
> +	if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
> +		netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +	port->auto_neg_enable = auto_neg;
> +
> +	if (netif_running(port->ndev)) {
> +		rc = hbl_en_port_reset(port);
> +		if (rc)
> +			netdev_err(port->ndev, "Failed to reset port for settings update, rc %d\n",
> +				   rc);
> +	}
> +
> +out:
> +	atomic_set(&port->in_reset, 0);
> +
> +	return rc;
> +}
> +
> +static int hbl_en_ethtool_get_sset_count(struct net_device *ndev, int sset)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	switch (sset) {
> +	case ETH_SS_STATS:
> +		return netdev_eth_stats_len + aux_ops->get_cnts_num(aux_dev, port_idx);
> +	case ETH_SS_PRIV_FLAGS:
> +		return pflags_str_len;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +static void hbl_en_ethtool_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int i;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	switch (stringset) {
> +	case ETH_SS_STATS:
> +		for (i = 0; i < netdev_eth_stats_len; i++)
> +			ethtool_puts(&data, netdev_eth_stats[i].str);
> +
> +		aux_ops->get_cnts_names(aux_dev, port_idx, data);
> +		break;
> +	case ETH_SS_PRIV_FLAGS:
> +		for (i = 0; i < pflags_str_len; i++)
> +			ethtool_puts(&data, pflags_str[i]);
> +		break;
> +	}
> +}
> +
> +static void hbl_en_ethtool_get_ethtool_stats(struct net_device *ndev,
> +					     __always_unused struct ethtool_stats *stats, u64 *data)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	struct hbl_en_device *hdev;
> +	u32 port_idx;
> +	char *p;
> +	int i;
> +
> +	hdev = port->hdev;
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +	port_idx = port->idx;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		dev_info_ratelimited(hdev->dev, "port %d is in reset, can't get ethtool stats",
> +				     port_idx);
> +		return;
> +	}
> +
> +	/* Even though the Ethernet Rx/Tx flow might update the stats in parallel, there is not an
> +	 * absolute need for synchronisation. This is because, missing few counts of these stats is
> +	 * much better than adding a lock to synchronize and increase the overhead of the Rx/Tx
> +	 * flows. In worst case scenario, reader will get stale stats. He will receive updated
> +	 * stats in next read.
> +	 */
> +	for (i = 0; i < netdev_eth_stats_len; i++) {
> +		p = (char *)port + netdev_eth_stats[i].lo_offset;
> +		data[i] = *(u32 *)p;
> +	}
> +
> +	data += i;
> +
> +	aux_ops->get_cnts_values(aux_dev, port_idx, data);
> +
> +	atomic_set(&port->in_reset, 0);
> +}
> +
> +static int hbl_en_ethtool_get_coalesce(struct net_device *ndev,
> +				       struct ethtool_coalesce *coal,
> +				       struct kernel_ethtool_coalesce *kernel_coal,
> +				       struct netlink_ext_ack *extack)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +
> +	coal->tx_max_coalesced_frames = port->tx_max_coalesced_frames;
> +	coal->rx_coalesce_usecs = port->rx_coalesce_usecs;
> +	coal->rx_max_coalesced_frames = port->rx_max_coalesced_frames;
> +
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +	return 0;
> +}
> +
> +static int hbl_en_ethtool_set_coalesce(struct net_device *ndev,
> +				       struct ethtool_coalesce *coal,
> +				       struct kernel_ethtool_coalesce *kernel_coal,
> +				       struct netlink_ext_ack *extack)
> +{
> +	struct hbl_en_port *port = hbl_netdev_priv(ndev);
> +	struct hbl_en_device *hdev = port->hdev;
> +	struct hbl_en_aux_ops *aux_ops;
> +	struct hbl_aux_dev *aux_dev;
> +	u32 port_idx = port->idx;
> +	int rc, rx_ring_size;
> +
> +	aux_dev = hdev->aux_dev;
> +	aux_ops = aux_dev->aux_ops;
> +
> +	if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> +		netdev_err(port->ndev, "port is in reset, can't update settings\n");
> +		return -EBUSY;
> +	}
> +
> +	if (coal->tx_max_coalesced_frames < TX_COALESCED_FRAMES_MIN ||
> +	    coal->tx_max_coalesced_frames > TX_COALESCED_FRAMES_MAX) {
> +		netdev_err(ndev, "tx max_coalesced_frames should be between %d and %d\n",
> +			   TX_COALESCED_FRAMES_MIN, TX_COALESCED_FRAMES_MAX);
> +		rc = -EINVAL;
> +		goto atomic_out;
> +	}
> +
> +	rx_ring_size = hdev->asic_funcs.get_rx_ring_size(port);
> +	if (coal->rx_max_coalesced_frames < RX_COALESCED_FRAMES_MIN ||
> +	    coal->rx_max_coalesced_frames >= rx_ring_size) {
> +		netdev_err(ndev, "rx max_coalesced_frames should be between %d and %d\n",
> +			   RX_COALESCED_FRAMES_MIN, rx_ring_size);
> +		rc = -EINVAL;
> +		goto atomic_out;
> +	}
> +
> +	aux_ops->ctrl_lock(aux_dev, port_idx);
> +
> +	port->tx_max_coalesced_frames = coal->tx_max_coalesced_frames;
> +	port->rx_coalesce_usecs = coal->rx_coalesce_usecs;
> +	port->rx_max_coalesced_frames = coal->rx_max_coalesced_frames;
> +
> +	rc = hdev->asic_funcs.set_coalesce(port);
> +
> +	aux_ops->ctrl_unlock(aux_dev, port_idx);
> +
> +atomic_out:
> +	atomic_set(&port->in_reset, 0);
> +	return rc;
> +}
> +
> +void hbl_en_ethtool_init_coalesce(struct hbl_en_port *port)
> +{
> +	port->rx_coalesce_usecs = CQ_ARM_TIMEOUT_USEC;
> +	port->rx_max_coalesced_frames = 1;
> +	port->tx_max_coalesced_frames = 1;
> +}
> +
> +static const struct ethtool_ops hbl_en_ethtool_ops_coalesce = {
> +	.supported_coalesce_params = ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_RX_MAX_FRAMES |
> +				     ETHTOOL_COALESCE_TX_MAX_FRAMES,
> +	.get_drvinfo = hbl_en_ethtool_get_drvinfo,
> +	.get_link = ethtool_op_get_link,
> +	.get_module_info = hbl_en_ethtool_get_module_info,
> +	.get_module_eeprom = hbl_en_ethtool_get_module_eeprom,
> +	.get_priv_flags = hbl_en_ethtool_get_priv_flags,
> +	.set_priv_flags = hbl_en_ethtool_set_priv_flags,
> +	.get_link_ksettings = hbl_en_ethtool_get_link_ksettings,
> +	.set_link_ksettings = hbl_en_ethtool_set_link_ksettings,
> +	.get_sset_count = hbl_en_ethtool_get_sset_count,
> +	.get_strings = hbl_en_ethtool_get_strings,
> +	.get_ethtool_stats = hbl_en_ethtool_get_ethtool_stats,
> +	.get_coalesce = hbl_en_ethtool_get_coalesce,
> +	.set_coalesce = hbl_en_ethtool_set_coalesce,
> +};
> +
> +const struct ethtool_ops *hbl_en_ethtool_get_ops(struct net_device *ndev)
> +{
> +	return &hbl_en_ethtool_ops_coalesce;
> +}

Andrew Lunn June 16, 2024, 1:04 a.m. UTC | #7

On Fri, Jun 14, 2024 at 03:48:43PM -0700, Joe Damato wrote:
> On Thu, Jun 13, 2024 at 11:22:02AM +0300, Omer Shpigelman wrote:
> > This ethernet driver is initialized via auxiliary bus by the hbl_cn
> > driver.
> > It serves mainly for control operations that are needed for AI scaling.
> > 
> > Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> > Co-developed-by: Abhilash K V <kvabhilash@habana.ai>
> > Signed-off-by: Abhilash K V <kvabhilash@habana.ai>
> > Co-developed-by: Andrey Agranovich <aagranovich@habana.ai>
> > Signed-off-by: Andrey Agranovich <aagranovich@habana.ai>
> > Co-developed-by: Bharat Jauhari <bjauhari@habana.ai>
> > Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
> > Co-developed-by: David Meriin <dmeriin@habana.ai>
> > Signed-off-by: David Meriin <dmeriin@habana.ai>
> > Co-developed-by: Sagiv Ozeri <sozeri@habana.ai>
> > Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
> > Co-developed-by: Zvika Yehudai <zyehudai@habana.ai>

Hi Joe

Please trim emails to include just the relevant context when
replying. It is hard to see your comments, and so it is likely some
will be missed, and you will need to make the same comment on v2.

     Andrew

Andrew Lunn June 16, 2024, 1:08 a.m. UTC | #8

> > +static void hbl_en_reset_stats(struct hbl_aux_dev *aux_dev, u32 port_idx)
> > +{
> > +	struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> > +
> > +	port->net_stats.rx_packets = 0;
> > +	port->net_stats.tx_packets = 0;
> > +	port->net_stats.rx_bytes = 0;
> > +	port->net_stats.tx_bytes = 0;
> > +	port->net_stats.tx_errors = 0;
> > +	atomic64_set(&port->net_stats.rx_dropped, 0);
> > +	atomic64_set(&port->net_stats.tx_dropped, 0);
> 
> per-cpu variable is better?

Please trim replies to just the needed context. Is this the only
comment in this 2300 line email? Do i need to keep searching for more
comments?

	Andrew

Omer Shpigelman June 18, 2024, 6:58 a.m. UTC | #9

On 6/14/24 00:49, Andrew Lunn wrote:
> [Some people who received this message don't often get email from andrew@lunn.ch. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
>> +static int hbl_en_napi_poll(struct napi_struct *napi, int budget);
>> +static int hbl_en_port_open(struct hbl_en_port *port);
> 
> When you do the Intel internal review, i expect this is crop up. No
> forward declarations please. Put the code in the right order so they
> are not needed.
> 

I'll try to get rid of these forward declarations by re-odering the functions.

>> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
>> +{
>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>> +     struct net_device *ndev = port->ndev;
>> +     struct in_device *in_dev;
>> +     struct in_ifaddr *ifa;
>> +     int rc = 0;
>> +
>> +     /* for the case where no src IP is configured */
>> +     *src_ip = 0;
>> +
>> +     /* rtnl lock should be acquired in relevant flows before taking configuration lock */
>> +     if (!rtnl_is_locked()) {
>> +             netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
>> +             rc = -EFAULT;
>> +             goto out;
>> +     }
> 
> You will find all other drivers just do:
> 
>         ASSERT_RTNL().
> 
> If your locking is broken, you are probably dead anyway, so you might
> as well keep going and try to explode in the most interesting way
> possible.
> 

Thanks, I'll switch to ASSERT_RTNL().

>> +static void hbl_en_reset_stats(struct hbl_aux_dev *aux_dev, u32 port_idx)
>> +{
>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>> +
>> +     port->net_stats.rx_packets = 0;
>> +     port->net_stats.tx_packets = 0;
>> +     port->net_stats.rx_bytes = 0;
>> +     port->net_stats.tx_bytes = 0;
>> +     port->net_stats.tx_errors = 0;
>> +     atomic64_set(&port->net_stats.rx_dropped, 0);
>> +     atomic64_set(&port->net_stats.tx_dropped, 0);
> 
> Why atomic64_set? Atomics are expensive, so you should not be using
> them. netdev has other cheaper methods, which other Intel developers
> should be happy to tell you all about.
> 

We used atomic64_set as these counters are updated also from non-netdev
flow in case of HW errors.
I can switch to use u64_stats_sync is that's the intention.
I'm about to start a review process with Intel developers regardless of
this issue and I'll bring this up too.

>> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
>> +{
>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>> +     struct net_device *ndev = port->ndev;
>> +     u32 mtu;
>> +
>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>> +             netdev_err(ndev, "port is in reset, can't get MTU\n");
>> +             return 0;
>> +     }
>> +
>> +     mtu = ndev->mtu;
> 
> I think you need a better error message. All this does is access
> ndev->mtu. What does it matter if the port is in reset? You don't
> access it.
> 

This function is called from the CN driver to get the current MTU in order
to configure it to the HW, for exmaple when configuring an IB QP. The MTU
value might be changed by user while we execute this function. Such an MTU
change requires port reset.
Hence, if the port is under reset we cannot be sure what is the MTU value.
Since the user should not change the MTU while QPs are being configured
(but we cannot block this flow either), we report an error because the MTU
value cannot be retrieved.
The other option to read the MTU value without checking for an in-progress
reset flow but in that case the MTU value might be incorrect.

>> +static int hbl_en_close(struct net_device *netdev)
>> +{
>> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
>> +     struct hbl_en_device *hdev = port->hdev;
>> +     ktime_t timeout;
>> +
>> +     /* Looks like the return value of this function is not checked, so we can't just return
>> +      * EBUSY if the port is under reset. We need to wait until the reset is finished and then
>> +      * close the port. Otherwise the netdev will set the port as closed although port_close()
>> +      * wasn't called. Only if we waited long enough and the reset hasn't finished, we can return
>> +      * an error without actually closing the port as it is a fatal flow anyway.
>> +      */
>> +     timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
>> +     while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>> +             /* If this is called from unregister_netdev() then the port was already closed and
>> +              * hence we can safely return.
>> +              * We could have just check the port_open boolean, but that might hide some future
>> +              * bugs. Hence it is better to use a dedicated flag for that.
>> +              */
>> +             if (READ_ONCE(hdev->in_teardown))
>> +                     return 0;
>> +
>> +             usleep_range(50, 200);
>> +             if (ktime_compare(ktime_get(), timeout) > 0) {
>> +                     netdev_crit(netdev,
>> +                                 "Timeout while waiting for port to finish reset, can't close it\n"
>> +                                 );
>> +                     return -EBUSY;
>> +             }
> 
> This has the usual bug. Please look at include/linux/iopoll.h.
> 

I'll take a look, thanks.

>> +             timeout = ktime_add_ms(ktime_get(), PORT_RESET_TIMEOUT_MSEC);
>> +             while (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>> +                     usleep_range(50, 200);
>> +                     if (ktime_compare(ktime_get(), timeout) > 0) {
>> +                             netdev_crit(port->ndev,
>> +                                         "Timeout while waiting for port %d to finish reset\n",
>> +                                         port->idx);
>> +                             break;
>> +                     }
>> +             }
> 
> and again. Don't roll your own timeout loops like this, use the core
> version.
> 

I will look for some core alternative.

>> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
>> +{
>> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
>> +     int rc = 0;
>> +
>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>> +             netdev_err(netdev, "port is in reset, can't change MTU\n");
>> +             return -EBUSY;
>> +     }
>> +
>> +     if (netif_running(port->ndev)) {
>> +             hbl_en_port_close(port);
>> +
>> +             /* Sleep in order to let obsolete events to be dropped before re-opening the port */
>> +             msleep(20);
>> +
>> +             netdev->mtu = new_mtu;
>> +
>> +             rc = hbl_en_port_open(port);
>> +             if (rc)
>> +                     netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
> 
> Does that mean the port is FUBAR?
> 
> Most operations like this are expected to roll back to the previous
> working configuration on failure. So if changing the MTU requires new
> buffers in your ring, you should first allocate the new buffers, then
> free the old buffers, so that if allocation fails, you still have
> buffers, and the device can continue operating.
> 

A failure in opening a port is a fatal error. It shouldn't happen. This is
not something we wish to recover from.
This kind of an error indicates a severe system error that will usually
require a driver removal and reload anyway.

>> +module_param(poll_enable, bool, 0444);
>> +MODULE_PARM_DESC(poll_enable,
>> +              "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
> 
> Module parameters are not liked. This probably needs to go away.
> 

I see that various vendors under net/ethernet/* use module parameters.
Can't we add another one?

>> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
>> +{
>> +     modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
>> +     modinfo->type = ETH_MODULE_SFF_8636;
> 
> Is this an SFF, not an SFP? How else can you know what module it is
> without doing an I2C transfer to ask the module what it is?
> 

The current type is SFF and it is unlikely to be changed.

>> +static int hbl_en_ethtool_get_module_eeprom(struct net_device *ndev, struct ethtool_eeprom *ee,
>> +                                         u8 *data)
>> +{
> 
> This is the old API. Please update to the new API so there is access
> to all the pages of the SFF/SFP.
> 

Are you referring to get_module_eeprom_by_page()? if so, then it is not
supported by our FW, we read the entire data on device load.
However, I can hide that behind the new API and return only the
requested page if that's the intention.

>> +static int hbl_en_ethtool_get_link_ksettings(struct net_device *ndev,
>> +                                          struct ethtool_link_ksettings *cmd)
>> +{
>> +     struct hbl_en_aux_ops *aux_ops;
>> +     struct hbl_aux_dev *aux_dev;
>> +     struct hbl_en_device *hdev;
>> +     struct hbl_en_port *port;
>> +     u32 port_idx, speed;
>> +
>> +     port = hbl_netdev_priv(ndev);
>> +     hdev = port->hdev;
>> +     port_idx = port->idx;
>> +     aux_dev = hdev->aux_dev;
>> +     aux_ops = aux_dev->aux_ops;
>> +     speed = aux_ops->get_speed(aux_dev, port_idx);
>> +
>> +     cmd->base.speed = speed;
>> +     cmd->base.duplex = DUPLEX_FULL;
>> +
>> +     ethtool_link_ksettings_zero_link_mode(cmd, supported);
>> +     ethtool_link_ksettings_zero_link_mode(cmd, advertising);
>> +
>> +     switch (speed) {
>> +     case SPEED_100000:
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseCR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseSR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseKR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 100000baseLR4_ER4_Full);
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseCR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseSR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseKR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 100000baseLR4_ER4_Full);
>> +
>> +             cmd->base.port = PORT_FIBRE;
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, FIBRE);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, FIBRE);
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, Backplane);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, Backplane);
>> +             break;
>> +     case SPEED_50000:
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseSR2_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseCR2_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 50000baseKR2_Full);
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseSR2_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseCR2_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 50000baseKR2_Full);
>> +             break;
>> +     case SPEED_25000:
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 25000baseCR_Full);
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 25000baseCR_Full);
>> +             break;
>> +     case SPEED_200000:
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseCR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 200000baseKR4_Full);
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseCR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 200000baseKR4_Full);
>> +             break;
>> +     case SPEED_400000:
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseCR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, supported, 400000baseKR4_Full);
>> +
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseCR4_Full);
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, 400000baseKR4_Full);
>> +             break;
>> +     default:
>> +             netdev_err(port->ndev, "unknown speed %d\n", speed);
>> +             return -EFAULT;
>> +     }
>> +
>> +     ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
>> +
>> +     if (port->auto_neg_enable) {
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
>> +             cmd->base.autoneg = AUTONEG_ENABLE;
>> +             if (port->auto_neg_resolved)
>> +                     ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
> 
> That looks odd. Care to explain?
> 

The HW of all of our ports supports autoneg.
But in addition, the ports are divided to two groups:
internal: ports which are connected to other Gaudi2 ports in the same server.
external: ports which are connected to an external switch.
Only internal ports use autoneg.
The ports mask which sets each port as internal/external is fetched from
the FW on device load.

>> +     } else {
>> +             cmd->base.autoneg = AUTONEG_DISABLE;
>> +     }
>> +
>> +     ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
>> +
>> +     if (port->pfc_enable)
>> +             ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);
> 
> And is suspect that is wrong. Everybody gets pause wrong. Please
> double check my previous posts about pause.
> 

Our HW supports Pause frames.
But, PFC can be disabled via lldptool for exmaple, so we won't advertise
it.
I'll try to find more info about it, but can you please share what's wrong
with the curent code?
BTW I will change it to Asym_Pause as we support Tx pause frames as well.

>> +     if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
>> +             netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
>> +             rc = -EFAULT;
>> +             goto out;
> 
> Don't say you support autoneg in supported if that is the case.
> 
> And EFAULT is about memory problems. EINVAL, maybe EPERM? or
> EOPNOTSUPP.
> 
>         Andrew

Yeah, should be switched to EPERM/EOPNOTSUPP.
Regarding the support of autoneg - the HW supports autoneg but it might be
disabled by the FW. Hence we might not be able to switch it on.

Omer Shpigelman June 18, 2024, 11:16 a.m. UTC | #10

On 6/15/24 13:55, Zhu Yanjun wrote:
> [You don't often get email from yanjun.zhu@linux.dev. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> 在 2024/6/13 16:22, Omer Shpigelman 写道:
>> +
>> +/* This function should be called after ctrl_lock was taken */
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/dev-tools/sparse.rst?h=v6.10-rc3#n64
>
> "
> __must_hold - The specified lock is held on function entry and exit.
> "
> Add "__must_hold" to confirm "The specified lock is held on function
> entry and exit." ?
>
> Zhu Yanjun

Thanks, I'll add it.

Andrew Lunn June 18, 2024, 2:19 p.m. UTC | #11

> >> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
> >> +{
> >> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> >> +     struct net_device *ndev = port->ndev;
> >> +     u32 mtu;
> >> +
> >> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> >> +             netdev_err(ndev, "port is in reset, can't get MTU\n");
> >> +             return 0;
> >> +     }
> >> +
> >> +     mtu = ndev->mtu;
> > 
> > I think you need a better error message. All this does is access
> > ndev->mtu. What does it matter if the port is in reset? You don't
> > access it.
> > 
> 
> This function is called from the CN driver to get the current MTU in order
> to configure it to the HW, for exmaple when configuring an IB QP. The MTU
> value might be changed by user while we execute this function.

Change of MTU will happen while holding RTNL. Why not simply hold RTNL
while programming the hardware? That is the normal pattern for MAC
drivers.

> >> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
> >> +{
> >> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
> >> +     int rc = 0;
> >> +
> >> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> >> +             netdev_err(netdev, "port is in reset, can't change MTU\n");
> >> +             return -EBUSY;
> >> +     }
> >> +
> >> +     if (netif_running(port->ndev)) {
> >> +             hbl_en_port_close(port);
> >> +
> >> +             /* Sleep in order to let obsolete events to be dropped before re-opening the port */
> >> +             msleep(20);
> >> +
> >> +             netdev->mtu = new_mtu;
> >> +
> >> +             rc = hbl_en_port_open(port);
> >> +             if (rc)
> >> +                     netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
> > 
> > Does that mean the port is FUBAR?
> > 
> > Most operations like this are expected to roll back to the previous
> > working configuration on failure. So if changing the MTU requires new
> > buffers in your ring, you should first allocate the new buffers, then
> > free the old buffers, so that if allocation fails, you still have
> > buffers, and the device can continue operating.
> > 
> 
> A failure in opening a port is a fatal error. It shouldn't happen. This is
> not something we wish to recover from.

What could cause open to fail? Is memory allocated?

> This kind of an error indicates a severe system error that will usually
> require a driver removal and reload anyway.
> 
> >> +module_param(poll_enable, bool, 0444);
> >> +MODULE_PARM_DESC(poll_enable,
> >> +              "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
> > 
> > Module parameters are not liked. This probably needs to go away.
> > 
> 
> I see that various vendors under net/ethernet/* use module parameters.
> Can't we add another one?

Look at the history of those module parameters. Do you see many added
in the last year? 5 years?

> >> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
> >> +{
> >> +     modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
> >> +     modinfo->type = ETH_MODULE_SFF_8636;
> > 
> > Is this an SFF, not an SFP? How else can you know what module it is
> > without doing an I2C transfer to ask the module what it is?
> > 
> 
> The current type is SFF and it is unlikely to be changed.

Well, SFF are soldered to the board, so yes, it is unlikely to
change...

Please add a comment that this is an SFF, not an SFP, so is soldered
to the board, and so it is known to be an 8636 compatible device.

> Are you referring to get_module_eeprom_by_page()? if so, then it is not
> supported by our FW, we read the entire data on device load.
> However, I can hide that behind the new API and return only the
> requested page if that's the intention.

Well, if your firmware is so limited, then you might as well stick to
the old API, and let the core do the conversion to the legacy
code. But i'm surprised you don't allow access to the temperature
sensors, received signal strength, voltages etc, which could be
exported via HWMON.

> >> +                     ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
> > 
> > That looks odd. Care to explain?
> > 
> 
> The HW of all of our ports supports autoneg.
> But in addition, the ports are divided to two groups:
> internal: ports which are connected to other Gaudi2 ports in the same server.
> external: ports which are connected to an external switch.
> Only internal ports use autoneg.
> The ports mask which sets each port as internal/external is fetched from
> the FW on device load.

That is not what i meant. lc_advertising should indicate the link
modes the peer is advertising. If this was a copper link, it typically
would contain 10BaseT-Half, 10BaseT-Full, 100BaseT-Half,
100BaseT-Full, 1000BaseT-Half. Setting the Autoneg bit is pointless,
since the peer must be advertising in order that lp_advertising has a
value!

> Our HW supports Pause frames.
> But, PFC can be disabled via lldptool for exmaple, so we won't advertise
> it.

Please also implement the standard netdev way of configuring pause.
When you do that, you should start to understand how pause can be
negotiated, or forced. That is what most get wrong.

> I'll try to find more info about it, but can you please share what's wrong
> with the curent code?
> BTW I will change it to Asym_Pause as we support Tx pause frames as well.
> 
> >> +     if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
> >> +             netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
> >> +             rc = -EFAULT;
> >> +             goto out;
> > 
> > Don't say you support autoneg in supported if that is the case.
> > 
> > And EFAULT is about memory problems. EINVAL, maybe EPERM? or
> > EOPNOTSUPP.
> > 
> >         Andrew
> 
> Yeah, should be switched to EPERM/EOPNOTSUPP.
> Regarding the support of autoneg - the HW supports autoneg but it might be
> disabled by the FW. Hence we might not be able to switch it on.

No problem, ask the firmware what it is doing, and return the reality
in ksetting. Only say you support autoneg if your firmware allows you
to perform autoneg.

   Andrew

Omer Shpigelman June 18, 2024, 7:37 p.m. UTC | #12

On 6/15/24 01:48, Joe Damato wrote:
> [You don't often get email from jdamato@fastly.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> On Thu, Jun 13, 2024 at 11:22:02AM +0300, Omer Shpigelman wrote:
>> This ethernet driver is initialized via auxiliary bus by the hbl_cn
>> driver.
>> It serves mainly for control operations that are needed for AI scaling.
>>
>> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
>> Co-developed-by: Abhilash K V <kvabhilash@habana.ai>
>> Signed-off-by: Abhilash K V <kvabhilash@habana.ai>
>> Co-developed-by: Andrey Agranovich <aagranovich@habana.ai>
>> Signed-off-by: Andrey Agranovich <aagranovich@habana.ai>
>> Co-developed-by: Bharat Jauhari <bjauhari@habana.ai>
>> Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
>> Co-developed-by: David Meriin <dmeriin@habana.ai>
>> Signed-off-by: David Meriin <dmeriin@habana.ai>
>> Co-developed-by: Sagiv Ozeri <sozeri@habana.ai>
>> Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
>> Co-developed-by: Zvika Yehudai <zyehudai@habana.ai>
>> Signed-off-by: Zvika Yehudai <zyehudai@habana.ai>

<...>

>> +             if (hdev->poll_enable)
>> +                     skb = __netdev_alloc_skb_ip_align(ndev, pkt_size, GFP_KERNEL);
>> +             else
>> +                     skb = napi_alloc_skb(&port->napi, pkt_size);
>> +
>> +             if (!skb) {
>> +                     atomic64_inc(&port->net_stats.rx_dropped);
> 
> It seems like buffer exhaustion (!skb) would be rx_missed_errors?
> 
> The documentation in include/uapi/linux/if_link.h:
> 
>  * @rx_dropped: Number of packets received but not processed,
>  *   e.g. due to lack of resources or unsupported protocol.
>  *   For hardware interfaces this counter may include packets discarded
>  *   due to L2 address filtering but should not include packets dropped
>  *   by the device due to buffer exhaustion which are counted separately in
>  *   @rx_missed_errors (since procfs folds those two counters together).
> 
> But, I don't know much about your hardware so I could be wrong.
> 

Per my understanding rx_dropped should be used here. According the doc you
posted, rx_dropped should be used in case of dropped packets due to lack
of resources, while rx_missed_errors should be used for packets that were
dropped by the device due to buffer exhaustion, not by the driver.
Please correct me if I misunderstood something.

>> +                     break;
>> +             }
>> +
>> +             skb_copy_to_linear_data(skb, pkt_addr, pkt_size);
>> +             skb_put(skb, pkt_size);
>> +
>> +             if (is_pkt_swap_enabled(hdev))
>> +                     dump_swap_pkt(port, skb);
>> +
>> +             skb->protocol = eth_type_trans(skb, ndev);
>> +
>> +             /* Zero the packet buffer memory to avoid leak in case of wrong
>> +              * size is used when next packet populates the same memory
>> +              */
>> +             memset(pkt_addr, 0, pkt_size);
>> +
>> +             /* polling is done in thread context and hence BH should be disabled */
>> +             if (hdev->poll_enable)
>> +                     local_bh_disable();
>> +
>> +             rc = netif_receive_skb(skb);
> 
> Is there any reason in particular to call netif_receive_skb instead of
> napi_gro_receive ?
> 

As you can see, we also support polling mode which is a non-NAPI flow.
We could use napi_gro_receive() for NAPI flow and netif_receive_skb() for
polling mode but we don't support RX checksum offload anyway.

>> +
>> +             if (hdev->poll_enable)
>> +                     local_bh_enable();

<...>

>> +     pkt_count = hbl_en_handle_rx(port, budget);
>> +
>> +     /* If budget not fully consumed, exit the polling mode */
>> +     if (pkt_count < budget) {
>> +             napi_complete_done(napi, pkt_count);
> 
> I believe this code might be incorrect and that it should be:
> 
>   if (napi_complete_done(napi, pkt_done))
>      hdev->asic_funcs.reenable_rx_irq(port);
>

Thanks, I'll add the condition.

Omer Shpigelman June 18, 2024, 7:39 p.m. UTC | #13

On 6/15/24 03:16, Stephen Hemminger wrote:
> [You don't often get email from stephen@networkplumber.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
>> +
>> +/* get the src IP as it is done in devinet_ioctl() */
>> +static int hbl_en_get_src_ip(struct hbl_aux_dev *aux_dev, u32 port_idx, u32 *src_ip)
>> +{
>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>> +     struct net_device *ndev = port->ndev;
>> +     struct in_device *in_dev;
>> +     struct in_ifaddr *ifa;
>> +     int rc = 0;
>> +
>> +     /* for the case where no src IP is configured */
>> +     *src_ip = 0;
>> +
>> +     /* rtnl lock should be acquired in relevant flows before taking configuration lock */
>> +     if (!rtnl_is_locked()) {
>> +             netdev_err(port->ndev, "Rtnl lock is not acquired, can't proceed\n");
>> +             rc = -EFAULT;
>> +             goto out;
>> +     }
>> +
>> +     in_dev = __in_dev_get_rtnl(ndev);
>> +     if (!in_dev) {
>> +             netdev_err(port->ndev, "Failed to get IPv4 struct\n");
>> +             rc = -EFAULT;
>> +             goto out;
>> +     }
>> +
>> +     ifa = rtnl_dereference(in_dev->ifa_list);
>> +
>> +     while (ifa) {
>> +             if (!strcmp(ndev->name, ifa->ifa_label)) {
>> +                     /* convert the BE to native and later on it will be
>> +                      * written to the HW as LE in QPC_SET
>> +                      */
>> +                     *src_ip = be32_to_cpu(ifa->ifa_local);
>> +                     break;
>> +             }
>> +             ifa = rtnl_dereference(ifa->ifa_next);
>> +     }
>> +out:
>> +     return rc;
>> +}
> 
> Does this device require IPv4? What about users and infrastructures that use IPv6 only?
> IPv4 is legacy at this point.

Gaudi2 supports IPv4 only.

Stephen Hemminger June 18, 2024, 9:19 p.m. UTC | #14

On Tue, 18 Jun 2024 19:37:36 +0000
Omer Shpigelman <oshpigelman@habana.ai> wrote:

> > 
> > Is there any reason in particular to call netif_receive_skb instead of
> > napi_gro_receive ?
> >   
> 
> As you can see, we also support polling mode which is a non-NAPI flow.
> We could use napi_gro_receive() for NAPI flow and netif_receive_skb() for
> polling mode but we don't support RX checksum offload anyway.

 Why non-NAPI? I thought current netdev policy was all drivers should
use NAPI.

Omer Shpigelman June 19, 2024, 7:16 a.m. UTC | #15

On 6/18/24 17:19, Andrew Lunn wrote:
>>>> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
>>>> +{
>>>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>>>> +     struct net_device *ndev = port->ndev;
>>>> +     u32 mtu;
>>>> +
>>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>>>> +             netdev_err(ndev, "port is in reset, can't get MTU\n");
>>>> +             return 0;
>>>> +     }
>>>> +
>>>> +     mtu = ndev->mtu;
>>>
>>> I think you need a better error message. All this does is access
>>> ndev->mtu. What does it matter if the port is in reset? You don't
>>> access it.
>>>
>>
>> This function is called from the CN driver to get the current MTU in order
>> to configure it to the HW, for exmaple when configuring an IB QP. The MTU
>> value might be changed by user while we execute this function.
> 
> Change of MTU will happen while holding RTNL. Why not simply hold RTNL
> while programming the hardware? That is the normal pattern for MAC
> drivers.
>

I can hold the RTNL lock while configuring the HW but it seems like a big
overhead. Configuring the HW might take some time due to QP draining or
cache invalidation.
To me it seems unnecessary but if that's the common way then I'll change
it.
 
>>>> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
>>>> +{
>>>> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
>>>> +     int rc = 0;
>>>> +
>>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>>>> +             netdev_err(netdev, "port is in reset, can't change MTU\n");
>>>> +             return -EBUSY;
>>>> +     }
>>>> +
>>>> +     if (netif_running(port->ndev)) {
>>>> +             hbl_en_port_close(port);
>>>> +
>>>> +             /* Sleep in order to let obsolete events to be dropped before re-opening the port */
>>>> +             msleep(20);
>>>> +
>>>> +             netdev->mtu = new_mtu;
>>>> +
>>>> +             rc = hbl_en_port_open(port);
>>>> +             if (rc)
>>>> +                     netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
>>>
>>> Does that mean the port is FUBAR?
>>>
>>> Most operations like this are expected to roll back to the previous
>>> working configuration on failure. So if changing the MTU requires new
>>> buffers in your ring, you should first allocate the new buffers, then
>>> free the old buffers, so that if allocation fails, you still have
>>> buffers, and the device can continue operating.
>>>
>>
>> A failure in opening a port is a fatal error. It shouldn't happen. This is
>> not something we wish to recover from.
> 
> What could cause open to fail? Is memory allocated?
> 

Memory is allocated but it is freed in case of a failure.
Port opening can fail due to other reasons as well like some HW timeout
while configuring the ETH QP.

>> This kind of an error indicates a severe system error that will usually
>> require a driver removal and reload anyway.
>>
>>>> +module_param(poll_enable, bool, 0444);
>>>> +MODULE_PARM_DESC(poll_enable,
>>>> +              "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
>>>
>>> Module parameters are not liked. This probably needs to go away.
>>>
>>
>> I see that various vendors under net/ethernet/* use module parameters.
>> Can't we add another one?
> 
> Look at the history of those module parameters. Do you see many added
> in the last year? 5 years?
> 

I didn't check that prior to my submit. Regarding this "no new module
parameters allowed" rule, is that documented anywhere? if not, is that the
common practice? not to try to do something that was not done recently?
how "recently" is defined?
I just want to clarify this because it's hard to handle these submissions
when we write some code based on existing examples but then we are
rejected because "we don't do that here anymore".
I want to avoid future cases of this mismatch.

>>>> +static int hbl_en_ethtool_get_module_info(struct net_device *ndev, struct ethtool_modinfo *modinfo)
>>>> +{
>>>> +     modinfo->eeprom_len = ETH_MODULE_SFF_8636_LEN;
>>>> +     modinfo->type = ETH_MODULE_SFF_8636;
>>>
>>> Is this an SFF, not an SFP? How else can you know what module it is
>>> without doing an I2C transfer to ask the module what it is?
>>>
>>
>> The current type is SFF and it is unlikely to be changed.
> 
> Well, SFF are soldered to the board, so yes, it is unlikely to
> change...
> 
> Please add a comment that this is an SFF, not an SFP, so is soldered
> to the board, and so it is known to be an 8636 compatible device.
> 

I'll add.

>> Are you referring to get_module_eeprom_by_page()? if so, then it is not
>> supported by our FW, we read the entire data on device load.
>> However, I can hide that behind the new API and return only the
>> requested page if that's the intention.
> 
> Well, if your firmware is so limited, then you might as well stick to
> the old API, and let the core do the conversion to the legacy
> code. But i'm surprised you don't allow access to the temperature
> sensors, received signal strength, voltages etc, which could be
> exported via HWMON.
> 

I'll stick to the old API.
Regaring the sensors, our compute driver (under accel/habanalabs) exports
them via HWMON.

>>>> +                     ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
>>>
>>> That looks odd. Care to explain?
>>>
>>
>> The HW of all of our ports supports autoneg.
>> But in addition, the ports are divided to two groups:
>> internal: ports which are connected to other Gaudi2 ports in the same server.
>> external: ports which are connected to an external switch.
>> Only internal ports use autoneg.
>> The ports mask which sets each port as internal/external is fetched from
>> the FW on device load.
> 
> That is not what i meant. lc_advertising should indicate the link
> modes the peer is advertising. If this was a copper link, it typically
> would contain 10BaseT-Half, 10BaseT-Full, 100BaseT-Half,
> 100BaseT-Full, 1000BaseT-Half. Setting the Autoneg bit is pointless,
> since the peer must be advertising in order that lp_advertising has a
> value!
> 

Sorry, but I don't get this. The problem is the setting of the Autoneg bit
in lp_advertising? is that redundant? I see that other vendors set it too
in case that Autoneg was completed.

>> Our HW supports Pause frames.
>> But, PFC can be disabled via lldptool for exmaple, so we won't advertise
>> it.
> 
> Please also implement the standard netdev way of configuring pause.
> When you do that, you should start to understand how pause can be
> negotiated, or forced. That is what most get wrong.
> 

Let me revisit this.

>> I'll try to find more info about it, but can you please share what's wrong
>> with the curent code?
>> BTW I will change it to Asym_Pause as we support Tx pause frames as well.
>>
>>>> +     if (auto_neg && !(hdev->auto_neg_mask & BIT(port_idx))) {
>>>> +             netdev_err(port->ndev, "port autoneg is disabled by BMC\n");
>>>> +             rc = -EFAULT;
>>>> +             goto out;
>>>
>>> Don't say you support autoneg in supported if that is the case.
>>>
>>> And EFAULT is about memory problems. EINVAL, maybe EPERM? or
>>> EOPNOTSUPP.
>>>
>>>         Andrew
>>
>> Yeah, should be switched to EPERM/EOPNOTSUPP.
>> Regarding the support of autoneg - the HW supports autoneg but it might be
>> disabled by the FW. Hence we might not be able to switch it on.
> 
> No problem, ask the firmware what it is doing, and return the reality
> in ksetting. Only say you support autoneg if your firmware allows you
> to perform autoneg.
> 
>    Andrew
> 

Ok, I'll set the Autoneg support bit properly.

Przemek Kitszel June 19, 2024, 8:01 a.m. UTC | #16

On 6/19/24 09:16, Omer Shpigelman wrote:
> On 6/18/24 17:19, Andrew Lunn wrote:

[...]

>>>>> +module_param(poll_enable, bool, 0444);
>>>>> +MODULE_PARM_DESC(poll_enable,
>>>>> +              "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
>>>>
>>>> Module parameters are not liked. This probably needs to go away.
>>>>
>>>
>>> I see that various vendors under net/ethernet/* use module parameters.
>>> Can't we add another one?
>>
>> Look at the history of those module parameters. Do you see many added
>> in the last year? 5 years?
>>
> 
> I didn't check that prior to my submit. Regarding this "no new module
> parameters allowed" rule, is that documented anywhere? if not, is that the
> common practice? not to try to do something that was not done recently?
> how "recently" is defined?
> I just want to clarify this because it's hard to handle these submissions
> when we write some code based on existing examples but then we are
> rejected because "we don't do that here anymore".
> I want to avoid future cases of this mismatch.
> 

best way is to read netdev ML, that way you will learn what interfaces
are frowned upon and which are outright banned, sometimes you could
judge yourself knowing which interfaces are most developed recently

in this module params example - they were introduced to allow init phase
configuration of the device, that could not be postponed, what in the
general case sounds like a workaround; hardest cases include huge swaths
of (physical continuous) memory to be allocated, but for that there are
now device tree binding solutions; more typical cases for networking are
resolved via devlink reload

devlink parms are also the thing that should be used as a default for
new parameters, the best if given parameter is not driver specific quirk

poll_enable sounds like something that should be a common param,
but you have to better describe what you mean there
(see napi_poll(), "Enable Rx polling" would mean to use that as default,
do you mean busy polling or what?)

Omer Shpigelman June 19, 2024, 12:07 p.m. UTC | #17

On 6/15/24 03:10, Stephen Hemminger wrote:
> [You don't often get email from stephen@networkplumber.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> On Thu, 13 Jun 2024 11:22:02 +0300
> Omer Shpigelman <oshpigelman@habana.ai> wrote:
> 
>> +static int hbl_en_ports_reopen(struct hbl_aux_dev *aux_dev)
>> +{
>> +     struct hbl_en_device *hdev = aux_dev->priv;
>> +     struct hbl_en_port *port;
>> +     int rc = 0, i;
>> +
>> +     for (i = 0; i < hdev->max_num_of_ports; i++) {
>> +             if (!(hdev->ports_mask & BIT(i)))
>> +                     continue;
>> +
>> +             port = &hdev->ports[i];
>> +
>> +             /* It could be that the port was shutdown by 'ip link set down' and there is no need
>> +              * in reopening it.
>> +              * Since we mark the ports as in reset even if they are disabled, we clear the flag
>> +              * here anyway.
>> +              * See hbl_en_ports_stop_prepare() for more info.
>> +              */
>> +             if (!netif_running(port->ndev)) {
>> +                     atomic_set(&port->in_reset, 0);
>> +                     continue;
>> +             }
>> +
> 
> Rather than duplicating network device state in your own flags, it would be better to use
> existing infrastructure. Read Documentation/networking/operstates.rst
> 
> Then you could also get rid of the kludge timer stuff in hbl_en_close().
> 

I think that additional explanation is needed here.
In addition to netdev flows, we also support an internal reset flow
(that's what the in_reset boolean indicates).
Our NIC driver is an extension of the compute driver (they share the same
HW) and a reset flow might be needed due to some compute operation which
is entirely unrelated to the NIC driver. But we must not access the HW
while this reset flow is executed.
Note that this internal reset flow originates from the compute driver and
hence we might have parallel netdev operations that should be blocked
meanwhile.
The internal reset flow has 2 phases - teardown and re-init. This reopen
function is called during the re-init phase to restore the NIC ports, but
only if they were actually opened prior to the reset flow.
Regarding hbl_en_close() - during the port close we need to write to the
HW so due to the explanation above, also there we should wait for an
existing internal reset flow to finish first.
Let me know if that's clear enough and addresses your concerns.

Omer Shpigelman June 19, 2024, 12:13 p.m. UTC | #18

On 6/19/24 00:19, Stephen Hemminger wrote:
> On Tue, 18 Jun 2024 19:37:36 +0000
> Omer Shpigelman <oshpigelman@habana.ai> wrote:
> 
>>>
>>> Is there any reason in particular to call netif_receive_skb instead of
>>> napi_gro_receive ?
>>>   
>>
>> As you can see, we also support polling mode which is a non-NAPI flow.
>> We could use napi_gro_receive() for NAPI flow and netif_receive_skb() for
>> polling mode but we don't support RX checksum offload anyway.
> 
>  Why non-NAPI? I thought current netdev policy was all drivers should
> use NAPI.

If that's the current policy then I can remove this non-NAPI mode.
I see on another thread that module parameters are not allowed so
apparently I'll need to remove this polling mode anyway as it is set by a
module parameter.

Omer Shpigelman June 19, 2024, 12:15 p.m. UTC | #19

On 6/19/24 11:01, Przemek Kitszel wrote:
> On 6/19/24 09:16, Omer Shpigelman wrote:
>> On 6/18/24 17:19, Andrew Lunn wrote:
> 
> [...]
> 
>>>>>> +module_param(poll_enable, bool, 0444);
>>>>>> +MODULE_PARM_DESC(poll_enable,
>>>>>> +              "Enable Rx polling rather than IRQ + NAPI (0 = no, 1 = yes, default: no)");
>>>>>
>>>>> Module parameters are not liked. This probably needs to go away.
>>>>>
>>>>
>>>> I see that various vendors under net/ethernet/* use module parameters.
>>>> Can't we add another one?
>>>
>>> Look at the history of those module parameters. Do you see many added
>>> in the last year? 5 years?
>>>
>>
>> I didn't check that prior to my submit. Regarding this "no new module
>> parameters allowed" rule, is that documented anywhere? if not, is that the
>> common practice? not to try to do something that was not done recently?
>> how "recently" is defined?
>> I just want to clarify this because it's hard to handle these submissions
>> when we write some code based on existing examples but then we are
>> rejected because "we don't do that here anymore".
>> I want to avoid future cases of this mismatch.
>>
> 
> best way is to read netdev ML, that way you will learn what interfaces
> are frowned upon and which are outright banned, sometimes you could
> judge yourself knowing which interfaces are most developed recently
> 
> in this module params example - they were introduced to allow init phase
> configuration of the device, that could not be postponed, what in the
> general case sounds like a workaround; hardest cases include huge swaths
> of (physical continuous) memory to be allocated, but for that there are
> now device tree binding solutions; more typical cases for networking are
> resolved via devlink reload
> 
> devlink parms are also the thing that should be used as a default for
> new parameters, the best if given parameter is not driver specific quirk
> 
> poll_enable sounds like something that should be a common param,
> but you have to better describe what you mean there
> (see napi_poll(), "Enable Rx polling" would mean to use that as default,
> do you mean busy polling or what?)

Yes, busy polling.
But never mind, I was informed that NAPI must be used so apparently I'll
need to anyway remove this polling mode and its module parameter.

Jakub Kicinski June 19, 2024, 3:21 p.m. UTC | #20

On Wed, 19 Jun 2024 07:16:20 +0000 Omer Shpigelman wrote:
> >> Are you referring to get_module_eeprom_by_page()? if so, then it is not
> >> supported by our FW, we read the entire data on device load.
> >> However, I can hide that behind the new API and return only the
> >> requested page if that's the intention.  
> > 
> > Well, if your firmware is so limited, then you might as well stick to
> > the old API, and let the core do the conversion to the legacy
> > code. But i'm surprised you don't allow access to the temperature
> > sensors, received signal strength, voltages etc, which could be
> > exported via HWMON.
> 
> I'll stick to the old API.
> Regaring the sensors, our compute driver (under accel/habanalabs) exports
> them via HWMON.

You support 400G, you really need to give the user the ability
to access higher pages.

Andrew Lunn June 19, 2024, 3:40 p.m. UTC | #21

> > Does this device require IPv4? What about users and infrastructures that use IPv6 only?
> > IPv4 is legacy at this point.
> 
> Gaudi2 supports IPv4 only.

Really? I guess really old stuff, SLIP from 1988 does not support
IPv6, but i don't remember seeing anything from this century which
does not support passing IPv6 frames over a netdev.

     Andrew

Andrew Lunn June 19, 2024, 4:13 p.m. UTC | #22

On Wed, Jun 19, 2024 at 07:16:20AM +0000, Omer Shpigelman wrote:
> On 6/18/24 17:19, Andrew Lunn wrote:
> >>>> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
> >>>> +{
> >>>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
> >>>> +     struct net_device *ndev = port->ndev;
> >>>> +     u32 mtu;
> >>>> +
> >>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> >>>> +             netdev_err(ndev, "port is in reset, can't get MTU\n");
> >>>> +             return 0;
> >>>> +     }
> >>>> +
> >>>> +     mtu = ndev->mtu;
> >>>
> >>> I think you need a better error message. All this does is access
> >>> ndev->mtu. What does it matter if the port is in reset? You don't
> >>> access it.
> >>>
> >>
> >> This function is called from the CN driver to get the current MTU in order
> >> to configure it to the HW, for exmaple when configuring an IB QP. The MTU
> >> value might be changed by user while we execute this function.
> > 
> > Change of MTU will happen while holding RTNL. Why not simply hold RTNL
> > while programming the hardware? That is the normal pattern for MAC
> > drivers.
> >
> 
> I can hold the RTNL lock while configuring the HW but it seems like a big
> overhead. Configuring the HW might take some time due to QP draining or
> cache invalidation.

How often does the MTU change? Once, maybe twice on boot, and never
again? MTU change is not hot path. For slow path code, KISS is much
better, so it is likely to be correct. 

> To me it seems unnecessary but if that's the common way then I'll change
> it.
>  
> >>>> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
> >>>> +{
> >>>> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
> >>>> +     int rc = 0;
> >>>> +
> >>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
> >>>> +             netdev_err(netdev, "port is in reset, can't change MTU\n");
> >>>> +             return -EBUSY;
> >>>> +     }
> >>>> +
> >>>> +     if (netif_running(port->ndev)) {
> >>>> +             hbl_en_port_close(port);
> >>>> +
> >>>> +             /* Sleep in order to let obsolete events to be dropped before re-opening the port */
> >>>> +             msleep(20);
> >>>> +
> >>>> +             netdev->mtu = new_mtu;
> >>>> +
> >>>> +             rc = hbl_en_port_open(port);
> >>>> +             if (rc)
> >>>> +                     netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
> >>>
> >>> Does that mean the port is FUBAR?
> >>>
> >>> Most operations like this are expected to roll back to the previous
> >>> working configuration on failure. So if changing the MTU requires new
> >>> buffers in your ring, you should first allocate the new buffers, then
> >>> free the old buffers, so that if allocation fails, you still have
> >>> buffers, and the device can continue operating.
> >>>
> >>
> >> A failure in opening a port is a fatal error. It shouldn't happen. This is
> >> not something we wish to recover from.
> > 
> > What could cause open to fail? Is memory allocated?
> > 
> 
> Memory is allocated but it is freed in case of a failure.
> Port opening can fail due to other reasons as well like some HW timeout
> while configuring the ETH QP.

If the hardware timeout because the hardware is dead, there is nothing
you can do about it. Its dead.

But what about when the system is under memory pressure? You say it
allocates memory. What happens if those allocations fail. Does
changing the MTU take me from a working system to a dead system? It is
good practice to not kill a working system under situations like
memory pressure. You try to first allocate the memory you need to
handle the new MTU, and only if successful do you free existing memory
you no longer need. That means if you cannot allocate the needed
memory, you still have the old memory, you can keep the old MTU and
return -ENOMEM, and the system keeps running.

> I didn't check that prior to my submit. Regarding this "no new module
> parameters allowed" rule, is that documented anywhere?

Lots of emails that fly passed on the mailing list. Maybe once every
couple of months when a vendor tries to mainline a new driver without
reading the mailing list for a few months to know how mainline
actually works. I _guess_ Davem has been pushing back on module
parameters for 10 years? Maybe more.


> if not, is that the
> common practice? not to try to do something that was not done recently?
> how "recently" is defined?
> I just want to clarify this because it's hard to handle these submissions
> when we write some code based on existing examples but then we are
> rejected because "we don't do that here anymore".
> I want to avoid future cases of this mismatch.

My suggestion would be to spend 30 minutes every day reading patches
and review comment on the mailing list. Avoid making the same mistakes
others make, especially newbies to mainline, and see what others are
doing in the same niche as this device. 30 minutes might seem like a
lot, but how much time did you waste implementing polling mode, now
you are going to throw it away?

> >>>> +                     ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
> >>>
> >>> That looks odd. Care to explain?
> >>>
> >>
> >> The HW of all of our ports supports autoneg.
> >> But in addition, the ports are divided to two groups:
> >> internal: ports which are connected to other Gaudi2 ports in the same server.
> >> external: ports which are connected to an external switch.
> >> Only internal ports use autoneg.
> >> The ports mask which sets each port as internal/external is fetched from
> >> the FW on device load.
> > 
> > That is not what i meant. lc_advertising should indicate the link
> > modes the peer is advertising. If this was a copper link, it typically
> > would contain 10BaseT-Half, 10BaseT-Full, 100BaseT-Half,
> > 100BaseT-Full, 1000BaseT-Half. Setting the Autoneg bit is pointless,
> > since the peer must be advertising in order that lp_advertising has a
> > value!
> > 
> 
> Sorry, but I don't get this. The problem is the setting of the Autoneg bit
> in lp_advertising? is that redundant? I see that other vendors set it too
> in case that Autoneg was completed.


$ ethtool eth0
Settings for eth0:
	Supported ports: [ TP	 MII ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full

This is `supported`. The hardware can do these link modes.

	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes

It also support symmetric pause, and can do autoneg.

	Supported FEC modes: Not reported
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported

This is `advertising`, and is what this device is advertising to the
link partner. By default you copy supported into advertising, but the
user can use ethtool -s advertise N, where N is a list of link modes,
to change what is advertised to the link partner.

	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Full
	Link partner advertised pause frame use: Symmetric
	Link partner advertised auto-negotiation: Yes
	Link partner advertised FEC modes: Not reported

This is `lp_advertising`, what the link partner is advertising to this
device. Once you have this, you mask lp_advertising with advertising,
and generally pick the link mode with the highest bandwidth:

	Speed: 1000Mb/s
	Duplex: Full

So autoneg resolved to 1000baseT/Full

	Andrew

Omer Shpigelman June 20, 2024, 8:36 a.m. UTC | #23

On 6/19/24 18:40, Andrew Lunn wrote:
>>> Does this device require IPv4? What about users and infrastructures that use IPv6 only?
>>> IPv4 is legacy at this point.
>>
>> Gaudi2 supports IPv4 only.
> 
> Really? I guess really old stuff, SLIP from 1988 does not support
> IPv6, but i don't remember seeing anything from this century which
> does not support passing IPv6 frames over a netdev.
> 
>      Andrew

We support IPv6 for ETH, not for RDMA. For RDMA, IPv4 is good enough for
our use case so IPv6 was not required. Stephen's comment was about the
code where the CN driver fetches the port IP for configuring it to the
RDMA QPs. It is an RDMA specific path.

Omer Shpigelman June 20, 2024, 8:43 a.m. UTC | #24

On 6/19/24 18:21, Jakub Kicinski wrote:
> [Some people who received this message don't often get email from kuba@kernel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> On Wed, 19 Jun 2024 07:16:20 +0000 Omer Shpigelman wrote:
>>>> Are you referring to get_module_eeprom_by_page()? if so, then it is not
>>>> supported by our FW, we read the entire data on device load.
>>>> However, I can hide that behind the new API and return only the
>>>> requested page if that's the intention.
>>>
>>> Well, if your firmware is so limited, then you might as well stick to
>>> the old API, and let the core do the conversion to the legacy
>>> code. But i'm surprised you don't allow access to the temperature
>>> sensors, received signal strength, voltages etc, which could be
>>> exported via HWMON.
>>
>> I'll stick to the old API.
>> Regaring the sensors, our compute driver (under accel/habanalabs) exports
>> them via HWMON.
> 
> You support 400G, you really need to give the user the ability
> to access higher pages.

Actually the 200G and 400G modes in the ethtool code should be removed
from this patch set. They are not relevant for Gaudi2. I'll fix it in the
next version.

Jakub Kicinski June 20, 2024, 1:51 p.m. UTC | #25

On Thu, 20 Jun 2024 08:43:34 +0000 Omer Shpigelman wrote:
> > You support 400G, you really need to give the user the ability
> > to access higher pages.  
> 
> Actually the 200G and 400G modes in the ethtool code should be removed
> from this patch set. They are not relevant for Gaudi2. I'll fix it in the
> next version.

How do your customers / users check SFP diagnostics?

Andrew Lunn June 20, 2024, 7:14 p.m. UTC | #26

On Thu, Jun 20, 2024 at 06:51:35AM -0700, Jakub Kicinski wrote:
> On Thu, 20 Jun 2024 08:43:34 +0000 Omer Shpigelman wrote:
> > > You support 400G, you really need to give the user the ability
> > > to access higher pages.  
> > 
> > Actually the 200G and 400G modes in the ethtool code should be removed
> > from this patch set. They are not relevant for Gaudi2. I'll fix it in the
> > next version.
> 
> How do your customers / users check SFP diagnostics?
 
And perform firmware upgrade of the SFPs?

https://lore.kernel.org/netdev/20240619121727.3643161-7-danieller@nvidia.com/T/

	Andrew

Omer Shpigelman June 23, 2024, 6:22 a.m. UTC | #27

On 6/19/24 19:13, Andrew Lunn wrote:
> On Wed, Jun 19, 2024 at 07:16:20AM +0000, Omer Shpigelman wrote:
>> On 6/18/24 17:19, Andrew Lunn wrote:
>>>>>> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
>>>>>> +{
>>>>>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>>>>>> +     struct net_device *ndev = port->ndev;
>>>>>> +     u32 mtu;
>>>>>> +
>>>>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>>>>>> +             netdev_err(ndev, "port is in reset, can't get MTU\n");
>>>>>> +             return 0;
>>>>>> +     }
>>>>>> +
>>>>>> +     mtu = ndev->mtu;
>>>>>
>>>>> I think you need a better error message. All this does is access
>>>>> ndev->mtu. What does it matter if the port is in reset? You don't
>>>>> access it.
>>>>>
>>>>
>>>> This function is called from the CN driver to get the current MTU in order
>>>> to configure it to the HW, for exmaple when configuring an IB QP. The MTU
>>>> value might be changed by user while we execute this function.
>>>
>>> Change of MTU will happen while holding RTNL. Why not simply hold RTNL
>>> while programming the hardware? That is the normal pattern for MAC
>>> drivers.
>>>
>>
>> I can hold the RTNL lock while configuring the HW but it seems like a big
>> overhead. Configuring the HW might take some time due to QP draining or
>> cache invalidation.
> 
> How often does the MTU change? Once, maybe twice on boot, and never
> again? MTU change is not hot path. For slow path code, KISS is much
> better, so it is likely to be correct. 
> 

Yeah, it's not a hot path so I guess we can just return the MTU value
regardless of a parallel reset flow.

>> To me it seems unnecessary but if that's the common way then I'll change
>> it.
>>  
>>>>>> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
>>>>>> +{
>>>>>> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
>>>>>> +     int rc = 0;
>>>>>> +
>>>>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>>>>>> +             netdev_err(netdev, "port is in reset, can't change MTU\n");
>>>>>> +             return -EBUSY;
>>>>>> +     }
>>>>>> +
>>>>>> +     if (netif_running(port->ndev)) {
>>>>>> +             hbl_en_port_close(port);
>>>>>> +
>>>>>> +             /* Sleep in order to let obsolete events to be dropped before re-opening the port */
>>>>>> +             msleep(20);
>>>>>> +
>>>>>> +             netdev->mtu = new_mtu;
>>>>>> +
>>>>>> +             rc = hbl_en_port_open(port);
>>>>>> +             if (rc)
>>>>>> +                     netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
>>>>>
>>>>> Does that mean the port is FUBAR?
>>>>>
>>>>> Most operations like this are expected to roll back to the previous
>>>>> working configuration on failure. So if changing the MTU requires new
>>>>> buffers in your ring, you should first allocate the new buffers, then
>>>>> free the old buffers, so that if allocation fails, you still have
>>>>> buffers, and the device can continue operating.
>>>>>
>>>>
>>>> A failure in opening a port is a fatal error. It shouldn't happen. This is
>>>> not something we wish to recover from.
>>>
>>> What could cause open to fail? Is memory allocated?
>>>
>>
>> Memory is allocated but it is freed in case of a failure.
>> Port opening can fail due to other reasons as well like some HW timeout
>> while configuring the ETH QP.
> 
> If the hardware timeout because the hardware is dead, there is nothing
> you can do about it. Its dead.
> 

In our case the HW might timeout without being dead. Our ETH and RDMA QPs
are being configured in the same path in the HW so it is possible that a
timeout for ETH QP configuration will occur due to many parallel RDMA QPs
configurations and so a simple ETH QP configuration retry will solve it.

> But what about when the system is under memory pressure? You say it
> allocates memory. What happens if those allocations fail. Does
> changing the MTU take me from a working system to a dead system? It is
> good practice to not kill a working system under situations like
> memory pressure. You try to first allocate the memory you need to
> handle the new MTU, and only if successful do you free existing memory
> you no longer need. That means if you cannot allocate the needed
> memory, you still have the old memory, you can keep the old MTU and
> return -ENOMEM, and the system keeps running.
> 

That's a good optimization for these kind of on-the-fly configurations but
as you wrote before, changing an MTU value is not a hot path so out of
cost-benefit considerations we didn't find it mandatory to optimize this
flow.
But let me check this option for the next patch set version.

>> I didn't check that prior to my submit. Regarding this "no new module
>> parameters allowed" rule, is that documented anywhere?
> 
> Lots of emails that fly passed on the mailing list. Maybe once every
> couple of months when a vendor tries to mainline a new driver without
> reading the mailing list for a few months to know how mainline
> actually works. I _guess_ Davem has been pushing back on module
> parameters for 10 years? Maybe more.
> 
>

Ok, I'll just drop it in the next patch set version.
 
>> if not, is that the
>> common practice? not to try to do something that was not done recently?
>> how "recently" is defined?
>> I just want to clarify this because it's hard to handle these submissions
>> when we write some code based on existing examples but then we are
>> rejected because "we don't do that here anymore".
>> I want to avoid future cases of this mismatch.
> 
> My suggestion would be to spend 30 minutes every day reading patches
> and review comment on the mailing list. Avoid making the same mistakes
> others make, especially newbies to mainline, and see what others are
> doing in the same niche as this device. 30 minutes might seem like a
> lot, but how much time did you waste implementing polling mode, now
> you are going to throw it away?
>

I get your point but still it will be good if it would be documented
somewhere IMHO.
 
>>>>>> +                     ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
>>>>>
>>>>> That looks odd. Care to explain?
>>>>>
>>>>
>>>> The HW of all of our ports supports autoneg.
>>>> But in addition, the ports are divided to two groups:
>>>> internal: ports which are connected to other Gaudi2 ports in the same server.
>>>> external: ports which are connected to an external switch.
>>>> Only internal ports use autoneg.
>>>> The ports mask which sets each port as internal/external is fetched from
>>>> the FW on device load.
>>>
>>> That is not what i meant. lc_advertising should indicate the link
>>> modes the peer is advertising. If this was a copper link, it typically
>>> would contain 10BaseT-Half, 10BaseT-Full, 100BaseT-Half,
>>> 100BaseT-Full, 1000BaseT-Half. Setting the Autoneg bit is pointless,
>>> since the peer must be advertising in order that lp_advertising has a
>>> value!
>>>
>>
>> Sorry, but I don't get this. The problem is the setting of the Autoneg bit
>> in lp_advertising? is that redundant? I see that other vendors set it too
>> in case that Autoneg was completed.
> 
> 
> $ ethtool eth0
> Settings for eth0:
> 	Supported ports: [ TP	 MII ]
> 	Supported link modes:   10baseT/Half 10baseT/Full
> 	                        100baseT/Half 100baseT/Full
> 	                        1000baseT/Full
> 
> This is `supported`. The hardware can do these link modes.
> 
> 	Supported pause frame use: Symmetric Receive-only
> 	Supports auto-negotiation: Yes
> 
> It also support symmetric pause, and can do autoneg.
> 
> 	Supported FEC modes: Not reported
> 	Advertised link modes:  10baseT/Half 10baseT/Full
> 	                        100baseT/Half 100baseT/Full
> 	                        1000baseT/Full
> 	Advertised pause frame use: Symmetric Receive-only
> 	Advertised auto-negotiation: Yes
> 	Advertised FEC modes: Not reported
> 
> This is `advertising`, and is what this device is advertising to the
> link partner. By default you copy supported into advertising, but the
> user can use ethtool -s advertise N, where N is a list of link modes,
> to change what is advertised to the link partner.
> 
> 	Link partner advertised link modes:  10baseT/Half 10baseT/Full
> 	                                     100baseT/Half 100baseT/Full
> 	                                     1000baseT/Full
> 	Link partner advertised pause frame use: Symmetric
> 	Link partner advertised auto-negotiation: Yes
> 	Link partner advertised FEC modes: Not reported
> 
> This is `lp_advertising`, what the link partner is advertising to this
> device. Once you have this, you mask lp_advertising with advertising,
> and generally pick the link mode with the highest bandwidth:
> 
> 	Speed: 1000Mb/s
> 	Duplex: Full
> 
> So autoneg resolved to 1000baseT/Full
> 
> 	Andrew

I'm familiar with this logic but I don't understand your point. The point
you are making is that setting this Autoneg bit in lp_advertising is
pointless? I see other vendors setting it too in case that autoneg was
completed.
Is that redundant also in their case? because it looks to me that in this
case we followed the same logic and conventions other vendors followed.

Andrew Lunn June 23, 2024, 2:46 p.m. UTC | #28

> > But what about when the system is under memory pressure? You say it
> > allocates memory. What happens if those allocations fail. Does
> > changing the MTU take me from a working system to a dead system? It is
> > good practice to not kill a working system under situations like
> > memory pressure. You try to first allocate the memory you need to
> > handle the new MTU, and only if successful do you free existing memory
> > you no longer need. That means if you cannot allocate the needed
> > memory, you still have the old memory, you can keep the old MTU and
> > return -ENOMEM, and the system keeps running.
> > 
> 
> That's a good optimization for these kind of on-the-fly configurations but
> as you wrote before, changing an MTU value is not a hot path so out of
> cost-benefit considerations we didn't find it mandatory to optimize this
> flow.

I would not call this an optimization. And it is not just about
changing the MTU. ethtool set_ringparam() is also likely to run into
this problem, and any other configuration which requires reallocating
the rings.

This is something else which comes up every few months on the list,
and drivers writers who monitor the list will write their drivers that
why, not 'optimise' it later.

> I get your point but still it will be good if it would be documented
> somewhere IMHO.

Kernel documentation is poor, agreed. But kernel policy is also
somewhat fluid, best practices change, and any developers can
influence that policy, different subsystems can and do have
contradictory policy, etc. The mailing list is the best place to learn
and to take part in this community. You need to be on the list for
other reasons as well.

> I'm familiar with this logic but I don't understand your point. The point
> you are making is that setting this Autoneg bit in lp_advertising is
> pointless? I see other vendors setting it too in case that autoneg was
> completed.
> Is that redundant also in their case? because it looks to me that in this
> case we followed the same logic and conventions other vendors followed.

Please show us the output from ethtool. Does it look like the example
i showed? I must admit, i'm more from the embedded world and don't
have access to high speed interfaces. But the basic concept of
auto-neg should not change that much.

	 Andrew

Omer Shpigelman June 23, 2024, 2:48 p.m. UTC | #29

On 6/20/24 22:14, Andrew Lunn wrote:
> On Thu, Jun 20, 2024 at 06:51:35AM -0700, Jakub Kicinski wrote:
>> On Thu, 20 Jun 2024 08:43:34 +0000 Omer Shpigelman wrote:
>>>> You support 400G, you really need to give the user the ability
>>>> to access higher pages.  
>>>
>>> Actually the 200G and 400G modes in the ethtool code should be removed
>>> from this patch set. They are not relevant for Gaudi2. I'll fix it in the
>>> next version.
>>
>> How do your customers / users check SFP diagnostics?
>  
> And perform firmware upgrade of the SFPs?
> 
> https://lore.kernel.org/netdev/20240619121727.3643161-7-danieller@nvidia.com/T/
> 
> 	Andrew
> 

Via OAM I2C Master.

Omer Shpigelman June 26, 2024, 10:13 a.m. UTC | #30

On 6/23/24 17:46, Andrew Lunn wrote:
>>> But what about when the system is under memory pressure? You say it
>>> allocates memory. What happens if those allocations fail. Does
>>> changing the MTU take me from a working system to a dead system? It is
>>> good practice to not kill a working system under situations like
>>> memory pressure. You try to first allocate the memory you need to
>>> handle the new MTU, and only if successful do you free existing memory
>>> you no longer need. That means if you cannot allocate the needed
>>> memory, you still have the old memory, you can keep the old MTU and
>>> return -ENOMEM, and the system keeps running.
>>>
>>
>> That's a good optimization for these kind of on-the-fly configurations but
>> as you wrote before, changing an MTU value is not a hot path so out of
>> cost-benefit considerations we didn't find it mandatory to optimize this
>> flow.
> 
> I would not call this an optimization. And it is not just about
> changing the MTU. ethtool set_ringparam() is also likely to run into
> this problem, and any other configuration which requires reallocating
> the rings.
> 
> This is something else which comes up every few months on the list,
> and drivers writers who monitor the list will write their drivers that
> why, not 'optimise' it later.
> 

Actually I was wrong, we don't allocate memory in this port reset flow, we
only reset the rings. But I get your point, it makes sense.

>> I get your point but still it will be good if it would be documented
>> somewhere IMHO.
> 
> Kernel documentation is poor, agreed. But kernel policy is also
> somewhat fluid, best practices change, and any developers can
> influence that policy, different subsystems can and do have
> contradictory policy, etc. The mailing list is the best place to learn
> and to take part in this community. You need to be on the list for
> other reasons as well.
> 

Ok, got it.

>> I'm familiar with this logic but I don't understand your point. The point
>> you are making is that setting this Autoneg bit in lp_advertising is
>> pointless? I see other vendors setting it too in case that autoneg was
>> completed.
>> Is that redundant also in their case? because it looks to me that in this
>> case we followed the same logic and conventions other vendors followed.
> 
> Please show us the output from ethtool. Does it look like the example
> i showed? I must admit, i'm more from the embedded world and don't
> have access to high speed interfaces. But the basic concept of
> auto-neg should not change that much.
> 
> 	 Andrew

Here is the output:
$ ethtool eth0
Settings for eth0:
	Supported ports: [ FIBRE	 Backplane ]
	Supported link modes:   100000baseKR4/Full
	                        100000baseSR4/Full
	                        100000baseCR4/Full
	                        100000baseLR4_ER4/Full
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  100000baseKR4/Full
	                        100000baseSR4/Full
	                        100000baseCR4/Full
	                        100000baseLR4_ER4/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Link partner advertised link modes:  Not reported
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Link partner advertised FEC modes: Not reported
	Speed: 100000Mb/s
	Duplex: Full
	Auto-negotiation: on

There are few points to mention:
1. We don't allow to modify the advertised link modes so by definition the
   advertised ones are a copy of the supported ones.
2. Reading the peer advertised link modes is not supported so we don't
   report them (similarly to some other vendors).
3. Our speed is fixed and also cannot be changed so we don't mask
   lp_advertising with advertising to pick the highest speed. We aim for a
   specific speed and hence it's binary - or we'll have a link with that
   specific speed or we won't have a link at all.
4. If we support autoneg and it was completed, we can conclude that also
   our peer supports autoneg and hence we report that.

Andrew Lunn June 26, 2024, 2:13 p.m. UTC | #31

> Here is the output:
> $ ethtool eth0
> Settings for eth0:
> 	Supported ports: [ FIBRE	 Backplane ]
> 	Supported link modes:   100000baseKR4/Full
> 	                        100000baseSR4/Full
> 	                        100000baseCR4/Full
> 	                        100000baseLR4_ER4/Full
> 	Supported pause frame use: Symmetric
> 	Supports auto-negotiation: Yes
> 	Supported FEC modes: Not reported
> 	Advertised link modes:  100000baseKR4/Full
> 	                        100000baseSR4/Full
> 	                        100000baseCR4/Full
> 	                        100000baseLR4_ER4/Full
> 	Advertised pause frame use: Symmetric
> 	Advertised auto-negotiation: Yes
> 	Advertised FEC modes: Not reported
> 	Link partner advertised link modes:  Not reported
> 	Link partner advertised pause frame use: No
> 	Link partner advertised auto-negotiation: Yes
> 	Link partner advertised FEC modes: Not reported
> 	Speed: 100000Mb/s
> 	Duplex: Full
> 	Auto-negotiation: on
> 
> There are few points to mention:
> 1. We don't allow to modify the advertised link modes so by definition the
>    advertised ones are a copy of the supported ones.

So there is no way to ask it use to use 100000baseCR4/Full, for
example? You would normally change the advertised modes to just that
one link mode, and then it has no choice. It either uses
100000baseCR4/Full, or it does not establish a link.

Also, my experience with slower modules is that one supporting
2500BaseX can also support 1000BaseX. However, there is no auto-neg
defined for speeds, just pause. So if the link peer only supports
1000BaseX, you don't get link. What you typically see is:

$ ethtool eth0
Settings for eth0:
 	Supported ports: [ FIBRE	 Backplane ]
 	Supported link modes:   1000baseX
 	                        2500baseX
 	Supported pause frame use: Symmetric
 	Supports auto-negotiation: Yes
 	Supported FEC modes: Not reported
 	Advertised link modes:  2500baseX
 	Advertised pause frame use: Symmetric

and then you use ethtool to change advertising to 1000baseX and then
you get link. Can these modules support slower speeds?

> 2. Reading the peer advertised link modes is not supported so we don't
>    report them (similarly to some other vendors).

Not supported by your firmware? Or not supported by the modules?

    Andrew

Omer Shpigelman June 30, 2024, 7:11 a.m. UTC | #32

On 6/26/24 17:13, Andrew Lunn wrote:
>> Here is the output:
>> $ ethtool eth0
>> Settings for eth0:
>> 	Supported ports: [ FIBRE	 Backplane ]
>> 	Supported link modes:   100000baseKR4/Full
>> 	                        100000baseSR4/Full
>> 	                        100000baseCR4/Full
>> 	                        100000baseLR4_ER4/Full
>> 	Supported pause frame use: Symmetric
>> 	Supports auto-negotiation: Yes
>> 	Supported FEC modes: Not reported
>> 	Advertised link modes:  100000baseKR4/Full
>> 	                        100000baseSR4/Full
>> 	                        100000baseCR4/Full
>> 	                        100000baseLR4_ER4/Full
>> 	Advertised pause frame use: Symmetric
>> 	Advertised auto-negotiation: Yes
>> 	Advertised FEC modes: Not reported
>> 	Link partner advertised link modes:  Not reported
>> 	Link partner advertised pause frame use: No
>> 	Link partner advertised auto-negotiation: Yes
>> 	Link partner advertised FEC modes: Not reported
>> 	Speed: 100000Mb/s
>> 	Duplex: Full
>> 	Auto-negotiation: on
>>
>> There are few points to mention:
>> 1. We don't allow to modify the advertised link modes so by definition the
>>    advertised ones are a copy of the supported ones.
> 
> So there is no way to ask it use to use 100000baseCR4/Full, for
> example? You would normally change the advertised modes to just that
> one link mode, and then it has no choice. It either uses
> 100000baseCR4/Full, or it does not establish a link.
> 

No, our FW doesn't support it as we have no use case for that.

> Also, my experience with slower modules is that one supporting
> 2500BaseX can also support 1000BaseX. However, there is no auto-neg
> defined for speeds, just pause. So if the link peer only supports
> 1000BaseX, you don't get link. What you typically see is:
> 
> $ ethtool eth0
> Settings for eth0:
>  	Supported ports: [ FIBRE	 Backplane ]
>  	Supported link modes:   1000baseX
>  	                        2500baseX
>  	Supported pause frame use: Symmetric
>  	Supports auto-negotiation: Yes
>  	Supported FEC modes: Not reported
>  	Advertised link modes:  2500baseX
>  	Advertised pause frame use: Symmetric
> 
> and then you use ethtool to change advertising to 1000baseX and then
> you get link. Can these modules support slower speeds?
> 

No, we support a single speed.

>> 2. Reading the peer advertised link modes is not supported so we don't
>>    report them (similarly to some other vendors).
> 
> Not supported by your firmware? Or not supported by the modules?
> 

Let me explain it better - Gaudi2 is not a general purpose Ethernet NIC.
Its goal is to support any Ethernet traffic that is needed for enabling
the scaling of AI neural networks training as part of HLS2 server:
https://www.intel.com/content/www/us/en/content-details/784778/hls-gaudi-2-deep-learning-server-datasheet.html

Hence, in contrary to a general purpose Ethernet NIC, it is well known who
is our peer and what are its capabilities - it is a Gaudi2 NIC or a
switch.
Technically we can read the advertised link partner modes but we had no
demand for that because the driver and the user are well aware of who is
on the other side.
Reading it from the FW will be the same as having it hard coded because
the value is already known (otherwise we won't have a link). I can add it
to lp_advertising if necessary although per my check most vendors don't
report it either.

>     Andrew

[09/15] net: hbl_en: add habanalabs Ethernet driver

Commit Message

Comments

Patch