
[net-next] bond: add mac filter option for balance-xor

Message ID b03f0896e1a0b65cc1b278aecc9d080b2ec9d8a6.1648136359.git.jtoppins@redhat.com (mailing list archive)
State Deferred
Delegated to: Netdev Maintainers
Series [net-next] bond: add mac filter option for balance-xor

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4797 this patch: 4797
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 793 this patch: 793
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4953 this patch: 4953
netdev/checkpatch warning CHECK: braces {} should be used on all arms of this statement WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 142 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline fail Was 0 now: 2

Commit Message

Jonathan Toppins March 24, 2022, 3:54 p.m. UTC
Attempt to replicate the OvS SLB Bonding[1] feature by preventing
duplicate frame delivery on a bond whose members are connected to
physically different switches.

Combining this feature with vlan+srcmac hash policy allows a user to
create an access network without the need to use expensive switches that
support features like Cisco's VPC.

[1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding

Co-developed-by: Long Xin <lxin@redhat.com>
Signed-off-by: Long Xin <lxin@redhat.com>
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
---
 Documentation/networking/bonding.rst  |  19 +++
 drivers/net/bonding/Makefile          |   2 +-
 drivers/net/bonding/bond_mac_filter.c | 222 ++++++++++++++++++++++++++
 drivers/net/bonding/bond_mac_filter.h |  40 +++++
 drivers/net/bonding/bond_main.c       |  26 +++
 drivers/net/bonding/bond_netlink.c    |  13 ++
 drivers/net/bonding/bond_options.c    |  58 ++++++-
 drivers/net/bonding/bonding_priv.h    |   1 +
 include/net/bond_options.h            |   1 +
 include/net/bonding.h                 |   3 +
 include/uapi/linux/if_link.h          |   1 +
 11 files changed, 384 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/bonding/bond_mac_filter.c
 create mode 100644 drivers/net/bonding/bond_mac_filter.h

Comments

Jakub Kicinski March 24, 2022, 11:03 p.m. UTC | #1
On Thu, 24 Mar 2022 11:54:41 -0400 Jonathan Toppins wrote:
> Attempt to replicate the OvS SLB Bonding[1] feature by preventing
> duplicate frame delivery on a bond whose members are connected to
> physically different switches.
> 
> Combining this feature with vlan+srcmac hash policy allows a user to
> create an access network without the need to use expensive switches that
> support features like Cisco's VPC.

# Form letter - net-next is closed

We have already sent the networking pull request for 5.18
and therefore net-next is closed for new drivers, features,
code refactoring and optimizations. We are currently accepting
bug fixes only.

Please repost when net-next reopens after 5.18-rc1 is cut.

RFC patches sent for review only are obviously welcome at any time.
Jay Vosburgh March 25, 2022, 12:50 a.m. UTC | #2
Considering this as an RFC given that net-next is closed...
	
Jonathan Toppins <jtoppins@redhat.com> wrote:

>Attempt to replicate the OvS SLB Bonding[1] feature by preventing
>duplicate frame delivery on a bond whos members are connected to
>physically different switches.
>
>Combining this feature with vlan+srcmac hash policy allows a user to
>create an access network without the need to use expensive switches that
>support features like Cisco's VPC.

	Could you describe this use case / implementation in a bit more
detail?  I.e., how exactly that configuration works.  I don't think this
patch is replicating everything in the OVS SLB documentation.

>[1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding
>
>Co-developed-by: Long Xin <lxin@redhat.com>
>Signed-off-by: Long Xin <lxin@redhat.com>
>Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
>---
> Documentation/networking/bonding.rst  |  19 +++
> drivers/net/bonding/Makefile          |   2 +-
> drivers/net/bonding/bond_mac_filter.c | 222 ++++++++++++++++++++++++++
> drivers/net/bonding/bond_mac_filter.h |  40 +++++
> drivers/net/bonding/bond_main.c       |  26 +++
> drivers/net/bonding/bond_netlink.c    |  13 ++
> drivers/net/bonding/bond_options.c    |  58 ++++++-
> drivers/net/bonding/bonding_priv.h    |   1 +
> include/net/bond_options.h            |   1 +
> include/net/bonding.h                 |   3 +
> include/uapi/linux/if_link.h          |   1 +
> 11 files changed, 384 insertions(+), 2 deletions(-)
> create mode 100644 drivers/net/bonding/bond_mac_filter.c
> create mode 100644 drivers/net/bonding/bond_mac_filter.h
>
>diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>index 525e6842dd33..a5a1669d3efe 100644
>--- a/Documentation/networking/bonding.rst
>+++ b/Documentation/networking/bonding.rst
>@@ -550,6 +550,25 @@ lacp_rate
> 
> 	The default is slow.
> 
>+mac_filter
>+
>+	Tells the bonding device to drop frames received whose source MAC
>+	address	matches entries in a filter table. The filter table is
>+	populated when the bond transmits frames. This is similar in
>+	concept to the MAC learning table implemented in the bridge code.
>+
>+	This filtering is only enabled for the balance-xor bonding mode.
>+
>+	off or 0
>+		Turns the feature off
>+
>+	number
>+		A number greater than zero turns the feature on and sets
>+		the maximum number of MAC addresses to store in the hash
>+		table.
>+
>+	The default is off.
>+
> max_bonds
> 
> 	Specifies the number of bonding devices to create for this
>diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
>index 30e8ae3da2da..5dbc360a8522 100644
>--- a/drivers/net/bonding/Makefile
>+++ b/drivers/net/bonding/Makefile
>@@ -5,7 +5,7 @@
> 
> obj-$(CONFIG_BONDING) += bonding.o
> 
>-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
>+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o bond_mac_filter.o
> 
> proc-$(CONFIG_PROC_FS) += bond_procfs.o
> bonding-objs += $(proc-y)
>diff --git a/drivers/net/bonding/bond_mac_filter.c b/drivers/net/bonding/bond_mac_filter.c
>new file mode 100644
>index 000000000000..a16a1a000f05
>--- /dev/null
>+++ b/drivers/net/bonding/bond_mac_filter.c
>@@ -0,0 +1,222 @@
>+// SPDX-License-Identifier: GPL-2.0-only
>+/*
>+ * Filter received frames based on MAC addresses "behind" the bond.
>+ */
>+
>+#include "bonding_priv.h"
>+
>+/* -------------- Cache Management -------------- */

	I don't think this header adds anything, given that there's not
really a lot in the section.

>+static struct kmem_cache *bond_mac_cache __read_mostly;
>+
>+int __init bond_mac_cache_init(void)
>+{
>+	bond_mac_cache = kmem_cache_create("bond_mac_cache",
>+					   sizeof(struct bond_mac_cache_entry),
>+					   0, SLAB_HWCACHE_ALIGN, NULL);
>+	if (!bond_mac_cache)
>+		return -ENOMEM;
>+	return 0;
>+}
>+
>+void bond_mac_cache_destroy(void)
>+{
>+	kmem_cache_destroy(bond_mac_cache);
>+}

	There are a lot of the above sort of wrapper functions that are
only ever called once.  Some of these, e.g., mac_delete, below, I agree
with, as the call site is nested fairly deep and the function is
non-trivial; or, mac_delete_rcu, which is used as a callback.

	The above two, though, I don't see a justification for, along
with hold_time and maybe a couple others, below.  In my opinion,
over-abstracting these trivial things with one call site makes the code
harder to follow.

>+
>+/* -------------- Hash Table Management -------------- */
>+
>+static const struct rhashtable_params bond_rht_params = {
>+	.head_offset         = offsetof(struct bond_mac_cache_entry, rhnode),
>+	.key_offset          = offsetof(struct bond_mac_cache_entry, key),
>+	.key_len             = sizeof(struct mac_addr),
>+	.automatic_shrinking = true,
>+};
>+
>+static inline unsigned long hold_time(const struct bonding *bond)
>+{
>+	return msecs_to_jiffies(5000);
>+}

	This shouldn't be a magic number, and if it's an important
timeout, should it be configurable?

>+
>+static bool has_expired(const struct bonding *bond,
>+			struct bond_mac_cache_entry *mac)
>+{
>+	return time_before_eq(mac->used + hold_time(bond), jiffies);
>+}
>+
>+static void mac_delete_rcu(struct callback_head *head)
>+{
>+	kmem_cache_free(bond_mac_cache,
>+			container_of(head, struct bond_mac_cache_entry, rcu));
>+}
>+
>+static int mac_delete(struct bonding *bond,
>+		      struct bond_mac_cache_entry *entry)
>+{
>+	int ret;
>+
>+	ret = rhashtable_remove_fast(bond->mac_filter_tbl,
>+				     &entry->rhnode,
>+				     bond->mac_filter_tbl->p);
>+	set_bit(BOND_MAC_DEAD, &entry->flags);
>+	call_rcu(&entry->rcu, mac_delete_rcu);
>+	return ret;
>+}
>+
>+void bond_mac_hash_release_entries(struct work_struct *work)
>+{
>+	struct bonding *bond = container_of(work, struct bonding,
>+				mac_work.work);
>+	struct rhashtable_iter iter;
>+	struct bond_mac_cache_entry *entry;
>+	unsigned long flags;
>+
>+	rhashtable_walk_enter(bond->mac_filter_tbl, &iter);
>+	rhashtable_walk_start(&iter);
>+	while ((entry = rhashtable_walk_next(&iter)) != NULL) {
>+		if (IS_ERR(entry))
>+			continue;
>+
>+		spin_lock_irqsave(&entry->lock, flags);
>+		if (has_expired(bond, entry))
>+			mac_delete(bond, entry);
>+		spin_unlock_irqrestore(&entry->lock, flags);
>+	}
>+	rhashtable_walk_stop(&iter);
>+	rhashtable_walk_exit(&iter);
>+	queue_delayed_work(bond->wq, &bond->mac_work,
>+			   msecs_to_jiffies(5 * 60 * 1000));
>+}
>+
>+int bond_mac_hash_init(struct bonding *bond)
>+{
>+	int rc;

	As a point of style, (almost) everywhere else in bonding uses
"ret" for a return value.  The exceptions are largely my doing, but,
still, it'd be nice to be mostly consistent in nomenclature.

>+
>+	netdev_dbg(bond->dev, "mac_filter: alloc memory for hash table\n");
>+	bond->mac_filter_tbl = kzalloc(sizeof(*bond->mac_filter_tbl),
>+				       GFP_KERNEL);
>+	if (!bond->mac_filter_tbl)
>+		return -ENOMEM;
>+
>+	rc = rhashtable_init(bond->mac_filter_tbl, &bond_rht_params);
>+	if (rc)
>+		kfree(bond->mac_filter_tbl);
>+
>+	return rc;
>+}
>+
>+static void bond_mac_free_entry(void *entry, void *ctx)
>+{
>+	kmem_cache_free((struct kmem_cache *)ctx, entry);
>+}
>+
>+void bond_mac_hash_destroy(struct bonding *bond)
>+{
>+	if (bond->mac_filter_tbl) {
>+		rhashtable_free_and_destroy(bond->mac_filter_tbl,
>+					    bond_mac_free_entry,
>+					    bond_mac_cache);
>+		kfree(bond->mac_filter_tbl);
>+		bond->mac_filter_tbl = NULL;
>+	}
>+}
>+
>+static int mac_create(struct bonding *bond, const u8 *addr)
>+{
>+	struct bond_mac_cache_entry *entry;
>+	int ret;
>+
>+	entry = kmem_cache_alloc(bond_mac_cache, GFP_ATOMIC);
>+	if (!entry)
>+		return -ENOMEM;
>+	spin_lock_init(&entry->lock);
>+	memcpy(&entry->key, addr, sizeof(entry->key));
>+	entry->used = jiffies;
>+	ret = rhashtable_lookup_insert_fast(bond->mac_filter_tbl,
>+					    &entry->rhnode,
>+					    bond->mac_filter_tbl->p);
>+	if (ret) {
>+		kmem_cache_free(bond_mac_cache, entry);
>+		entry = NULL;
>+		if (ret == -EEXIST)
>+			return 0;
>+		pr_err_once("Failed to insert mac entry %d\n", ret);
>+	}
>+	return ret;
>+}
>+
>+static struct bond_mac_cache_entry *mac_find(struct bonding *bond,
>+					     const u8 *addr)
>+{
>+	struct mac_addr key;
>+
>+	memcpy(&key, addr, sizeof(key));
>+	return rhashtable_lookup(bond->mac_filter_tbl, &key,
>+				 bond->mac_filter_tbl->p);
>+}
>+
>+inline void mac_update(struct bond_mac_cache_entry *entry)
>+{
>+	entry->used = jiffies;
>+}
>+
>+int bond_mac_insert(struct bonding *bond, const u8 *addr)
>+{
>+	struct bond_mac_cache_entry *entry;
>+	int rc = 0;
>+
>+	if (!is_valid_ether_addr(addr))
>+		return -EINVAL;
>+
>+	rcu_read_lock();
>+	entry = mac_find(bond, addr);
>+	if (entry) {
>+		unsigned long flags;
>+
>+		spin_lock_irqsave(&entry->lock, flags);
>+		if (!test_bit(BOND_MAC_DEAD, &entry->flags)) {
>+			mac_update(entry);
>+			spin_unlock_irqrestore(&entry->lock, flags);
>+			goto out;
>+		}
>+		spin_unlock_irqrestore(&entry->lock, flags);

	This seems really expensive, as it will add a spin_lock_irqsave
round trip for almost every packet transmitted when mac_filter is
enabled (as this will be called by bond_xmit_3ad_xor_slave_get).

>+	}
>+
>+	rc = mac_create(bond, addr);
>+
>+out:
>+	rcu_read_unlock();
>+	return rc;
>+}
>+
>+int bond_xor_recv(const struct sk_buff *skb, struct bonding *bond,
>+		  struct slave *slave)
>+{
>+	const struct ethhdr *mac_hdr;
>+	struct bond_mac_cache_entry *entry;
>+	int rc = RX_HANDLER_PASS;
>+
>+	mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>+	rcu_read_lock();
>+	if (is_multicast_ether_addr(mac_hdr->h_dest) &&
>+	    slave != rcu_dereference(bond->curr_active_slave)) {
>+		rc = RX_HANDLER_CONSUMED;
>+		goto out;
>+	}
>+
>+	entry = mac_find(bond, mac_hdr->h_source);
>+	if (entry) {
>+		unsigned long flags;
>+		bool expired;
>+
>+		spin_lock_irqsave(&entry->lock, flags);
>+		expired = has_expired(bond, entry);
>+		spin_unlock_irqrestore(&entry->lock, flags);
>+		if (!expired)
>+			rc = RX_HANDLER_CONSUMED;
>+	}

	As above, really expensive, except for incoming packets here
(since this is called as the recv_probe).

>+
>+out:
>+	rcu_read_unlock();
>+	return rc;
>+}
>diff --git a/drivers/net/bonding/bond_mac_filter.h b/drivers/net/bonding/bond_mac_filter.h
>new file mode 100644
>index 000000000000..0cfcc5653e7e
>--- /dev/null
>+++ b/drivers/net/bonding/bond_mac_filter.h
>@@ -0,0 +1,40 @@
>+/* SPDX-License-Identifier: GPL-2.0-only
>+ *
>+ * Filter received frames based on MAC addresses "behind" the bond.
>+ */
>+
>+#ifndef _BOND_MAC_FILTER_H
>+#define _BOND_MAC_FILTER_H
>+#include <net/bonding.h>
>+#include <linux/spinlock.h>
>+#include <linux/rhashtable.h>
>+
>+enum {
>+	BOND_MAC_DEAD,
>+	BOND_MAC_LOCKED,
>+	BOND_MAC_STATIC,
>+};
>+
>+struct bond_mac_cache_entry {
>+	struct rhash_head	rhnode;
>+	struct mac_addr		key;
>+
>+	spinlock_t		lock; /* protects used member */
>+	unsigned long		flags;
>+	unsigned long		used;
>+	struct rcu_head		rcu;
>+};
>+
>+int __init bond_mac_cache_init(void);
>+void bond_mac_cache_destroy(void);
>+
>+void bond_mac_hash_release_entries(struct work_struct *work);
>+int bond_mac_hash_init(struct bonding *bond);
>+void bond_mac_hash_destroy(struct bonding *bond);
>+
>+int bond_mac_insert(struct bonding *bond, const u8 *addr);
>+int bond_xor_recv(const struct sk_buff *skb,
>+		  struct bonding *bond,
>+		  struct slave *slave);
>+
>+#endif
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 15eddca7b4b6..f5a8a50e9116 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1549,6 +1549,10 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
> 		return RX_HANDLER_EXACT;
> 	}
> 
>+	/* this function should not rely on the recv_probe to set ret
>+	 * correctly
>+	 */
>+	ret = RX_HANDLER_ANOTHER;

	This change is overriding the return from a recv_probe added by
this patch (bond_xor_recv can return RX_HANDLER_PASS).  Why?

	Also, I don't agree with the comment; the recv_probe return
value by design affects the return value from bond_handle_frame.

> 	skb->dev = bond->dev;
> 
> 	if (BOND_MODE(bond) == BOND_MODE_ALB &&
>@@ -4117,6 +4121,7 @@ void bond_work_init_all(struct bonding *bond)
> 	INIT_DELAYED_WORK(&bond->arp_work, bond_arp_monitor);
> 	INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
> 	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
>+	INIT_DELAYED_WORK(&bond->mac_work, bond_mac_hash_release_entries);
> }
> 
> static void bond_work_cancel_all(struct bonding *bond)
>@@ -4127,6 +4132,7 @@ static void bond_work_cancel_all(struct bonding *bond)
> 	cancel_delayed_work_sync(&bond->ad_work);
> 	cancel_delayed_work_sync(&bond->mcast_work);
> 	cancel_delayed_work_sync(&bond->slave_arr_work);
>+	cancel_delayed_work_sync(&bond->mac_work);
> }
> 
> static int bond_open(struct net_device *bond_dev)
>@@ -4174,6 +4180,11 @@ static int bond_open(struct net_device *bond_dev)
> 		bond_3ad_initiate_agg_selection(bond, 1);
> 	}
> 
>+	if (bond->params.mac_filter) {
>+		bond->recv_probe = bond_xor_recv;
>+		queue_delayed_work(bond->wq, &bond->mac_work, 0);
>+	}
>+
> 	if (bond_mode_can_use_xmit_hash(bond))
> 		bond_update_slave_arr(bond, NULL);
> 
>@@ -5043,6 +5054,13 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
> 	if (unlikely(!count))
> 		return NULL;
> 
>+	if (bond->params.mac_filter) {
>+		const struct ethhdr *mac_hdr;
>+
>+		mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>+		if (bond_mac_insert(bond, mac_hdr->h_source))
>+			return NULL;
>+	}
> 	slave = slaves->arr[hash % count];
> 	return slave;
> }
>@@ -5660,6 +5678,8 @@ static void bond_destructor(struct net_device *bond_dev)
> 
> 	if (bond->rr_tx_counter)
> 		free_percpu(bond->rr_tx_counter);
>+
>+	bond_mac_hash_destroy(bond);
> }
> 
> void bond_setup(struct net_device *bond_dev)
>@@ -6115,6 +6135,7 @@ static int bond_check_params(struct bond_params *params)
> 	params->downdelay = downdelay;
> 	params->peer_notif_delay = 0;
> 	params->use_carrier = use_carrier;
>+	params->mac_filter = 0;
> 	params->lacp_active = 1;
> 	params->lacp_fast = lacp_fast;
> 	params->primary[0] = 0;
>@@ -6317,6 +6338,10 @@ static int __init bonding_init(void)
> 			goto err;
> 	}
> 
>+	res = bond_mac_cache_init();
>+	if (res)
>+		goto err;
>+
> 	skb_flow_dissector_init(&flow_keys_bonding,
> 				flow_keys_bonding_keys,
> 				ARRAY_SIZE(flow_keys_bonding_keys));
>@@ -6346,6 +6371,7 @@ static void __exit bonding_exit(void)
> 	/* Make sure we don't have an imbalance on our netpoll blocking */
> 	WARN_ON(atomic_read(&netpoll_block_tx));
> #endif
>+	bond_mac_cache_destroy();
> }
> 
> module_init(bonding_init);
>diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
>index f427fa1737c7..249d79b6e21a 100644
>--- a/drivers/net/bonding/bond_netlink.c
>+++ b/drivers/net/bonding/bond_netlink.c
>@@ -113,6 +113,7 @@ static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
> 	[IFLA_BOND_PEER_NOTIF_DELAY]    = { .type = NLA_U32 },
> 	[IFLA_BOND_MISSED_MAX]		= { .type = NLA_U8 },
> 	[IFLA_BOND_NS_IP6_TARGET]	= { .type = NLA_NESTED },
>+	[IFLA_BOND_MAC_FILTER]		= { .type = NLA_U8 },
> };
> 
> static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = {
>@@ -498,6 +499,14 @@ static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[],
> 		if (err)
> 			return err;
> 	}
>+	if (data[IFLA_BOND_MAC_FILTER]) {
>+		u8 mac_filter = nla_get_u8(data[IFLA_BOND_MAC_FILTER]);
>+
>+		bond_opt_initval(&newval, mac_filter);
>+		err = __bond_opt_set(bond, BOND_OPT_MAC_FILTER, &newval);
>+		if (err)
>+			return err;
>+	}
> 
> 	return 0;
> }
>@@ -565,6 +574,7 @@ static size_t bond_get_size(const struct net_device *bond_dev)
> 						/* IFLA_BOND_NS_IP6_TARGET */
> 		nla_total_size(sizeof(struct nlattr)) +
> 		nla_total_size(sizeof(struct in6_addr)) * BOND_MAX_NS_TARGETS +
>+		nla_total_size(sizeof(u8)) +	/* IFLA_BOND_MAC_FILTER */
> 		0;
> }
> 
>@@ -723,6 +733,9 @@ static int bond_fill_info(struct sk_buff *skb,
> 	if (nla_put_u8(skb, IFLA_BOND_MISSED_MAX,
> 		       bond->params.missed_max))
> 		goto nla_put_failure;
>+	if (nla_put_u8(skb, IFLA_BOND_MAC_FILTER,
>+		       bond->params.mac_filter))
>+		goto nla_put_failure;
> 
> 	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
> 		struct ad_info info;
>diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>index 64f7db2627ce..0f6036ff7b86 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -15,6 +15,7 @@
> #include <linux/sched/signal.h>
> 
> #include <net/bonding.h>
>+#include "bonding_priv.h"
> 
> static int bond_option_active_slave_set(struct bonding *bond,
> 					const struct bond_opt_value *newval);
>@@ -84,6 +85,8 @@ static int bond_option_ad_user_port_key_set(struct bonding *bond,
> 					    const struct bond_opt_value *newval);
> static int bond_option_missed_max_set(struct bonding *bond,
> 				      const struct bond_opt_value *newval);
>+static int bond_option_mac_filter_set(struct bonding *bond,
>+				      const struct bond_opt_value *newval);
> 
> 
> static const struct bond_opt_value bond_mode_tbl[] = {
>@@ -226,6 +229,12 @@ static const struct bond_opt_value bond_missed_max_tbl[] = {
> 	{ NULL,		-1,	0},
> };
> 
>+static const struct bond_opt_value bond_mac_filter_tbl[] = {
>+	{ "off",	0,	BOND_VALFLAG_MIN | BOND_VALFLAG_DEFAULT},
>+	{ "maxval",	18,	BOND_VALFLAG_MAX},

	What's the magic number 18?

>+	{ NULL,		-1,	0}
>+};
>+
> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
> 	[BOND_OPT_MODE] = {
> 		.id = BOND_OPT_MODE,
>@@ -482,7 +491,16 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
> 		.desc = "Delay between each peer notification on failover event, in milliseconds",
> 		.values = bond_intmax_tbl,
> 		.set = bond_option_peer_notif_delay_set
>-	}
>+	},
>+	[BOND_OPT_MAC_FILTER] = {
>+		.id = BOND_OPT_MAC_FILTER,
>+		.name = "mac_filter",
>+		.unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_XOR)),
>+		.desc = "filter received frames based on MAC addresses that have transmitted from the bond, number of MAC addresses to track",
>+		.flags = BOND_OPTFLAG_NOSLAVES | BOND_OPTFLAG_IFDOWN,
>+		.values = bond_mac_filter_tbl,
>+		.set = bond_option_mac_filter_set
>+	},
> };
> 
> /* Searches for an option by name */
>@@ -1035,6 +1053,44 @@ static int bond_option_use_carrier_set(struct bonding *bond,
> 	return 0;
> }
> 
>+static int bond_option_mac_filter_set(struct bonding *bond,
>+				      const struct bond_opt_value *newval)
>+{
>+	int rc = 0;
>+	u8 prev = bond->params.mac_filter;
>+
>+	if (newval->value && bond->params.arp_interval) {
>+		netdev_err(bond->dev, "ARP monitoring cannot be used with MAC Filtering.\n");
>+		rc = -EPERM;
>+		goto err;
>+	}

	What happens if a user (a) switches to ARP monitor with
arp_validate in balance-xor mode after mac_filter is enabled, or, (b)
changes the mode to something other than balance-xor with mac_filter
enabled (both by changing the configuration in real time)?

	-J

>+
>+	netdev_dbg(bond->dev, "Setting mac_filter to %llu\n",
>+		   newval->value);
>+	bond->params.mac_filter = newval->value;
>+
>+	if (prev == 0 && bond->params.mac_filter > 0) {
>+		rc = bond_mac_hash_init(bond);
>+		if (rc)
>+			goto err;
>+	} else if (prev > 0 && bond->params.mac_filter == 0)
>+		bond_mac_hash_destroy(bond);
>+
>+	if (bond->mac_filter_tbl) {
>+		bond->mac_filter_tbl->p.max_size =
>+			1 << bond->params.mac_filter;
>+		netdev_dbg(bond->dev, "mac_filter hash table size: %d\n",
>+			   bond->mac_filter_tbl->p.max_size);
>+	}
>+
>+out:
>+	return rc;
>+
>+err:
>+	bond->params.mac_filter = 0;
>+	goto out;
>+}
>+
> /* There are two tricky bits here.  First, if ARP monitoring is activated, then
>  * we must disable MII monitoring.  Second, if the ARP timer isn't running,
>  * we must start it.
>diff --git a/drivers/net/bonding/bonding_priv.h b/drivers/net/bonding/bonding_priv.h
>index 48cdf3a49a7d..0299f8bcb5fd 100644
>--- a/drivers/net/bonding/bonding_priv.h
>+++ b/drivers/net/bonding/bonding_priv.h
>@@ -15,6 +15,7 @@
> #ifndef _BONDING_PRIV_H
> #define _BONDING_PRIV_H
> #include <generated/utsrelease.h>
>+#include "bond_mac_filter.h"
> 
> #define DRV_NAME	"bonding"
> #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
>diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>index 61b49063791c..42e3e676b9c2 100644
>--- a/include/net/bond_options.h
>+++ b/include/net/bond_options.h
>@@ -67,6 +67,7 @@ enum {
> 	BOND_OPT_LACP_ACTIVE,
> 	BOND_OPT_MISSED_MAX,
> 	BOND_OPT_NS_TARGETS,
>+	BOND_OPT_MAC_FILTER,
> 	BOND_OPT_LAST
> };
> 
>diff --git a/include/net/bonding.h b/include/net/bonding.h
>index b14f4c0b4e9e..5bc3e7b5a2af 100644
>--- a/include/net/bonding.h
>+++ b/include/net/bonding.h
>@@ -125,6 +125,7 @@ struct bond_params {
> 	int miimon;
> 	u8 num_peer_notif;
> 	u8 missed_max;
>+	u8 mac_filter;
> 	int arp_interval;
> 	int arp_validate;
> 	int arp_all_targets;
>@@ -248,6 +249,7 @@ struct bonding {
> 	struct   delayed_work alb_work;
> 	struct   delayed_work ad_work;
> 	struct   delayed_work mcast_work;
>+	struct   delayed_work mac_work;
> 	struct   delayed_work slave_arr_work;
> #ifdef CONFIG_DEBUG_FS
> 	/* debugging support via debugfs */
>@@ -260,6 +262,7 @@ struct bonding {
> 	spinlock_t ipsec_lock;
> #endif /* CONFIG_XFRM_OFFLOAD */
> 	struct bpf_prog *xdp_prog;
>+	struct rhashtable *mac_filter_tbl;
> };
> 
> #define bond_slave_get_rcu(dev) \
>diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>index cc284c048e69..9291df89581f 100644
>--- a/include/uapi/linux/if_link.h
>+++ b/include/uapi/linux/if_link.h
>@@ -929,6 +929,7 @@ enum {
> 	IFLA_BOND_AD_LACP_ACTIVE,
> 	IFLA_BOND_MISSED_MAX,
> 	IFLA_BOND_NS_IP6_TARGET,
>+	IFLA_BOND_MAC_FILTER,
> 	__IFLA_BOND_MAX,
> };
> 
>-- 
>2.27.0
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Jonathan Toppins March 30, 2022, 2:56 p.m. UTC | #3
On 3/24/22 20:50, Jay Vosburgh wrote:
> 	Considering this as an RFC given that net-next is closed...
> 	
> Jonathan Toppins <jtoppins@redhat.com> wrote:
> 
>> Attempt to replicate the OvS SLB Bonding[1] feature by preventing
>> duplicate frame delivery on a bond whose members are connected to
>> physically different switches.
>>
>> Combining this feature with vlan+srcmac hash policy allows a user to
>> create an access network without the need to use expensive switches that
>> support features like Cisco's VPC.
> 
> 	Could you describe this use case / implementation in a bit more
> detail?  I.e., how exactly that configuration works.  I don't think this
> patch is replicating everything in the OVS SLB documentation.
> 

The original ask was to provide OvS SLB-like functionality in Linux
bonding for users who wanted to move away from OvS. This does not try
to implement OvS SLB in its entirety, as that would require
implementing bonding in the Linux bridging code. Instead we implemented
a MAC filter that prevents duplicate frame delivery when handling BUM
traffic. We currently do not handle individual VLANs.

This is trying to solve for the logical network of the form:

                        lk1    +---+
              +-+        +-----|sw1|
h1 --linkH1--|b|        |     +---+
...          |r|--bond1-+       |lk3
hN --linkHN--|1|        |     +---+
              +-+        +-----|sw2|
                        lk2    +---+

Where h1...hN are hosts connected to a Linux bridge, br1, and bond1
with its associated links, lk1 & lk2, provides egress/ingress. One can
assume bond1, br1, and hosts h1...hN are all contained in a single box,
h1...hN are virtual machines or containers, and lk1 & lk2 provide
redundant connections to the data center. The requirement is to use all
available bandwidth when the system is functioning normally. sw1 & sw2
are physical switches that do not implement any advanced L2 management
features such as MLAG, Cisco's VPC, or LACP. Link 3, lk3, is a switch
interconnect that maintains layer 2 connectivity in the event a link or
switch fails.

>> [1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding
>> >> diff --git a/drivers/net/bonding/bond_mac_filter.c b/drivers/net/bonding/bond_mac_filter.c
>> new file mode 100644
>> index 000000000000..a16a1a000f05
>> --- /dev/null
>> +++ b/drivers/net/bonding/bond_mac_filter.c
>> @@ -0,0 +1,222 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Filter received frames based on MAC addresses "behind" the bond.
>> + */
>> +
>> +#include "bonding_priv.h"
>> +
>> +/* -------------- Cache Management -------------- */
> 
> 	I don't think this header adds anything, given that there's not
> really a lot in the section.

ok can remove.

> 
>> +static struct kmem_cache *bond_mac_cache __read_mostly;
>> +
>> +int __init bond_mac_cache_init(void)
>> +{
>> +	bond_mac_cache = kmem_cache_create("bond_mac_cache",
>> +					   sizeof(struct bond_mac_cache_entry),
>> +					   0, SLAB_HWCACHE_ALIGN, NULL);
>> +	if (!bond_mac_cache)
>> +		return -ENOMEM;
>> +	return 0;
>> +}
>> +
>> +void bond_mac_cache_destroy(void)
>> +{
>> +	kmem_cache_destroy(bond_mac_cache);
>> +}
> 
> 	There are a lot of the above sort of wrapper functions that are
> only ever called once.  Some of these, e.g., mac_delete, below, I agree
> with, as the call site is nested fairly deep and the function is
> non-trivial; or, mac_delete_rcu, which is used as a callback.
> 
> 	The above two, though, I don't see a justification for, along
> with hold_time and maybe a couple others, below.  In my opinion,
> over-abstracting these trivial things with one call site makes the code
> harder to follow.
> 

ok can remove.

>> +
>> +static inline unsigned long hold_time(const struct bonding *bond)
>> +{
>> +	return msecs_to_jiffies(5000);
>> +}
> 
> 	This shouldn't be a magic number, and if it's an important
> timeout, should it be configurable?
> 

This is the MAC entry age-out time, much like the Linux bridge's
"br->ageing_time". We didn't find a need to modify the age-out time
currently; if you think another bond parameter is needed to make this
configurable, I can add it.
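
For illustration, a minimal sketch of what a configurable age-out could
look like, assuming a hypothetical mac_filter_ageing bond parameter (in
milliseconds) that is not part of this patch:

static unsigned long hold_time(const struct bonding *bond)
{
	/* mac_filter_ageing is hypothetical; fall back to the current
	 * 5000 ms default when it is unset (zero).
	 */
	unsigned long ageing_ms = bond->params.mac_filter_ageing ?: 5000;

	return msecs_to_jiffies(ageing_ms);
}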

>> +int bond_mac_hash_init(struct bonding *bond)
>> +{
>> +	int rc;
> 
> 	As a point of style, (almost) everywhere else in bonding uses
> "ret" for a return value.  The exceptions are largely my doing, but,
> still, it'd be nice to be mostly consistent in nomenclature.
> 

ok.

>> +
>> +int bond_mac_insert(struct bonding *bond, const u8 *addr)
>> +{
>> +	struct bond_mac_cache_entry *entry;
>> +	int rc = 0;
>> +
>> +	if (!is_valid_ether_addr(addr))
>> +		return -EINVAL;
>> +
>> +	rcu_read_lock();
>> +	entry = mac_find(bond, addr);
>> +	if (entry) {
>> +		unsigned long flags;
>> +
>> +		spin_lock_irqsave(&entry->lock, flags);
>> +		if (!test_bit(BOND_MAC_DEAD, &entry->flags)) {
>> +			mac_update(entry);
>> +			spin_unlock_irqrestore(&entry->lock, flags);
>> +			goto out;
>> +		}
>> +		spin_unlock_irqrestore(&entry->lock, flags);
> 
> 	This seems really expensive, as it will add a spin_lock_irqsave
> round trip for almost every packet transmitted when mac_filter is
> enabled (as this will be called by bond_xmit_3ad_xor_slave_get).
> 

It is; if you have a suggestion to make it less expensive, I would
like to hear ideas on this. On transmit the bond needs to see whether a
new source MAC is talking. If it is not new, we just need to update the
age-out time (mac_update); if the MAC is new, we need to add a new
entry to the filter table. The lock is per-entry, so it is not blocking
every entry update, just the one we are dealing with right now.
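
One idea would be a lockless timestamp update in the style of the
bridge FDB code; a minimal sketch, assuming a racy but benign store to
'used' is acceptable because it is only ever set to jiffies and only
read to decide expiry:

static void mac_update(struct bond_mac_cache_entry *entry)
{
	unsigned long now = jiffies;

	/* Skip the store when the timestamp is already current to avoid
	 * dirtying the cache line on every packet; no lock is taken.
	 */
	if (READ_ONCE(entry->used) != now)
		WRITE_ONCE(entry->used, now);
}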

>> +	}
>> +
>> +	rc = mac_create(bond, addr);
>> +
>> +out:
>> +	rcu_read_unlock();
>> +	return rc;
>> +}
>> +
>> +int bond_xor_recv(const struct sk_buff *skb, struct bonding *bond,
>> +		  struct slave *slave)
>> +{
>> +	const struct ethhdr *mac_hdr;
>> +	struct bond_mac_cache_entry *entry;
>> +	int rc = RX_HANDLER_PASS;
>> +
>> +	mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>> +	rcu_read_lock();
>> +	if (is_multicast_ether_addr(mac_hdr->h_dest) &&
>> +	    slave != rcu_dereference(bond->curr_active_slave)) {
>> +		rc = RX_HANDLER_CONSUMED;
>> +		goto out;
>> +	}
>> +
>> +	entry = mac_find(bond, mac_hdr->h_source);
>> +	if (entry) {
>> +		unsigned long flags;
>> +		bool expired;
>> +
>> +		spin_lock_irqsave(&entry->lock, flags);
>> +		expired = has_expired(bond, entry);
>> +		spin_unlock_irqrestore(&entry->lock, flags);
>> +		if (!expired)
>> +			rc = RX_HANDLER_CONSUMED;
>> +	}
> 
> 	As above, really expensive, except for incoming packets here
> (since this is called as the recv_probe).
> 

By incoming packets do you mean packets received by the bond? If so,
I'm not sure I understand the comment. On the receive side this will be
run for every frame handled by the bond, and the lock will only be
taken if an entry is found for the source MAC.
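
If the transmit side updated 'used' with WRITE_ONCE() as sketched
above, the receive-side expiry test could likewise avoid the per-packet
lock; again only a sketch, assuming the racy read is acceptable:

static bool has_expired(const struct bonding *bond,
			struct bond_mac_cache_entry *mac)
{
	/* Pairs with the WRITE_ONCE() of entry->used in mac_update(). */
	return time_before_eq(READ_ONCE(mac->used) + hold_time(bond),
			      jiffies);
}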

>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 15eddca7b4b6..f5a8a50e9116 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1549,6 +1549,10 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
>> 		return RX_HANDLER_EXACT;
>> 	}
>>
>> +	/* this function should not rely on the recv_probe to set ret
>> +	 * correctly
>> +	 */
>> +	ret = RX_HANDLER_ANOTHER;
> 
> 	This change is overriding the return from a recv_probe added by
> this patch (bond_xor_recv can return RX_HANDLER_PASS).  Why?
> 
> 	Also, I don't agree with the comment; the recv_probe return
> value by design affects the return value from bond_handle_frame.
> 

I had to go back and look at why this was added; I don't see a need
for it now, so I will remove it.

>>
>> static const struct bond_opt_value bond_mode_tbl[] = {
>> @@ -226,6 +229,12 @@ static const struct bond_opt_value bond_missed_max_tbl[] = {
>> 	{ NULL,		-1,	0},
>> };
>>
>> +static const struct bond_opt_value bond_mac_filter_tbl[] = {
>> +	{ "off",	0,	BOND_VALFLAG_MIN | BOND_VALFLAG_DEFAULT},
>> +	{ "maxval",	18,	BOND_VALFLAG_MAX},
> 
> 	What's the magic number 18?

It is simply the maximum value the option can be set to, as I thought
2^18 (262,144) entries was more than enough for the hash table.

>> /* Searches for an option by name */
>> @@ -1035,6 +1053,44 @@ static int bond_option_use_carrier_set(struct bonding *bond,
>> 	return 0;
>> }
>>
>> +static int bond_option_mac_filter_set(struct bonding *bond,
>> +				      const struct bond_opt_value *newval)
>> +{
>> +	int rc = 0;
>> +	u8 prev = bond->params.mac_filter;
>> +
>> +	if (newval->value && bond->params.arp_interval) {
>> +		netdev_err(bond->dev, "ARP monitoring cannot be used with MAC Filtering.\n");
>> +		rc = -EPERM;
>> +		goto err;
>> +	}
> 
> 	What happens if a user (a) switches to ARP monitor with
> arp_validate in balance-xor mode after mac_filter is enabled, or, (b)
> changes the mode to something other than balance-xor with mac_filter
> enabled (both by changing the configuration in real time)?

For (a), the ARP config handlers will need to be modified to account
for this. I assume they will take the same approach as with MII vs. ARP
monitoring.

For (b), the sites that test bond->params.mac_filter will need to be
logically ANDed with bond->params.mode == BOND_MODE_XOR.
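
For illustration, a minimal sketch of such a gate (not part of this
patch); the transmit and receive paths could then test this helper
instead of bond->params.mac_filter directly:

static inline bool bond_mac_filter_enabled(const struct bonding *bond)
{
	/* Hypothetical helper: filter only active in balance-xor mode. */
	return bond->params.mac_filter && BOND_MODE(bond) == BOND_MODE_XOR;
}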

Thank you for the comments.

-Jon

Patch

diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 525e6842dd33..a5a1669d3efe 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -550,6 +550,25 @@  lacp_rate
 
 	The default is slow.
 
+mac_filter
+
+	Tells the bonding device to drop frames received whose source MAC
+	address	matches entries in a filter table. The filter table is
+	populated when the bond transmits frames. This is similar in
+	concept to the MAC learning table implemented in the bridge code.
+
+	This filtering is only enabled for the balance-xor bonding mode.
+
+	off or 0
+		Turns the feature off
+
+	number
+		A number greater than zero turns the feature on and sets
+		the maximum number of MAC addresses to store in the hash
+		table.
+
+	The default is off.
+
 max_bonds
 
 	Specifies the number of bonding devices to create for this
diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 30e8ae3da2da..5dbc360a8522 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -5,7 +5,7 @@ 
 
 obj-$(CONFIG_BONDING) += bonding.o
 
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o bond_mac_filter.o
 
 proc-$(CONFIG_PROC_FS) += bond_procfs.o
 bonding-objs += $(proc-y)
diff --git a/drivers/net/bonding/bond_mac_filter.c b/drivers/net/bonding/bond_mac_filter.c
new file mode 100644
index 000000000000..a16a1a000f05
--- /dev/null
+++ b/drivers/net/bonding/bond_mac_filter.c
@@ -0,0 +1,222 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Filter received frames based on MAC addresses "behind" the bond.
+ */
+
+#include "bonding_priv.h"
+
+/* -------------- Cache Management -------------- */
+
+static struct kmem_cache *bond_mac_cache __read_mostly;
+
+int __init bond_mac_cache_init(void)
+{
+	bond_mac_cache = kmem_cache_create("bond_mac_cache",
+					   sizeof(struct bond_mac_cache_entry),
+					   0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!bond_mac_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void bond_mac_cache_destroy(void)
+{
+	kmem_cache_destroy(bond_mac_cache);
+}
+
+/* -------------- Hash Table Management -------------- */
+
+static const struct rhashtable_params bond_rht_params = {
+	.head_offset         = offsetof(struct bond_mac_cache_entry, rhnode),
+	.key_offset          = offsetof(struct bond_mac_cache_entry, key),
+	.key_len             = sizeof(struct mac_addr),
+	.automatic_shrinking = true,
+};
+
+static inline unsigned long hold_time(const struct bonding *bond)
+{
+	return msecs_to_jiffies(5000);
+}
+
+static bool has_expired(const struct bonding *bond,
+			struct bond_mac_cache_entry *mac)
+{
+	return time_before_eq(mac->used + hold_time(bond), jiffies);
+}
+
+static void mac_delete_rcu(struct callback_head *head)
+{
+	kmem_cache_free(bond_mac_cache,
+			container_of(head, struct bond_mac_cache_entry, rcu));
+}
+
+static int mac_delete(struct bonding *bond,
+		      struct bond_mac_cache_entry *entry)
+{
+	int ret;
+
+	ret = rhashtable_remove_fast(bond->mac_filter_tbl,
+				     &entry->rhnode,
+				     bond->mac_filter_tbl->p);
+	set_bit(BOND_MAC_DEAD, &entry->flags);
+	call_rcu(&entry->rcu, mac_delete_rcu);
+	return ret;
+}
+
+void bond_mac_hash_release_entries(struct work_struct *work)
+{
+	struct bonding *bond = container_of(work, struct bonding,
+				mac_work.work);
+	struct rhashtable_iter iter;
+	struct bond_mac_cache_entry *entry;
+	unsigned long flags;
+
+	rhashtable_walk_enter(bond->mac_filter_tbl, &iter);
+	rhashtable_walk_start(&iter);
+	while ((entry = rhashtable_walk_next(&iter)) != NULL) {
+		if (IS_ERR(entry))
+			continue;
+
+		spin_lock_irqsave(&entry->lock, flags);
+		if (has_expired(bond, entry))
+			mac_delete(bond, entry);
+		spin_unlock_irqrestore(&entry->lock, flags);
+	}
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
+	queue_delayed_work(bond->wq, &bond->mac_work,
+			   msecs_to_jiffies(5 * 60 * 1000));
+}
+
+int bond_mac_hash_init(struct bonding *bond)
+{
+	int rc;
+
+	netdev_dbg(bond->dev, "mac_filter: alloc memory for hash table\n");
+	bond->mac_filter_tbl = kzalloc(sizeof(*bond->mac_filter_tbl),
+				       GFP_KERNEL);
+	if (!bond->mac_filter_tbl)
+		return -ENOMEM;
+
+	rc = rhashtable_init(bond->mac_filter_tbl, &bond_rht_params);
+	if (rc)
+		kfree(bond->mac_filter_tbl);
+
+	return rc;
+}
+
+static void bond_mac_free_entry(void *entry, void *ctx)
+{
+	kmem_cache_free((struct kmem_cache *)ctx, entry);
+}
+
+void bond_mac_hash_destroy(struct bonding *bond)
+{
+	if (bond->mac_filter_tbl) {
+		rhashtable_free_and_destroy(bond->mac_filter_tbl,
+					    bond_mac_free_entry,
+					    bond_mac_cache);
+		kfree(bond->mac_filter_tbl);
+		bond->mac_filter_tbl = NULL;
+	}
+}
+
+static int mac_create(struct bonding *bond, const u8 *addr)
+{
+	struct bond_mac_cache_entry *entry;
+	int ret;
+
+	entry = kmem_cache_alloc(bond_mac_cache, GFP_ATOMIC);
+	if (!entry)
+		return -ENOMEM;
+	spin_lock_init(&entry->lock);
+	memcpy(&entry->key, addr, sizeof(entry->key));
+	entry->used = jiffies;
+	ret = rhashtable_lookup_insert_fast(bond->mac_filter_tbl,
+					    &entry->rhnode,
+					    bond->mac_filter_tbl->p);
+	if (ret) {
+		kmem_cache_free(bond_mac_cache, entry);
+		entry = NULL;
+		if (ret == -EEXIST)
+			return 0;
+		pr_err_once("Failed to insert mac entry %d\n", ret);
+	}
+	return ret;
+}
+
+static struct bond_mac_cache_entry *mac_find(struct bonding *bond,
+					     const u8 *addr)
+{
+	struct mac_addr key;
+
+	memcpy(&key, addr, sizeof(key));
+	return rhashtable_lookup(bond->mac_filter_tbl, &key,
+				 bond->mac_filter_tbl->p);
+}
+
+inline void mac_update(struct bond_mac_cache_entry *entry)
+{
+	entry->used = jiffies;
+}
+
+int bond_mac_insert(struct bonding *bond, const u8 *addr)
+{
+	struct bond_mac_cache_entry *entry;
+	int rc = 0;
+
+	if (!is_valid_ether_addr(addr))
+		return -EINVAL;
+
+	rcu_read_lock();
+	entry = mac_find(bond, addr);
+	if (entry) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&entry->lock, flags);
+		if (!test_bit(BOND_MAC_DEAD, &entry->flags)) {
+			mac_update(entry);
+			spin_unlock_irqrestore(&entry->lock, flags);
+			goto out;
+		}
+		spin_unlock_irqrestore(&entry->lock, flags);
+	}
+
+	rc = mac_create(bond, addr);
+
+out:
+	rcu_read_unlock();
+	return rc;
+}
+
+int bond_xor_recv(const struct sk_buff *skb, struct bonding *bond,
+		  struct slave *slave)
+{
+	const struct ethhdr *mac_hdr;
+	struct bond_mac_cache_entry *entry;
+	int rc = RX_HANDLER_PASS;
+
+	mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+	rcu_read_lock();
+	if (is_multicast_ether_addr(mac_hdr->h_dest) &&
+	    slave != rcu_dereference(bond->curr_active_slave)) {
+		rc = RX_HANDLER_CONSUMED;
+		goto out;
+	}
+
+	entry = mac_find(bond, mac_hdr->h_source);
+	if (entry) {
+		unsigned long flags;
+		bool expired;
+
+		spin_lock_irqsave(&entry->lock, flags);
+		expired = has_expired(bond, entry);
+		spin_unlock_irqrestore(&entry->lock, flags);
+		if (!expired)
+			rc = RX_HANDLER_CONSUMED;
+	}
+
+out:
+	rcu_read_unlock();
+	return rc;
+}
diff --git a/drivers/net/bonding/bond_mac_filter.h b/drivers/net/bonding/bond_mac_filter.h
new file mode 100644
index 000000000000..0cfcc5653e7e
--- /dev/null
+++ b/drivers/net/bonding/bond_mac_filter.h
@@ -0,0 +1,40 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * Filter received frames based on MAC addresses "behind" the bond.
+ */
+
+#ifndef _BOND_MAC_FILTER_H
+#define _BOND_MAC_FILTER_H
+#include <net/bonding.h>
+#include <linux/spinlock.h>
+#include <linux/rhashtable.h>
+
+enum {
+	BOND_MAC_DEAD,
+	BOND_MAC_LOCKED,
+	BOND_MAC_STATIC,
+};
+
+struct bond_mac_cache_entry {
+	struct rhash_head	rhnode;
+	struct mac_addr		key;
+
+	spinlock_t		lock; /* protects used member */
+	unsigned long		flags;
+	unsigned long		used;
+	struct rcu_head		rcu;
+};
+
+int __init bond_mac_cache_init(void);
+void bond_mac_cache_destroy(void);
+
+void bond_mac_hash_release_entries(struct work_struct *work);
+int bond_mac_hash_init(struct bonding *bond);
+void bond_mac_hash_destroy(struct bonding *bond);
+
+int bond_mac_insert(struct bonding *bond, const u8 *addr);
+int bond_xor_recv(const struct sk_buff *skb,
+		  struct bonding *bond,
+		  struct slave *slave);
+
+#endif
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 15eddca7b4b6..f5a8a50e9116 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1549,6 +1549,10 @@  static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		return RX_HANDLER_EXACT;
 	}
 
+	/* this function should not rely on the recv_probe to set ret
+	 * correctly
+	 */
+	ret = RX_HANDLER_ANOTHER;
 	skb->dev = bond->dev;
 
 	if (BOND_MODE(bond) == BOND_MODE_ALB &&
@@ -4117,6 +4121,7 @@  void bond_work_init_all(struct bonding *bond)
 	INIT_DELAYED_WORK(&bond->arp_work, bond_arp_monitor);
 	INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
 	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
+	INIT_DELAYED_WORK(&bond->mac_work, bond_mac_hash_release_entries);
 }
 
 static void bond_work_cancel_all(struct bonding *bond)
@@ -4127,6 +4132,7 @@  static void bond_work_cancel_all(struct bonding *bond)
 	cancel_delayed_work_sync(&bond->ad_work);
 	cancel_delayed_work_sync(&bond->mcast_work);
 	cancel_delayed_work_sync(&bond->slave_arr_work);
+	cancel_delayed_work_sync(&bond->mac_work);
 }
 
 static int bond_open(struct net_device *bond_dev)
@@ -4174,6 +4180,11 @@  static int bond_open(struct net_device *bond_dev)
 		bond_3ad_initiate_agg_selection(bond, 1);
 	}
 
+	if (bond->params.mac_filter) {
+		bond->recv_probe = bond_xor_recv;
+		queue_delayed_work(bond->wq, &bond->mac_work, 0);
+	}
+
 	if (bond_mode_can_use_xmit_hash(bond))
 		bond_update_slave_arr(bond, NULL);
 
@@ -5043,6 +5054,13 @@  static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
 	if (unlikely(!count))
 		return NULL;
 
+	if (bond->params.mac_filter) {
+		const struct ethhdr *mac_hdr;
+
+		mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+		if (bond_mac_insert(bond, mac_hdr->h_source))
+			return NULL;
+	}
 	slave = slaves->arr[hash % count];
 	return slave;
 }
@@ -5660,6 +5678,8 @@  static void bond_destructor(struct net_device *bond_dev)
 
 	if (bond->rr_tx_counter)
 		free_percpu(bond->rr_tx_counter);
+
+	bond_mac_hash_destroy(bond);
 }
 
 void bond_setup(struct net_device *bond_dev)
@@ -6115,6 +6135,7 @@  static int bond_check_params(struct bond_params *params)
 	params->downdelay = downdelay;
 	params->peer_notif_delay = 0;
 	params->use_carrier = use_carrier;
+	params->mac_filter = 0;
 	params->lacp_active = 1;
 	params->lacp_fast = lacp_fast;
 	params->primary[0] = 0;
@@ -6317,6 +6338,10 @@  static int __init bonding_init(void)
 			goto err;
 	}
 
+	res = bond_mac_cache_init();
+	if (res)
+		goto err;
+
 	skb_flow_dissector_init(&flow_keys_bonding,
 				flow_keys_bonding_keys,
 				ARRAY_SIZE(flow_keys_bonding_keys));
@@ -6346,6 +6371,7 @@  static void __exit bonding_exit(void)
 	/* Make sure we don't have an imbalance on our netpoll blocking */
 	WARN_ON(atomic_read(&netpoll_block_tx));
 #endif
+	bond_mac_cache_destroy();
 }
 
 module_init(bonding_init);
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index f427fa1737c7..249d79b6e21a 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -113,6 +113,7 @@  static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
 	[IFLA_BOND_PEER_NOTIF_DELAY]    = { .type = NLA_U32 },
 	[IFLA_BOND_MISSED_MAX]		= { .type = NLA_U8 },
 	[IFLA_BOND_NS_IP6_TARGET]	= { .type = NLA_NESTED },
+	[IFLA_BOND_MAC_FILTER]		= { .type = NLA_U8 },
 };
 
 static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = {
@@ -498,6 +499,14 @@  static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[],
 		if (err)
 			return err;
 	}
+	if (data[IFLA_BOND_MAC_FILTER]) {
+		u8 mac_filter = nla_get_u8(data[IFLA_BOND_MAC_FILTER]);
+
+		bond_opt_initval(&newval, mac_filter);
+		err = __bond_opt_set(bond, BOND_OPT_MAC_FILTER, &newval);
+		if (err)
+			return err;
+	}
 
 	return 0;
 }
@@ -565,6 +574,7 @@  static size_t bond_get_size(const struct net_device *bond_dev)
 						/* IFLA_BOND_NS_IP6_TARGET */
 		nla_total_size(sizeof(struct nlattr)) +
 		nla_total_size(sizeof(struct in6_addr)) * BOND_MAX_NS_TARGETS +
+		nla_total_size(sizeof(u8)) +	/* IFLA_BOND_MAC_FILTER */
 		0;
 }
 
@@ -723,6 +733,9 @@  static int bond_fill_info(struct sk_buff *skb,
 	if (nla_put_u8(skb, IFLA_BOND_MISSED_MAX,
 		       bond->params.missed_max))
 		goto nla_put_failure;
+	if (nla_put_u8(skb, IFLA_BOND_MAC_FILTER,
+		       bond->params.mac_filter))
+		goto nla_put_failure;
 
 	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
 		struct ad_info info;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 64f7db2627ce..0f6036ff7b86 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -15,6 +15,7 @@ 
 #include <linux/sched/signal.h>
 
 #include <net/bonding.h>
+#include "bonding_priv.h"
 
 static int bond_option_active_slave_set(struct bonding *bond,
 					const struct bond_opt_value *newval);
@@ -84,6 +85,8 @@  static int bond_option_ad_user_port_key_set(struct bonding *bond,
 					    const struct bond_opt_value *newval);
 static int bond_option_missed_max_set(struct bonding *bond,
 				      const struct bond_opt_value *newval);
+static int bond_option_mac_filter_set(struct bonding *bond,
+				      const struct bond_opt_value *newval);
 
 
 static const struct bond_opt_value bond_mode_tbl[] = {
@@ -226,6 +229,12 @@  static const struct bond_opt_value bond_missed_max_tbl[] = {
 	{ NULL,		-1,	0},
 };
 
+static const struct bond_opt_value bond_mac_filter_tbl[] = {
+	{ "off",	0,	BOND_VALFLAG_MIN | BOND_VALFLAG_DEFAULT},
+	{ "maxval",	18,	BOND_VALFLAG_MAX},
+	{ NULL,		-1,	0}
+};
+
 static const struct bond_option bond_opts[BOND_OPT_LAST] = {
 	[BOND_OPT_MODE] = {
 		.id = BOND_OPT_MODE,
@@ -482,7 +491,16 @@  static const struct bond_option bond_opts[BOND_OPT_LAST] = {
 		.desc = "Delay between each peer notification on failover event, in milliseconds",
 		.values = bond_intmax_tbl,
 		.set = bond_option_peer_notif_delay_set
-	}
+	},
+	[BOND_OPT_MAC_FILTER] = {
+		.id = BOND_OPT_MAC_FILTER,
+		.name = "mac_filter",
+		.unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_XOR)),
+		.desc = "filter received frames based on MAC addresses that have transmitted from the bond, number of MAC addresses to track",
+		.flags = BOND_OPTFLAG_NOSLAVES | BOND_OPTFLAG_IFDOWN,
+		.values = bond_mac_filter_tbl,
+		.set = bond_option_mac_filter_set
+	},
 };
 
 /* Searches for an option by name */
@@ -1035,6 +1053,44 @@  static int bond_option_use_carrier_set(struct bonding *bond,
 	return 0;
 }
 
+static int bond_option_mac_filter_set(struct bonding *bond,
+				      const struct bond_opt_value *newval)
+{
+	int rc = 0;
+	u8 prev = bond->params.mac_filter;
+
+	if (newval->value && bond->params.arp_interval) {
+		netdev_err(bond->dev, "ARP monitoring cannot be used with MAC Filtering.\n");
+		rc = -EPERM;
+		goto err;
+	}
+
+	netdev_dbg(bond->dev, "Setting mac_filter to %llu\n",
+		   newval->value);
+	bond->params.mac_filter = newval->value;
+
+	if (prev == 0 && bond->params.mac_filter > 0) {
+		rc = bond_mac_hash_init(bond);
+		if (rc)
+			goto err;
+	} else if (prev > 0 && bond->params.mac_filter == 0)
+		bond_mac_hash_destroy(bond);
+
+	if (bond->mac_filter_tbl) {
+		bond->mac_filter_tbl->p.max_size =
+			1 << bond->params.mac_filter;
+		netdev_dbg(bond->dev, "mac_filter hash table size: %d\n",
+			   bond->mac_filter_tbl->p.max_size);
+	}
+
+out:
+	return rc;
+
+err:
+	bond->params.mac_filter = 0;
+	goto out;
+}
+
 /* There are two tricky bits here.  First, if ARP monitoring is activated, then
  * we must disable MII monitoring.  Second, if the ARP timer isn't running,
  * we must start it.
diff --git a/drivers/net/bonding/bonding_priv.h b/drivers/net/bonding/bonding_priv.h
index 48cdf3a49a7d..0299f8bcb5fd 100644
--- a/drivers/net/bonding/bonding_priv.h
+++ b/drivers/net/bonding/bonding_priv.h
@@ -15,6 +15,7 @@ 
 #ifndef _BONDING_PRIV_H
 #define _BONDING_PRIV_H
 #include <generated/utsrelease.h>
+#include "bond_mac_filter.h"
 
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
diff --git a/include/net/bond_options.h b/include/net/bond_options.h
index 61b49063791c..42e3e676b9c2 100644
--- a/include/net/bond_options.h
+++ b/include/net/bond_options.h
@@ -67,6 +67,7 @@  enum {
 	BOND_OPT_LACP_ACTIVE,
 	BOND_OPT_MISSED_MAX,
 	BOND_OPT_NS_TARGETS,
+	BOND_OPT_MAC_FILTER,
 	BOND_OPT_LAST
 };
 
diff --git a/include/net/bonding.h b/include/net/bonding.h
index b14f4c0b4e9e..5bc3e7b5a2af 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -125,6 +125,7 @@  struct bond_params {
 	int miimon;
 	u8 num_peer_notif;
 	u8 missed_max;
+	u8 mac_filter;
 	int arp_interval;
 	int arp_validate;
 	int arp_all_targets;
@@ -248,6 +249,7 @@  struct bonding {
 	struct   delayed_work alb_work;
 	struct   delayed_work ad_work;
 	struct   delayed_work mcast_work;
+	struct   delayed_work mac_work;
 	struct   delayed_work slave_arr_work;
 #ifdef CONFIG_DEBUG_FS
 	/* debugging support via debugfs */
@@ -260,6 +262,7 @@  struct bonding {
 	spinlock_t ipsec_lock;
 #endif /* CONFIG_XFRM_OFFLOAD */
 	struct bpf_prog *xdp_prog;
+	struct rhashtable *mac_filter_tbl;
 };
 
 #define bond_slave_get_rcu(dev) \
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index cc284c048e69..9291df89581f 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -929,6 +929,7 @@  enum {
 	IFLA_BOND_AD_LACP_ACTIVE,
 	IFLA_BOND_MISSED_MAX,
 	IFLA_BOND_NS_IP6_TARGET,
+	IFLA_BOND_MAC_FILTER,
 	__IFLA_BOND_MAX,
 };