diff mbox series

[bpf-next,V2,5/6] bpf: Add MTU check for TC-BPF packets after egress hook

Message ID 160208778070.798237.16265441131909465819.stgit@firesoul (mailing list archive)
State Changes Requested
Headers show
Series bpf: New approach for BPF MTU handling and enforcement | expand

Commit Message

Jesper Dangaard Brouer Oct. 7, 2020, 4:23 p.m. UTC
The MTU should only apply to transmitted packets.

When TC-ingress redirect packet to egress on another netdev, then the
normal netstack MTU checks are skipped (and driver level will not catch
any MTU violation, checked ixgbe).

This patch choose not to add MTU check in the egress code path of
skb_do_redirect() prior to calling dev_queue_xmit(), because it is still
possible to run another BPF egress program that will shrink/consume
headers, which will make packet comply with netdev MTU. This use-case
might already be in production use (if ingress MTU is larger than egress).

Instead do the MTU check after sch_handle_egress() step, for the cases
that require this.

The cases need a bit explaining. Ingress to egress redirected packets
could be detected via skb->tc_at_ingress bit, but it is not reliable,
because sch_handle_egress() could steal the packet and redirect this
(again) to another egress netdev, which will then have the
skb->tc_at_ingress cleared. There is also the case of TC-egress prog
increase packet size and then redirect it egress. Thus, it is more
reliable to do the MTU check for any redirected packet (both ingress and
egress), which is available via skb_is_redirected() in earlier patch.
Also handle case where egress BPF-prog increased size.

One advantage of this approach is that it ingress-to-egress BPF-prog can
send information via packet data. With the MTU checks removed in the
helpers, and also not done in skb_do_redirect() call, this allows for an
ingress BPF-prog to communicate with an egress BPF-prog via packet data,
as long as egress BPF-prog remove this prior to transmitting packet.

Troubleshooting: MTU violations are recorded in TX dropped counter, and
kprobe on dev_queue_xmit() have retval -EMSGSIZE.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/dev.c |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

Comments

Daniel Borkmann Oct. 7, 2020, 9:37 p.m. UTC | #1
On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
[...]
>   net/core/dev.c |   24 ++++++++++++++++++++++--
>   1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b433098896b2..19406013f93e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>   	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
>   	case TC_ACT_OK:
>   	case TC_ACT_RECLASSIFY:
> +		*ret = NET_XMIT_SUCCESS;
>   		skb->tc_index = TC_H_MIN(cl_res.classid);
>   		break;
>   	case TC_ACT_SHOT:
> @@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   {
>   	struct net_device *dev = skb->dev;
>   	struct netdev_queue *txq;
> +#ifdef CONFIG_NET_CLS_ACT
> +	bool mtu_check = false;
> +#endif
> +	bool again = false;
>   	struct Qdisc *q;
>   	int rc = -ENOMEM;
> -	bool again = false;
>   
>   	skb_reset_mac_header(skb);
>   
> @@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   
>   	qdisc_pkt_len_init(skb);
>   #ifdef CONFIG_NET_CLS_ACT
> +	mtu_check = skb_is_redirected(skb);
>   	skb->tc_at_ingress = 0;
>   # ifdef CONFIG_NET_EGRESS
>   	if (static_branch_unlikely(&egress_needed_key)) {
> +		unsigned int len_orig = skb->len;
> +
>   		skb = sch_handle_egress(skb, &rc, dev);
>   		if (!skb)
>   			goto out;
> +		/* BPF-prog ran and could have changed packet size beyond MTU */
> +		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
> +			mtu_check = true;
>   	}
>   # endif
> +	/* MTU-check only happens on "last" net_device in a redirect sequence
> +	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
> +	 * either ingress or egress to another device).
> +	 */

Hmm, quite some overhead in fast path. Also, won't this be checked multiple times
on stacked devices? :( Moreover, this missed the fact that 'real' qdiscs can have
filters attached too, this would come after this check. Can't this instead be in
driver layer for those that really need it? I would probably only drop the check
as done in 1/6 and allow the BPF prog to do the validation if needed.

> +	if (mtu_check && !is_skb_forwardable(dev, skb)) {
> +		rc = -EMSGSIZE;
> +		goto drop;
> +	}
>   #endif
>   	/* If device/qdisc don't need skb->dst, release it right now while
>   	 * its hot in this cpu cache.
> @@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   
>   	rc = -ENETDOWN;
>   	rcu_read_unlock_bh();
> -
> +#ifdef CONFIG_NET_CLS_ACT
> +drop:
> +#endif
>   	atomic_long_inc(&dev->tx_dropped);
>   	kfree_skb_list(skb);
>   	return rc;
>
John Fastabend Oct. 7, 2020, 10:36 p.m. UTC | #2
Daniel Borkmann wrote:
> On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> [...]
> >   net/core/dev.c |   24 ++++++++++++++++++++++--
> >   1 file changed, 22 insertions(+), 2 deletions(-)

Couple high-level comments. Whats the problem with just letting the driver
consume the packet? I would chalk it up to a buggy BPF program that is
sending these packets. The drivers really shouldn't panic or do anything
horrible under this case because even today I don't think we can be
100% certain MTU on skb matches set MTU. Imagine the case where I change
the MTU from 9kB->1500B there will be some skbs in-flight with the larger
length and some with the shorter. If the drivers panic/fault or otherwise
does something else horrible thats not going to be friendly in general case
regardless of what BPF does. And seeing this type of config is all done
async its tricky (not practical) to flush any skbs in-flight.

I've spent many hours debugging these types of feature flag, mtu
change bugs on the driver side I'm not sure it can be resolved by
the stack easily. Better to just build drivers that can handle it IMO.

Do we know if sending >MTU size skbs to drivers causes problems in real
cases? I haven't tried on the NICs I have here, but I expect they should
be fine. Fine here being system keeps running as expected. Dropping the
skb either on TX or RX side is expected. Even with this change though
its possible for the skb to slip through if I configure MTU on a live
system.

> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b433098896b2..19406013f93e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >   	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
> >   	case TC_ACT_OK:
> >   	case TC_ACT_RECLASSIFY:
> > +		*ret = NET_XMIT_SUCCESS;
> >   		skb->tc_index = TC_H_MIN(cl_res.classid);
> >   		break;
> >   	case TC_ACT_SHOT:
> > @@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   {
> >   	struct net_device *dev = skb->dev;
> >   	struct netdev_queue *txq;
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	bool mtu_check = false;
> > +#endif
> > +	bool again = false;
> >   	struct Qdisc *q;
> >   	int rc = -ENOMEM;
> > -	bool again = false;
> >   
> >   	skb_reset_mac_header(skb);
> >   
> > @@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	qdisc_pkt_len_init(skb);
> >   #ifdef CONFIG_NET_CLS_ACT
> > +	mtu_check = skb_is_redirected(skb);
> >   	skb->tc_at_ingress = 0;
> >   # ifdef CONFIG_NET_EGRESS
> >   	if (static_branch_unlikely(&egress_needed_key)) {
> > +		unsigned int len_orig = skb->len;
> > +
> >   		skb = sch_handle_egress(skb, &rc, dev);
> >   		if (!skb)
> >   			goto out;
> > +		/* BPF-prog ran and could have changed packet size beyond MTU */
> > +		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
> > +			mtu_check = true;
> >   	}
> >   # endif
> > +	/* MTU-check only happens on "last" net_device in a redirect sequence
> > +	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
> > +	 * either ingress or egress to another device).
> > +	 */
> 
> Hmm, quite some overhead in fast path. Also, won't this be checked multiple times
> on stacked devices? :( Moreover, this missed the fact that 'real' qdiscs can have
> filters attached too, this would come after this check. Can't this instead be in
> driver layer for those that really need it? I would probably only drop the check
> as done in 1/6 and allow the BPF prog to do the validation if needed.

Any checks like this should probably go in validate_xmit_skb_list() this is
where we check other things GSO, checksum, etc. that depend on drivers. Worse
case we could add a netif_needs_mtu() case in validate_xmit_skb if drivers
really can't handle >MTU size.

> 
> > +	if (mtu_check && !is_skb_forwardable(dev, skb)) {
> > +		rc = -EMSGSIZE;
> > +		goto drop;
> > +	}
> >   #endif
> >   	/* If device/qdisc don't need skb->dst, release it right now while
> >   	 * its hot in this cpu cache.
> > @@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	rc = -ENETDOWN;
> >   	rcu_read_unlock_bh();
> > -
> > +#ifdef CONFIG_NET_CLS_ACT
> > +drop:
> > +#endif
> >   	atomic_long_inc(&dev->tx_dropped);
> >   	kfree_skb_list(skb);
> >   	return rc;
> >
Maciej Żenczykowski Oct. 7, 2020, 11:52 p.m. UTC | #3
On Wed, Oct 7, 2020 at 3:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Daniel Borkmann wrote:
> > On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> > [...]
> > >   net/core/dev.c |   24 ++++++++++++++++++++++--
> > >   1 file changed, 22 insertions(+), 2 deletions(-)
>
> Couple high-level comments. Whats the problem with just letting the driver
> consume the packet? I would chalk it up to a buggy BPF program that is
> sending these packets. The drivers really shouldn't panic or do anything
> horrible under this case because even today I don't think we can be
> 100% certain MTU on skb matches set MTU. Imagine the case where I change
> the MTU from 9kB->1500B there will be some skbs in-flight with the larger
> length and some with the shorter. If the drivers panic/fault or otherwise
> does something else horrible thats not going to be friendly in general case
> regardless of what BPF does. And seeing this type of config is all done
> async its tricky (not practical) to flush any skbs in-flight.
>
> I've spent many hours debugging these types of feature flag, mtu
> change bugs on the driver side I'm not sure it can be resolved by
> the stack easily. Better to just build drivers that can handle it IMO.
>
> Do we know if sending >MTU size skbs to drivers causes problems in real
> cases? I haven't tried on the NICs I have here, but I expect they should
> be fine. Fine here being system keeps running as expected. Dropping the
> skb either on TX or RX side is expected. Even with this change though
> its possible for the skb to slip through if I configure MTU on a live
> system.

I wholeheartedly agree with the above.

Ideally the only >mtu check should happen at driver admittance.
But again ideally it should happen in some core stack location not in
the driver itself.
However, due to both gso and vlan offload, even this is not trivial to do...
The mtu is L3, but drivers/hardware/the wire usually care about L2...
(because ultimately that's what gets received and must fit in receive buffers)
John Fastabend Oct. 8, 2020, 3:19 a.m. UTC | #4
Maciej Żenczykowski wrote:
> On Wed, Oct 7, 2020 at 3:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Daniel Borkmann wrote:
> > > On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> > > [...]
> > > >   net/core/dev.c |   24 ++++++++++++++++++++++--
> > > >   1 file changed, 22 insertions(+), 2 deletions(-)
> >
> > Couple high-level comments. Whats the problem with just letting the driver
> > consume the packet? I would chalk it up to a buggy BPF program that is
> > sending these packets. The drivers really shouldn't panic or do anything
> > horrible under this case because even today I don't think we can be
> > 100% certain MTU on skb matches set MTU. Imagine the case where I change
> > the MTU from 9kB->1500B there will be some skbs in-flight with the larger
> > length and some with the shorter. If the drivers panic/fault or otherwise
> > does something else horrible thats not going to be friendly in general case
> > regardless of what BPF does. And seeing this type of config is all done
> > async its tricky (not practical) to flush any skbs in-flight.
> >
> > I've spent many hours debugging these types of feature flag, mtu
> > change bugs on the driver side I'm not sure it can be resolved by
> > the stack easily. Better to just build drivers that can handle it IMO.
> >
> > Do we know if sending >MTU size skbs to drivers causes problems in real
> > cases? I haven't tried on the NICs I have here, but I expect they should
> > be fine. Fine here being system keeps running as expected. Dropping the
> > skb either on TX or RX side is expected. Even with this change though
> > its possible for the skb to slip through if I configure MTU on a live
> > system.
> 
> I wholeheartedly agree with the above.
> 
> Ideally the only >mtu check should happen at driver admittance.
> But again ideally it should happen in some core stack location not in
> the driver itself.

Ideally maybe, but IMO we should just let the skb go to the driver
and let the driver sort it out. Even if this means pushing the packet
onto the wire then the switch will drop it or the receiver, etc. A
BPF program can do lots of horrible things that should never be
on the wire otherwise. MTU is just one of them, but sending corrupted
payloads, adding bogus headers, checksums etc. so I don't think we can
reasonable protect against all of them.

Of course if the driver is going to hang/panic then something needs
to be done. Perhaps a needs_mtu_check feature flag, although thats
not so nice either so perhaps drivers just need to handle it themselves.
Also even today the case could happen without BPF as best I can tell
so the drivers should be prepared for it.

> However, due to both gso and vlan offload, even this is not trivial to do...
> The mtu is L3, but drivers/hardware/the wire usually care about L2...
> (because ultimately that's what gets received and must fit in receive buffers)
Jesper Dangaard Brouer Oct. 8, 2020, 8:07 a.m. UTC | #5
On Wed, 07 Oct 2020 20:19:39 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Maciej Żenczykowski wrote:
> > On Wed, Oct 7, 2020 at 3:37 PM John Fastabend <john.fastabend@gmail.com> wrote:  
> > >
> > > Daniel Borkmann wrote:  
> > > > On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> > > > [...]  
> > > > >   net/core/dev.c |   24 ++++++++++++++++++++++--
> > > > >   1 file changed, 22 insertions(+), 2 deletions(-)  
> > >
> > > Couple high-level comments. Whats the problem with just letting the driver
> > > consume the packet? I would chalk it up to a buggy BPF program that is
> > > sending these packets. The drivers really shouldn't panic or do anything
> > > horrible under this case because even today I don't think we can be
> > > 100% certain MTU on skb matches set MTU. Imagine the case where I change
> > > the MTU from 9kB->1500B there will be some skbs in-flight with the larger
> > > length and some with the shorter. If the drivers panic/fault or otherwise
> > > does something else horrible thats not going to be friendly in general case
> > > regardless of what BPF does. And seeing this type of config is all done
> > > async its tricky (not practical) to flush any skbs in-flight.
> > >
> > > I've spent many hours debugging these types of feature flag, mtu
> > > change bugs on the driver side I'm not sure it can be resolved by
> > > the stack easily. Better to just build drivers that can handle it IMO.
> > >
> > > Do we know if sending >MTU size skbs to drivers causes problems in real
> > > cases? I haven't tried on the NICs I have here, but I expect they should
> > > be fine. Fine here being system keeps running as expected. Dropping the
> > > skb either on TX or RX side is expected. Even with this change though
> > > its possible for the skb to slip through if I configure MTU on a live
> > > system.  
> > 
> > I wholeheartedly agree with the above.
> > 
> > Ideally the only >mtu check should happen at driver admittance.
> > But again ideally it should happen in some core stack location not in
> > the driver itself.  
> 
> Ideally maybe, but IMO we should just let the skb go to the driver
> and let the driver sort it out. Even if this means pushing the packet
> onto the wire then the switch will drop it or the receiver, etc. A
> BPF program can do lots of horrible things that should never be
> on the wire otherwise. MTU is just one of them, but sending corrupted
> payloads, adding bogus headers, checksums etc. so I don't think we can
> reasonable protect against all of them.

That is a good point.

> Of course if the driver is going to hang/panic then something needs
> to be done. Perhaps a needs_mtu_check feature flag, although thats
> not so nice either so perhaps drivers just need to handle it themselves.
> Also even today the case could happen without BPF as best I can tell
> so the drivers should be prepared for it.

Yes, borderline cases do exist already today (like your reconf with
inflight packets), so I guess drivers should at-least not hang/panic
and be robust enough that we can drop this check.

I think you have convinced me that we can drop this check.  My only
concern is how people can troubleshoot this, but its not different from
current state.


> > However, due to both gso and vlan offload, even this is not trivial to do...
> > The mtu is L3, but drivers/hardware/the wire usually care about L2...

If net_device->mtu is L3 (1500) and XDP (and TC, right?) operate at L2,
that likely means that the "strict" bpf_mtu_check (in my BPF-helper) is
wrong, as XDP (and TC) length at this point include the 14 bytes
Ethernet header.  I will check and fix.

Is this accounted for via net_device->hard_header_len ?
Daniel Borkmann Oct. 8, 2020, 8:25 a.m. UTC | #6
On 10/8/20 10:07 AM, Jesper Dangaard Brouer wrote:
[...]
>>> However, due to both gso and vlan offload, even this is not trivial to do...
>>> The mtu is L3, but drivers/hardware/the wire usually care about L2...
> 
> If net_device->mtu is L3 (1500) and XDP (and TC, right?) operate at L2,
> that likely means that the "strict" bpf_mtu_check (in my BPF-helper) is
> wrong, as XDP (and TC) length at this point include the 14 bytes
> Ethernet header.  I will check and fix.

Yes, both at L2 layer.

> Is this accounted for via net_device->hard_header_len ?

It is, see also ether_setup().
Jesper Dangaard Brouer Oct. 8, 2020, 8:30 a.m. UTC | #7
On Wed, 7 Oct 2020 23:37:00 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> [...]
> >   net/core/dev.c |   24 ++++++++++++++++++++++--
> >   1 file changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b433098896b2..19406013f93e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >   	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
> >   	case TC_ACT_OK:
> >   	case TC_ACT_RECLASSIFY:
> > +		*ret = NET_XMIT_SUCCESS;
> >   		skb->tc_index = TC_H_MIN(cl_res.classid);
> >   		break;
> >   	case TC_ACT_SHOT:
> > @@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   {
> >   	struct net_device *dev = skb->dev;
> >   	struct netdev_queue *txq;
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	bool mtu_check = false;
> > +#endif
> > +	bool again = false;
> >   	struct Qdisc *q;
> >   	int rc = -ENOMEM;
> > -	bool again = false;
> >   
> >   	skb_reset_mac_header(skb);
> >   
> > @@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	qdisc_pkt_len_init(skb);
> >   #ifdef CONFIG_NET_CLS_ACT
> > +	mtu_check = skb_is_redirected(skb);
> >   	skb->tc_at_ingress = 0;
> >   # ifdef CONFIG_NET_EGRESS
> >   	if (static_branch_unlikely(&egress_needed_key)) {
> > +		unsigned int len_orig = skb->len;
> > +
> >   		skb = sch_handle_egress(skb, &rc, dev);
> >   		if (!skb)
> >   			goto out;
> > +		/* BPF-prog ran and could have changed packet size beyond MTU */
> > +		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
> > +			mtu_check = true;
> >   	}
> >   # endif
> > +	/* MTU-check only happens on "last" net_device in a redirect sequence
> > +	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
> > +	 * either ingress or egress to another device).
> > +	 */  
> 
> Hmm, quite some overhead in fast path. 

Not really, normal fast-path already call is_skb_forwardable(). And it
already happens in existing code, ingress redirect code, which I remove
calling in patch 6.

(I have considered inlining is_skb_forwardable as a optimization for
general netstack dev_forward_skb)

> Also, won't this be checked multiple times on stacked devices? :(

I don't think it will be checked multiple times, because we have a
skb_reset_redirect() in ingress path (just after sch_handle_ingress()).

> Moreover, this missed the fact that 'real' qdiscs can have
> filters attached too, this would come after this check. Can't this instead be in
> driver layer for those that really need it? I would probably only drop the check
> as done in 1/6 and allow the BPF prog to do the validation if needed.

See other reply, this is likely what we will end-up with.

> > +	if (mtu_check && !is_skb_forwardable(dev, skb)) {
> > +		rc = -EMSGSIZE;
> > +		goto drop;
> > +	}
> >   #endif
> >   	/* If device/qdisc don't need skb->dst, release it right now while
> >   	 * its hot in this cpu cache.
> > @@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	rc = -ENETDOWN;
> >   	rcu_read_unlock_bh();
> > -
> > +#ifdef CONFIG_NET_CLS_ACT
> > +drop:
> > +#endif
> >   	atomic_long_inc(&dev->tx_dropped);
> >   	kfree_skb_list(skb);
> >   	return rc;
> >   
>
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index b433098896b2..19406013f93e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3870,6 +3870,7 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
 	case TC_ACT_OK:
 	case TC_ACT_RECLASSIFY:
+		*ret = NET_XMIT_SUCCESS;
 		skb->tc_index = TC_H_MIN(cl_res.classid);
 		break;
 	case TC_ACT_SHOT:
@@ -4064,9 +4065,12 @@  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
+#ifdef CONFIG_NET_CLS_ACT
+	bool mtu_check = false;
+#endif
+	bool again = false;
 	struct Qdisc *q;
 	int rc = -ENOMEM;
-	bool again = false;
 
 	skb_reset_mac_header(skb);
 
@@ -4082,14 +4086,28 @@  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
+	mtu_check = skb_is_redirected(skb);
 	skb->tc_at_ingress = 0;
 # ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
+		unsigned int len_orig = skb->len;
+
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
+		/* BPF-prog ran and could have changed packet size beyond MTU */
+		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
+			mtu_check = true;
 	}
 # endif
+	/* MTU-check only happens on "last" net_device in a redirect sequence
+	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
+	 * either ingress or egress to another device).
+	 */
+	if (mtu_check && !is_skb_forwardable(dev, skb)) {
+		rc = -EMSGSIZE;
+		goto drop;
+	}
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
@@ -4157,7 +4175,9 @@  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	rc = -ENETDOWN;
 	rcu_read_unlock_bh();
-
+#ifdef CONFIG_NET_CLS_ACT
+drop:
+#endif
 	atomic_long_inc(&dev->tx_dropped);
 	kfree_skb_list(skb);
 	return rc;