
[net] tcp: Mark fastopen SYN packet as lost when receiving ICMP_TOOBIG/ICMP_FRAG_NEEDED

Message ID 160765171921.6905.7897898635812579754.stgit@localhost.localdomain (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net] tcp: Mark fastopen SYN packet as lost when receiving ICMP_TOOBIG/ICMP_FRAG_NEEDED

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit fail Errors and warnings before: 1560 this patch: 1564
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 43 lines checked
netdev/build_allmodconfig_warn fail Errors and warnings before: 1569 this patch: 1573
netdev/header_inline success
netdev/stable success Stable not CCed

Commit Message

Alexander Duyck Dec. 11, 2020, 1:55 a.m. UTC
From: Alexander Duyck <alexanderduyck@fb.com>

A fastopen SYN may trigger either an ICMP_TOOBIG message in the case of
IPv6 or a fragmentation request in the case of IPv4. This results in the
socket stalling for a second or more, as it does not respond to the
message by retransmitting the SYN frame.

Normally a SYN frame should not be able to trigger an ICMP_TOOBIG or
ICMP_FRAG_NEEDED; however, in the case of fastopen we can have a frame
that makes use of the entire MTU. An additional complication is that the
retransmit queue doesn't contain the original frame; the SYN and its data
are stored as two separate packets. As a result, when
tcp_simple_retransmit is called and walks the list of frames in the
queue, it may not mark the frames as lost because the SYN and the data
packet are each individually smaller than the MSS size after the
adjustment. This leaves the socket stalled until the retransmit timer
kicks in and forces the SYN frame out again without the data attached.

In order to resolve this we need to mark the SYN frame as lost if it is the
first packet in the queue. Doing this allows the socket to recover much
more quickly without the retransmit timeout stall.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
---
 include/net/tcp.h    |    1 +
 net/ipv4/tcp_input.c |    8 ++++++++
 net/ipv4/tcp_ipv4.c  |    6 ++++++
 net/ipv6/tcp_ipv6.c  |    4 ++++
 4 files changed, 19 insertions(+)
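
For context, the retransmit-queue walk in tcp_simple_retransmit() that the
description above refers to looks roughly like the sketch below (simplified,
not the verbatim kernel code; tcp_skb_seglen() is the local helper from
net/ipv4/tcp_input.c):

        /* Simplified sketch of the tcp_simple_retransmit() rtx-queue walk;
         * bookkeeping and the transition into loss recovery are omitted,
         * only the marking decision is shown.
         */
        void tcp_simple_retransmit_sketch(struct sock *sk)
        {
                unsigned int mss = tcp_current_mss(sk); /* already lowered by the ICMP */
                struct sk_buff *skb;

                skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
                        /* A fastopen SYN is stored as a zero-length SYN skb
                         * plus a separate data skb. Each is <= the new mss,
                         * so neither is marked lost even though the combined
                         * on-wire frame exceeded the path MTU.
                         */
                        if (tcp_skb_seglen(skb) > mss)
                                tcp_mark_skb_lost(sk, skb);
                }
        }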

Comments

Eric Dumazet Dec. 11, 2020, 6:24 a.m. UTC | #1
On Fri, Dec 11, 2020 at 2:55 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> From: Alexander Duyck <alexanderduyck@fb.com>
>
> A fastopen SYN may trigger either an ICMP_TOOBIG message in the case of
> IPv6 or a fragmentation request in the case of IPv4. This results in the
> socket stalling for a second or more, as it does not respond to the
> message by retransmitting the SYN frame.
>
> Normally a SYN frame should not be able to trigger an ICMP_TOOBIG or
> ICMP_FRAG_NEEDED; however, in the case of fastopen we can have a frame
> that makes use of the entire MTU. An additional complication is that the
> retransmit queue doesn't contain the original frame; the SYN and its data
> are stored as two separate packets. As a result, when
> tcp_simple_retransmit is called and walks the list of frames in the
> queue, it may not mark the frames as lost because the SYN and the data
> packet are each individually smaller than the MSS size after the
> adjustment. This leaves the socket stalled until the retransmit timer
> kicks in and forces the SYN frame out again without the data attached.
>
> In order to resolve this we need to mark the SYN frame as lost if it is the
> first packet in the queue. Doing this allows the socket to recover much
> more quickly without the retransmit timeout stall.
>
> Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>


I do not think it is a net candidate, but net-next.

Yuchung might correct me, but I think the TCP Fastopen standard was very
conservative about payload length in the SYN packet.

So receiving an ICMP was never considered.

> ---
>  include/net/tcp.h    |    1 +
>  net/ipv4/tcp_input.c |    8 ++++++++
>  net/ipv4/tcp_ipv4.c  |    6 ++++++
>  net/ipv6/tcp_ipv6.c  |    4 ++++
>  4 files changed, 19 insertions(+)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d4ef5bf94168..6181ad98727a 100644
> --- a/include/net/tcp.h


> +++ b/net/ipv4/tcp_ipv4.c
> @@ -546,6 +546,12 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
>                         if (sk->sk_state == TCP_LISTEN)
>                                 goto out;
>
> +                       /* fastopen SYN may have triggered the fragmentation
> +                        * request. Mark the SYN or SYN/ACK as lost.
> +                        */
> +                       if (sk->sk_state == TCP_SYN_SENT)
> +                               tcp_mark_syn_lost(sk);

This is going to crash in some cases; you do not know if you own the socket.
(Look a few lines below.)
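
One way to address that (a sketch, untested) is to defer the new call until
the ownership check a few lines down, next to tcp_v4_mtu_reduced():

                        tp->mtu_info = info;
                        if (!sock_owned_by_user(sk)) {
                                /* we hold the socket, so the retransmit
                                 * queue can be modified safely here
                                 */
                                if (sk->sk_state == TCP_SYN_SENT)
                                        tcp_mark_syn_lost(sk);
                                tcp_v4_mtu_reduced(sk);
                        }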

> +
>                         tp->mtu_info = info;
>                         if (!sock_owned_by_user(sk)) {
>                                 tcp_v4_mtu_reduced(sk);
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 992cbf3eb9e3..d7b1346863e3 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -443,6 +443,10 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>                 if (!ip6_sk_accept_pmtu(sk))
>                         goto out;
>
> +               /* fastopen SYN may have triggered TOOBIG, mark it lost. */
> +               if (sk->sk_state == TCP_SYN_SENT)
> +                       tcp_mark_syn_lost(sk);


Same issue here.

> +
>                 tp->mtu_info = ntohl(info);
>                 if (!sock_owned_by_user(sk))
>                         tcp_v6_mtu_reduced(sk);
>
>
kernel test robot Dec. 11, 2020, 10:51 a.m. UTC | #2
Hi Alexander,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net/master]

url:    https://github.com/0day-ci/linux/commits/Alexander-Duyck/tcp-Mark-fastopen-SYN-packet-as-lost-when-receiving-ICMP_TOOBIG-ICMP_FRAG_NEEDED/20201211-100032
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git d9838b1d39283c1200c13f9076474c7624b8ec34
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/01abda6be2a196620ae2057c3e654edc82beb144
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Alexander-Duyck/tcp-Mark-fastopen-SYN-packet-as-lost-when-receiving-ICMP_TOOBIG-ICMP_FRAG_NEEDED/20201211-100032
        git checkout 01abda6be2a196620ae2057c3e654edc82beb144
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "tcp_mark_syn_lost" [net/ipv6/ipv6.ko] undefined!
ERROR: modpost: "clk_set_min_rate" [sound/soc/atmel/snd-soc-mchp-spdifrx.ko] undefined!
ERROR: modpost: "__delay" [drivers/net/mdio/mdio-cavium.ko] undefined!

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Alexander Duyck Dec. 11, 2020, 4:02 p.m. UTC | #3
On Thu, Dec 10, 2020 at 10:24 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Dec 11, 2020 at 2:55 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > From: Alexander Duyck <alexanderduyck@fb.com>
> >
> > A fastopen SYN may trigger either an ICMP_TOOBIG message in the case of
> > IPv6 or a fragmentation request in the case of IPv4. This results in the
> > socket stalling for a second or more, as it does not respond to the
> > message by retransmitting the SYN frame.
> >
> > Normally a SYN frame should not be able to trigger an ICMP_TOOBIG or
> > ICMP_FRAG_NEEDED; however, in the case of fastopen we can have a frame
> > that makes use of the entire MTU. An additional complication is that the
> > retransmit queue doesn't contain the original frame; the SYN and its data
> > are stored as two separate packets. As a result, when
> > tcp_simple_retransmit is called and walks the list of frames in the
> > queue, it may not mark the frames as lost because the SYN and the data
> > packet are each individually smaller than the MSS size after the
> > adjustment. This leaves the socket stalled until the retransmit timer
> > kicks in and forces the SYN frame out again without the data attached.
> >
> > In order to resolve this we need to mark the SYN frame as lost if it is the
> > first packet in the queue. Doing this allows the socket to recover much
> > more quickly without the retransmit timeout stall.
> >
> > Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
>
>
> I do not think it is a net candidate, but net-next.
>
> Yuchung might correct me, but I think the TCP Fastopen standard was very
> conservative about payload length in the SYN packet.
>
> So receiving an ICMP was never considered.

That's fine. I can target this for net-next. I had just selected net
since I had considered it a fix, but I suppose it could be considered
a behavioral change.

> > ---
> >  include/net/tcp.h    |    1 +
> >  net/ipv4/tcp_input.c |    8 ++++++++
> >  net/ipv4/tcp_ipv4.c  |    6 ++++++
> >  net/ipv6/tcp_ipv6.c  |    4 ++++
> >  4 files changed, 19 insertions(+)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index d4ef5bf94168..6181ad98727a 100644
> > --- a/include/net/tcp.h
>
>
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -546,6 +546,12 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> >                         if (sk->sk_state == TCP_LISTEN)
> >                                 goto out;
> >
> > +                       /* fastopen SYN may have triggered the fragmentation
> > +                        * request. Mark the SYN or SYN/ACK as lost.
> > +                        */
> > +                       if (sk->sk_state == TCP_SYN_SENT)
> > +                               tcp_mark_syn_lost(sk);
>
> This is going to crash in some cases; you do not know if you own the socket.
> (Look a few lines below.)

Okay, I will look into moving this down into the block below since I
assume if it is owned by user we cannot make these changes.

> > +
> >                         tp->mtu_info = info;
> >                         if (!sock_owned_by_user(sk)) {
> >                                 tcp_v4_mtu_reduced(sk);
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 992cbf3eb9e3..d7b1346863e3 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -443,6 +443,10 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
> >                 if (!ip6_sk_accept_pmtu(sk))
> >                         goto out;
> >
> > +               /* fastopen SYN may have triggered TOOBIG, mark it lost. */
> > +               if (sk->sk_state == TCP_SYN_SENT)
> > +                       tcp_mark_syn_lost(sk);
>
>
> Same issue here.

I'll move this one too.

> > +
> >                 tp->mtu_info = ntohl(info);
> >                 if (!sock_owned_by_user(sk))
> >                         tcp_v6_mtu_reduced(sk);
> >
> >
Eric Dumazet Dec. 11, 2020, 4:22 p.m. UTC | #4
On Fri, Dec 11, 2020 at 5:03 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> That's fine. I can target this for net-next. I had just selected net
> since I had considered it a fix, but I suppose it could be considered
> a behavioral change.

We are very late in the 5.10 cycle, and we never handled ICMP in this
state, so net-next is definitely better.

Note that RFC 7413 states in 4.1.3 :

 The client MUST cache cookies from servers for later Fast Open
   connections.  For a multihomed client, the cookies are dependent on
   the client and server IP addresses.  Hence, the client should cache
   at most one (most recently received) cookie per client and server IP
   address pair.

   When caching cookies, we recommend that the client also cache the
   Maximum Segment Size (MSS) advertised by the server.  The client can
   cache the MSS advertised by the server in order to determine the
   maximum amount of data that the client can fit in the SYN packet in
   subsequent TFO connections.  Caching the server MSS is useful
   because, with Fast Open, a client sends data in the SYN packet before
   the server announces its MSS in the SYN-ACK packet.  If the client
   sends more data in the SYN packet than the server will accept, this
   will likely require the client to retransmit some or all of the data.
   Hence, caching the server MSS can enhance performance.

   Without a cached server MSS, the amount of data in the SYN packet is
   limited to the default MSS of 536 bytes for IPv4 [RFC1122] and 1220
   bytes for IPv6 [RFC2460].  Even if the client complies with this
   limit when sending the SYN, it is known that an IPv4 receiver
   advertising an MSS less than 536 bytes can receive a segment larger
   than it is expecting.

   If the cached MSS is larger than the typical size (1460 bytes for
   IPv4 or 1440 bytes for IPv6), then the excess data in the SYN packet
   may cause problems that offset the performance benefit of Fast Open.
   For example, the unusually large SYN may trigger IP fragmentation and
   may confuse firewalls or middleboxes, causing SYN retransmission and
   other side effects.  Therefore, the client MAY limit the cached MSS
   to 1460 bytes for IPv4 or 1440 for IPv6.


Relying on ICMP is fragile, since they can be filtered in some way.
Alexander Duyck Dec. 11, 2020, 5:15 p.m. UTC | #5
On Fri, Dec 11, 2020 at 8:22 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Dec 11, 2020 at 5:03 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
> > That's fine. I can target this for net-next. I had just selected net
> > since I had considered it a fix, but I suppose it could be considered
> > a behavioral change.
>
> We are very late in the 5.10 cycle, and we never handled ICMP in this
> state, so net-next is definitely better.
>
> Note that RFC 7413 states in 4.1.3 :
>
>  The client MUST cache cookies from servers for later Fast Open
>    connections.  For a multihomed client, the cookies are dependent on
>    the client and server IP addresses.  Hence, the client should cache
>    at most one (most recently received) cookie per client and server IP
>    address pair.
>
>    When caching cookies, we recommend that the client also cache the
>    Maximum Segment Size (MSS) advertised by the server.  The client can
>    cache the MSS advertised by the server in order to determine the
>    maximum amount of data that the client can fit in the SYN packet in
>    subsequent TFO connections.  Caching the server MSS is useful
>    because, with Fast Open, a client sends data in the SYN packet before
>    the server announces its MSS in the SYN-ACK packet.  If the client
>    sends more data in the SYN packet than the server will accept, this
>    will likely require the client to retransmit some or all of the data.
>    Hence, caching the server MSS can enhance performance.
>
>    Without a cached server MSS, the amount of data in the SYN packet is
>    limited to the default MSS of 536 bytes for IPv4 [RFC1122] and 1220
>    bytes for IPv6 [RFC2460].  Even if the client complies with this
>    limit when sending the SYN, it is known that an IPv4 receiver
>    advertising an MSS less than 536 bytes can receive a segment larger
>    than it is expecting.
>
>    If the cached MSS is larger than the typical size (1460 bytes for
>    IPv4 or 1440 bytes for IPv6), then the excess data in the SYN packet
>    may cause problems that offset the performance benefit of Fast Open.
>    For example, the unusually large SYN may trigger IP fragmentation and
>    may confuse firewalls or middleboxes, causing SYN retransmission and
>    other side effects.  Therefore, the client MAY limit the cached MSS
>    to 1460 bytes for IPv4 or 1440 for IPv6.
>
>
> Relying on ICMP is fragile, since they can be filtered in some way.

In this case I am not relying on the ICMP, but thought that since I
have it I should make use of it. Without the ICMP we would still just
be waiting on the retransmit timer.

The problem case has a v6-in-v6 tunnel between the client and the
endpoint so both ends assume an MTU 1500 and advertise a 1440 MSS
which works fine until they actually go to send a large packet between
the two. At that point the tunnel is triggering an ICMP_TOOBIG and the
endpoint is stalling since the MSS is dropped to 1400, but the SYN and
data payload were already smaller than that so no retransmits are
being triggered. This results in TFO being 1s slower than non-TFO
because of the failure to trigger the retransmit for the frame that
violated the PMTU. The patch is meant to get the two back into
comparable times.
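
To put numbers on that (illustrative, assuming a plain 40-byte v6-in-v6
encapsulation):

        outer MTU:                      1500
        v6-in-v6 encapsulation:         - 40   (extra IPv6 header)
        tunnel path MTU:                1460
        inner IPv6 (40) + TCP (20):     - 60
        MSS after ICMP_TOOBIG:          1400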
Eric Dumazet Dec. 11, 2020, 7:17 p.m. UTC | #6
On Fri, Dec 11, 2020 at 6:15 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Dec 11, 2020 at 8:22 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Dec 11, 2020 at 5:03 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >
> > > That's fine. I can target this for net-next. I had just selected net
> > > since I had considered it a fix, but I suppose it could be considered
> > > a behavioral change.
> >
> > We are very late in the 5.10 cycle, and we never handled ICMP in this
> > state, so net-next is definitely better.
> >
> > Note that RFC 7413 states in 4.1.3 :
> >
> >  The client MUST cache cookies from servers for later Fast Open
> >    connections.  For a multihomed client, the cookies are dependent on
> >    the client and server IP addresses.  Hence, the client should cache
> >    at most one (most recently received) cookie per client and server IP
> >    address pair.
> >
> >    When caching cookies, we recommend that the client also cache the
> >    Maximum Segment Size (MSS) advertised by the server.  The client can
> >    cache the MSS advertised by the server in order to determine the
> >    maximum amount of data that the client can fit in the SYN packet in
> >    subsequent TFO connections.  Caching the server MSS is useful
> >    because, with Fast Open, a client sends data in the SYN packet before
> >    the server announces its MSS in the SYN-ACK packet.  If the client
> >    sends more data in the SYN packet than the server will accept, this
> >    will likely require the client to retransmit some or all of the data.
> >    Hence, caching the server MSS can enhance performance.
> >
> >    Without a cached server MSS, the amount of data in the SYN packet is
> >    limited to the default MSS of 536 bytes for IPv4 [RFC1122] and 1220
> >    bytes for IPv6 [RFC2460].  Even if the client complies with this
> >    limit when sending the SYN, it is known that an IPv4 receiver
> >    advertising an MSS less than 536 bytes can receive a segment larger
> >    than it is expecting.
> >
> >    If the cached MSS is larger than the typical size (1460 bytes for
> >    IPv4 or 1440 bytes for IPv6), then the excess data in the SYN packet
> >    may cause problems that offset the performance benefit of Fast Open.
> >    For example, the unusually large SYN may trigger IP fragmentation and
> >    may confuse firewalls or middleboxes, causing SYN retransmission and
> >    other side effects.  Therefore, the client MAY limit the cached MSS
> >    to 1460 bytes for IPv4 or 1440 for IPv6.
> >
> >
> > Relying on ICMP is fragile, since they can be filtered in some way.
>
> In this case I am not relying on the ICMP, but thought that since I
> have it I should make use of it. Without the ICMP we would still just
> be waiting on the retransmit timer.
>
> The problem case has a v6-in-v6 tunnel between the client and the
> endpoint so both ends assume an MTU 1500 and advertise a 1440 MSS
> which works fine until they actually go to send a large packet between
> the two. At that point the tunnel is triggering an ICMP_TOOBIG and the
> endpoint is stalling since the MSS is dropped to 1400, but the SYN and
> data payload were already smaller than that so no retransmits are
> being triggered. This results in TFO being 1s slower than non-TFO
> because of the failure to trigger the retransmit for the frame that
> violated the PMTU. The patch is meant to get the two back into
> comparable times.

Okay... Have you studied why tcp_v4_mtu_reduced() (and IPv6 equivalent)
code does not yet handle the retransmit in TCP_SYN_SENT state?
Alexander Duyck Dec. 11, 2020, 9:51 p.m. UTC | #7
On Fri, Dec 11, 2020 at 11:18 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Dec 11, 2020 at 6:15 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, Dec 11, 2020 at 8:22 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 5:03 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > >
> > > > That's fine. I can target this for net-next. I had just selected net
> > > > since I had considered it a fix, but I suppose it could be considered
> > > > a behavioral change.
> > >
> > > We are very late in the 5.10 cycle, and we never handled ICMP in this
> > > state, so net-next is definitely better.
> > >
> > > Note that RFC 7413 states in 4.1.3 :
> > >
> > >  The client MUST cache cookies from servers for later Fast Open
> > >    connections.  For a multihomed client, the cookies are dependent on
> > >    the client and server IP addresses.  Hence, the client should cache
> > >    at most one (most recently received) cookie per client and server IP
> > >    address pair.
> > >
> > >    When caching cookies, we recommend that the client also cache the
> > >    Maximum Segment Size (MSS) advertised by the server.  The client can
> > >    cache the MSS advertised by the server in order to determine the
> > >    maximum amount of data that the client can fit in the SYN packet in
> > >    subsequent TFO connections.  Caching the server MSS is useful
> > >    because, with Fast Open, a client sends data in the SYN packet before
> > >    the server announces its MSS in the SYN-ACK packet.  If the client
> > >    sends more data in the SYN packet than the server will accept, this
> > >    will likely require the client to retransmit some or all of the data.
> > >    Hence, caching the server MSS can enhance performance.
> > >
> > >    Without a cached server MSS, the amount of data in the SYN packet is
> > >    limited to the default MSS of 536 bytes for IPv4 [RFC1122] and 1220
> > >    bytes for IPv6 [RFC2460].  Even if the client complies with this
> > >    limit when sending the SYN, it is known that an IPv4 receiver
> > >    advertising an MSS less than 536 bytes can receive a segment larger
> > >    than it is expecting.
> > >
> > >    If the cached MSS is larger than the typical size (1460 bytes for
> > >    IPv4 or 1440 bytes for IPv6), then the excess data in the SYN packet
> > >    may cause problems that offset the performance benefit of Fast Open.
> > >    For example, the unusually large SYN may trigger IP fragmentation and
> > >    may confuse firewalls or middleboxes, causing SYN retransmission and
> > >    other side effects.  Therefore, the client MAY limit the cached MSS
> > >    to 1460 bytes for IPv4 or 1440 for IPv6.
> > >
> > >
> > > Relying on ICMP is fragile, since they can be filtered in some way.
> >
> > In this case I am not relying on the ICMP, but thought that since I
> > have it I should make use of it. Without the ICMP we would still just
> > be waiting on the retransmit timer.
> >
> > The problem case has a v6-in-v6 tunnel between the client and the
> > endpoint so both ends assume an MTU 1500 and advertise a 1440 MSS
> > which works fine until they actually go to send a large packet between
> > the two. At that point the tunnel is triggering an ICMP_TOOBIG and the
> > endpoint is stalling since the MSS is dropped to 1400, but the SYN and
> > data payload were already smaller than that so no retransmits are
> > being triggered. This results in TFO being 1s slower than non-TFO
> > because of the failure to trigger the retransmit for the frame that
> > violated the PMTU. The patch is meant to get the two back into
> > comparable times.
>
> Okay... Have you studied why tcp_v4_mtu_reduced() (and IPv6 equivalent)
> code does not yet handle the retransmit in TCP_SYN_SENT state?

The problem lies in tcp_simple_retransmit(). Specifically the loop at
the start of the function goes to check the retransmit queue to see if
there are any packets larger than MSS and finds none since we don't
place the SYN w/ data in there and instead have a separate SYN and
data packet.

I'm debating if I should take an alternative approach and modify the
loop at the start of tcp_simple_retransmit to add a check for a SYN
packet, tp->syn_data being set, and then comparing the next frame
length + MAX_TCP_HEADER_OPTIONS versus mss.
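
Roughly like this (a sketch, untested; MAX_TCP_OPTION_SPACE is the existing
kernel macro for the 40 bytes of TCP option space, standing in for the
MAX_TCP_HEADER_OPTIONS name above):

                /* inside the tcp_simple_retransmit() walk: the fastopen SYN
                 * and its separate data segment travelled in one frame, so
                 * judge them against mss together rather than individually
                 */
                if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
                        struct sk_buff *next = skb_rb_next(skb);

                        if (next && next->len + MAX_TCP_OPTION_SPACE > mss)
                                tcp_mark_skb_lost(sk, skb);
                }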
Yuchung Cheng Dec. 11, 2020, 10:53 p.m. UTC | #8
On Fri, Dec 11, 2020 at 1:51 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Dec 11, 2020 at 11:18 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Dec 11, 2020 at 6:15 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 8:22 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Dec 11, 2020 at 5:03 PM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > > That's fine. I can target this for net-next. I had just selected net
> > > > > since I had considered it a fix, but I suppose it could be considered
> > > > > a behavioral change.
> > > >
> > > > We are very late in the 5.10 cycle, and we never handled ICMP in this
> > > > state, so net-next is definitely better.
> > > >
> > > > Note that RFC 7413 states in 4.1.3 :
> > > >
> > > >  The client MUST cache cookies from servers for later Fast Open
> > > >    connections.  For a multihomed client, the cookies are dependent on
> > > >    the client and server IP addresses.  Hence, the client should cache
> > > >    at most one (most recently received) cookie per client and server IP
> > > >    address pair.
> > > >
> > > >    When caching cookies, we recommend that the client also cache the
> > > >    Maximum Segment Size (MSS) advertised by the server.  The client can
> > > >    cache the MSS advertised by the server in order to determine the
> > > >    maximum amount of data that the client can fit in the SYN packet in
> > > >    subsequent TFO connections.  Caching the server MSS is useful
> > > >    because, with Fast Open, a client sends data in the SYN packet before
> > > >    the server announces its MSS in the SYN-ACK packet.  If the client
> > > >    sends more data in the SYN packet than the server will accept, this
> > > >    will likely require the client to retransmit some or all of the data.
> > > >    Hence, caching the server MSS can enhance performance.
> > > >
> > > >    Without a cached server MSS, the amount of data in the SYN packet is
> > > >    limited to the default MSS of 536 bytes for IPv4 [RFC1122] and 1220
> > > >    bytes for IPv6 [RFC2460].  Even if the client complies with this
> > > >    limit when sending the SYN, it is known that an IPv4 receiver
> > > >    advertising an MSS less than 536 bytes can receive a segment larger
> > > >    than it is expecting.
> > > >
> > > >    If the cached MSS is larger than the typical size (1460 bytes for
> > > >    IPv4 or 1440 bytes for IPv6), then the excess data in the SYN packet
> > > >    may cause problems that offset the performance benefit of Fast Open.
> > > >    For example, the unusually large SYN may trigger IP fragmentation and
> > > >    may confuse firewalls or middleboxes, causing SYN retransmission and
> > > >    other side effects.  Therefore, the client MAY limit the cached MSS
> > > >    to 1460 bytes for IPv4 or 1440 for IPv6.
> > > >
> > > >
> > > > Relying on ICMP is fragile, since they can be filtered in some way.
> > >
> > > In this case I am not relying on the ICMP, but thought that since I
> > > have it I should make use of it. Without the ICMP we would still just
> > > be waiting on the retransmit timer.
> > >
> > > The problem case has a v6-in-v6 tunnel between the client and the
> > > endpoint so both ends assume an MTU 1500 and advertise a 1440 MSS
> > > which works fine until they actually go to send a large packet between
> > > the two. At that point the tunnel is triggering an ICMP_TOOBIG and the
> > > endpoint is stalling since the MSS is dropped to 1400, but the SYN and
> > > data payload were already smaller than that so no retransmits are
> > > being triggered. This results in TFO being 1s slower than non-TFO
> > > because of the failure to trigger the retransmit for the frame that
> > > violated the PMTU. The patch is meant to get the two back into
> > > comparable times.
> >
> > Okay... Have you studied why tcp_v4_mtu_reduced() (and IPv6 equivalent)
> > code does not yet handle the retransmit in TCP_SYN_SENT state?
>
> The problem lies in tcp_simple_retransmit(). Specifically the loop at
> the start of the function goes to check the retransmit queue to see if
> there are any packets larger than MSS and finds none since we don't
> place the SYN w/ data in there and instead have a separate SYN and
> data packet.
>
> I'm debating if I should take an alternative approach and modify the
> loop at the start of tcp_simple_retransmit to add a check for a SYN
> packet, tp->syn_data being set, and then comparing the next frame
> length + MAX_TCP_HEADER_OPTIONS versus mss.
Thanks for bringing up this tricky issue. The root cause seems to be
the special arrangement of storing SYN-data as one-(pure)-SYN and one
non-SYN data segment. Given tcp_simple_retransmit probably is not called
frequently, your alternative approach sounds more appealing to me.

Replacing that strange syn|data arrangement for TFO has been on my
wish list for a long time... Ideally it's better to store the SYN+data
and just carve out the SYN for retransmit.

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d4ef5bf94168..6181ad98727a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2062,6 +2062,7 @@  void tcp_init(void);
 
 /* tcp_recovery.c */
 void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
+void tcp_mark_syn_lost(struct sock *sk);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb,
 				u32 reo_wnd);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 389d1b340248..d0c5248bc4e1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1066,6 +1066,14 @@  void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
+void tcp_mark_syn_lost(struct sock *sk)
+{
+	struct sk_buff *skb = tcp_rtx_queue_head(sk);
+
+	if (skb && TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+		tcp_mark_skb_lost(sk, skb);
+}
+
 /* Updates the delivered and delivered_ce counts */
 static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
 				bool ece_ack)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8391aa29e7a4..ad62fe029646 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -546,6 +546,12 @@  int tcp_v4_err(struct sk_buff *skb, u32 info)
 			if (sk->sk_state == TCP_LISTEN)
 				goto out;
 
+			/* fastopen SYN may have triggered the fragmentation
+			 * request. Mark the SYN or SYN/ACK as lost.
+			 */
+			if (sk->sk_state == TCP_SYN_SENT)
+				tcp_mark_syn_lost(sk);
+
 			tp->mtu_info = info;
 			if (!sock_owned_by_user(sk)) {
 				tcp_v4_mtu_reduced(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 992cbf3eb9e3..d7b1346863e3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -443,6 +443,10 @@  static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		if (!ip6_sk_accept_pmtu(sk))
 			goto out;
 
+		/* fastopen SYN may have triggered TOOBIG, mark it lost. */
+		if (sk->sk_state == TCP_SYN_SENT)
+			tcp_mark_syn_lost(sk);
+
 		tp->mtu_info = ntohl(info);
 		if (!sock_owned_by_user(sk))
 			tcp_v6_mtu_reduced(sk);