
[net-next] tcp: refine tcp_prune_ofo_queue() logic

Message ID 20221101035234.3910189-1-edumazet@google.com (mailing list archive)
State Accepted
Commit b0e01253a764faab9ecf56c0759e6b06470eb9ff
Delegated to: Netdev Maintainers
Series [net-next] tcp: refine tcp_prune_ofo_queue() logic

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3 this patch: 3
netdev/cc_maintainers warning 2 maintainers not CCed: dsahern@kernel.org yoshfuji@linux-ipv6.org
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3 this patch: 3
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet Nov. 1, 2022, 3:52 a.m. UTC
After commits 36a6503fedda ("tcp: refine tcp_prune_ofo_queue()
to not drop all packets") and 72cd43ba64fc1
("tcp: free batches of packets in tcp_prune_ofo_queue()"),
tcp_prune_ofo_queue() drops a fraction of the ooo queue
to make room for the incoming packet.

However, it makes no sense to drop packets that are
before the incoming packet in sequence space.

In order to recover from packet losses faster,
it makes more sense to only drop ooo packets
that are after the incoming packet.
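
The heart of the change is a wrap-safe sequence comparison: a queued
ooo segment is only eligible for pruning if it does not start before
the incoming segment. As a standalone illustration of that rule
(seq_after() and prune_candidate() are made-up names mirroring the
kernel's after() helper and the new in-loop check, not code from this
patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Wrap-safe "seq1 is strictly after seq2", same idea as the kernel's
 * after() macro: interpret the 32-bit difference as signed.
 */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) > 0;
}

/* New pruning rule: walking the ooo queue from its highest sequence
 * downwards, a queued segment may be dropped only if the incoming
 * segment does not start after it, i.e. only data at or beyond the
 * incoming segment is freed; earlier holes keep their chance of
 * being filled.
 */
static bool prune_candidate(uint32_t queued_seq, uint32_t incoming_seq)
{
	return !seq_after(incoming_seq, queued_seq);
}

int main(void)
{
	/* In the test below, 1400:1500 arrives while the queue holds
	 * 200..1300: nothing already queued qualifies for pruning, so
	 * the newcomer itself is dropped and 200:300 survives to be
	 * acked once the hole at 1:200 is filled.
	 */
	printf("%d\n", prune_candidate(1200, 1400)); /* 0: keep 1200:1300 */
	printf("%d\n", prune_candidate(1400, 1200)); /* 1: a later segment would go */
	return 0;
}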

Tested:
packetdrill test:
   0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3800], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 0>
  +.1 < . 1:1(0) ack 1 win 1024
   +0 accept(3, ..., ...) = 4

 +.01 < . 200:300(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 200:300>

 +.01 < . 400:500(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 400:500 200:300>

 +.01 < . 600:700(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 600:700 400:500 200:300>

 +.01 < . 800:900(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 800:900 600:700 400:500 200:300>

 +.01 < . 1000:1100(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 1000:1100 800:900 600:700 400:500>

 +.01 < . 1200:1300(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>

// this packet is dropped because we have no room left.
 +.01 < . 1400:1500(100) ack 1 win 1024

 +.01 < . 1:200(199) ack 1 win 1024
// Make sure kernel did not drop 200:300 sequence
   +0 > . 1:1(0) ack 300 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
// Make room, since our RCVBUF is very small
   +0 read(4, ..., 299) = 299

 +.01 < . 300:400(100) ack 1 win 1024
   +0 > . 1:1(0) ack 500 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>

 +.01 < . 500:600(100) ack 1 win 1024
   +0 > . 1:1(0) ack 700 <nop,nop, sack 1200:1300 1000:1100 800:900>

   +0 read(4, ..., 400) = 400

 +.01 < . 700:800(100) ack 1 win 1024
   +0 > . 1:1(0) ack 900 <nop,nop, sack 1200:1300 1000:1100>

 +.01 < . 900:1000(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1100 <nop,nop, sack 1200:1300>

 +.01 < . 1100:1200(100) ack 1 win 1024
// This checks that 1200:1300 has not been removed from ooo queue
   +0 > . 1:1(0) ack 1300

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 51 +++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 20 deletions(-)

Comments

Soheil Hassas Yeganeh Nov. 1, 2022, 4:18 a.m. UTC | #1
On Mon, Oct 31, 2022 at 11:52 PM Eric Dumazet <edumazet@google.com> wrote:
>
> After commits 36a6503fedda ("tcp: refine tcp_prune_ofo_queue()
> to not drop all packets") and 72cd43ba64fc1
> ("tcp: free batches of packets in tcp_prune_ofo_queue()"),
> tcp_prune_ofo_queue() drops a fraction of the ooo queue
> to make room for the incoming packet.
>
> [...]
>
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Very nice idea! It should lower tail latency on highly congested
links. Thank you Eric!

Kuniyuki Iwashima Nov. 1, 2022, 4:54 p.m. UTC | #2
From:   Eric Dumazet <edumazet@google.com>
Date:   Tue,  1 Nov 2022 03:52:34 +0000
> After commits 36a6503fedda ("tcp: refine tcp_prune_ofo_queue()
> to not drop all packets") and 72cd43ba64fc1
> ("tcp: free batches of packets in tcp_prune_ofo_queue()"),
> tcp_prune_ofo_queue() drops a fraction of the ooo queue
> to make room for the incoming packet.
> 
> [...]
> 
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Sounds good!

Thank you.


patchwork-bot+netdevbpf@kernel.org Nov. 2, 2022, 4:30 a.m. UTC | #3
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Tue,  1 Nov 2022 03:52:34 +0000 you wrote:
> After commits 36a6503fedda ("tcp: refine tcp_prune_ofo_queue()
> to not drop all packets") and 72cd43ba64fc1
> ("tcp: free batches of packets in tcp_prune_ofo_queue()")
> tcp_prune_ofo_queue() drops a fraction of ooo queue,
> to make room for incoming packet.
> 
> However it makes no sense to drop packets that are
> before the incoming packet, in sequence space.
> 
> [...]

Here is the summary with links:
  - [net-next] tcp: refine tcp_prune_ofo_queue() logic
    https://git.kernel.org/netdev/net-next/c/b0e01253a764

You are awesome, thank you!

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0640453fce54b6daae0861d948f3db075830daf6..d764b5854dfcc865207b5eb749c29013ef18bdbc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4764,8 +4764,8 @@  static void tcp_ofo_queue(struct sock *sk)
 	}
 }
 
-static bool tcp_prune_ofo_queue(struct sock *sk);
-static int tcp_prune_queue(struct sock *sk);
+static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
+static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
 
 static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 				 unsigned int size)
@@ -4773,11 +4773,11 @@  static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
 	    !sk_rmem_schedule(sk, skb, size)) {
 
-		if (tcp_prune_queue(sk) < 0)
+		if (tcp_prune_queue(sk, skb) < 0)
 			return -1;
 
 		while (!sk_rmem_schedule(sk, skb, size)) {
-			if (!tcp_prune_ofo_queue(sk))
+			if (!tcp_prune_ofo_queue(sk, skb))
 				return -1;
 		}
 	}
@@ -5329,6 +5329,8 @@  static void tcp_collapse_ofo_queue(struct sock *sk)
  * Clean the out-of-order queue to make room.
  * We drop high sequences packets to :
  * 1) Let a chance for holes to be filled.
+ *    This means we do not drop packets from ooo queue if their sequence
+ *    is before incoming packet sequence.
  * 2) not add too big latencies if thousands of packets sit there.
  *    (But if application shrinks SO_RCVBUF, we could still end up
  *     freeing whole queue here)
@@ -5336,24 +5338,31 @@  static void tcp_collapse_ofo_queue(struct sock *sk)
  *
  * Return true if queue has shrunk.
  */
-static bool tcp_prune_ofo_queue(struct sock *sk)
+static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct rb_node *node, *prev;
+	bool pruned = false;
 	int goal;
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 		return false;
 
-	NET_INC_STATS(sock_net(sk), LINUX_MIB_OFOPRUNED);
 	goal = sk->sk_rcvbuf >> 3;
 	node = &tp->ooo_last_skb->rbnode;
+
 	do {
+		struct sk_buff *skb = rb_to_skb(node);
+
+		/* If incoming skb would land last in ofo queue, stop pruning. */
+		if (after(TCP_SKB_CB(in_skb)->seq, TCP_SKB_CB(skb)->seq))
+			break;
+		pruned = true;
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
-		goal -= rb_to_skb(node)->truesize;
-		tcp_drop_reason(sk, rb_to_skb(node),
-				SKB_DROP_REASON_TCP_OFO_QUEUE_PRUNE);
+		goal -= skb->truesize;
+		tcp_drop_reason(sk, skb, SKB_DROP_REASON_TCP_OFO_QUEUE_PRUNE);
+		tp->ooo_last_skb = rb_to_skb(prev);
 		if (!prev || goal <= 0) {
 			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
 			    !tcp_under_memory_pressure(sk))
@@ -5362,16 +5371,18 @@  static bool tcp_prune_ofo_queue(struct sock *sk)
 		}
 		node = prev;
 	} while (node);
-	tp->ooo_last_skb = rb_to_skb(prev);
 
-	/* Reset SACK state.  A conforming SACK implementation will
-	 * do the same at a timeout based retransmit.  When a connection
-	 * is in a sad state like this, we care only about integrity
-	 * of the connection not performance.
-	 */
-	if (tp->rx_opt.sack_ok)
-		tcp_sack_reset(&tp->rx_opt);
-	return true;
+	if (pruned) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_OFOPRUNED);
+		/* Reset SACK state.  A conforming SACK implementation will
+		 * do the same at a timeout based retransmit.  When a connection
+		 * is in a sad state like this, we care only about integrity
+		 * of the connection not performance.
+		 */
+		if (tp->rx_opt.sack_ok)
+			tcp_sack_reset(&tp->rx_opt);
+	}
+	return pruned;
 }
 
 /* Reduce allocated memory if we can, trying to get
@@ -5381,7 +5392,7 @@  static bool tcp_prune_ofo_queue(struct sock *sk)
  * until the socket owning process reads some of the data
  * to stabilize the situation.
  */
-static int tcp_prune_queue(struct sock *sk)
+static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -5408,7 +5419,7 @@  static int tcp_prune_queue(struct sock *sk)
 	/* Collapsing did not help, destructive actions follow.
 	 * This must not ever occur. */
 
-	tcp_prune_ofo_queue(sk);
+	tcp_prune_ofo_queue(sk, in_skb);
 
 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
 		return 0;