| Message ID | 20231023-send-net-next-20231023-2-v1-9-9dc60939d371@kernel.org (mailing list archive) |
|---|---|
| State | Mainlined, archived |
| Delegated to: | Mat Martineau |
| Series | mptcp: Features and fixes for v6.7 |
On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
>
> From: Paolo Abeni <pabeni@redhat.com>
>
> The MPTCP protocol accounts for the data enqueued on all the subflows
> to the main socket send buffer, while the send buffer auto-tuning
> algorithm sets the main socket send buffer size as the max size among
> the subflows.
>
> That causes bad performance when at least one subflow is sndbuf
> limited, e.g. due to very high latency, as the MPTCP scheduler can't
> even fill such a buffer.
>
> Change the send-buffer auto-tuning algorithm to compute the main socket
> send buffer size as the sum of all the subflows' buffer sizes.
>
> Reviewed-by: Mat Martineau <martineau@kernel.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Mat Martineau <martineau@kernel.org>

...

> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> index df208666fd19..2b43577f952e 100644
> --- a/net/mptcp/subflow.c
> +++ b/net/mptcp/subflow.c
> @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
>
>  void __mptcp_set_connected(struct sock *sk)
>  {
> +        __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);

->first can be NULL here, according to syzbot.

>          if (sk->sk_state == TCP_SYN_SENT) {
>                  inet_sk_state_store(sk, TCP_ESTABLISHED);
>                  sk->sk_state_change(sk);
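A minimal sketch of the kind of NULL guard this observation points at, assuming the problem is simply that msk->first may already have been cleared (e.g. by a concurrent close) when this callback runs; illustration only, not necessarily the fix that eventually landed in the net tree:

/* Sketch only: skip the sndbuf propagation when the first subflow is
 * already gone, instead of dereferencing a NULL msk->first.
 */
void __mptcp_set_connected(struct sock *sk)
{
	struct sock *ssk = READ_ONCE(mptcp_sk(sk)->first);

	if (ssk)
		__mptcp_propagate_sndbuf(sk, ssk);

	if (sk->sk_state == TCP_SYN_SENT) {
		inet_sk_state_store(sk, TCP_ESTABLISHED);
		sk->sk_state_change(sk);
	}
}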
On Thu, 2 Nov 2023, Eric Dumazet wrote:

> On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
>>
>> From: Paolo Abeni <pabeni@redhat.com>
>>
>> The MPTCP protocol accounts for the data enqueued on all the subflows
>> to the main socket send buffer, while the send buffer auto-tuning
>> algorithm sets the main socket send buffer size as the max size among
>> the subflows.
>>
>> That causes bad performance when at least one subflow is sndbuf
>> limited, e.g. due to very high latency, as the MPTCP scheduler can't
>> even fill such a buffer.
>>
>> Change the send-buffer auto-tuning algorithm to compute the main socket
>> send buffer size as the sum of all the subflows' buffer sizes.
>>
>> Reviewed-by: Mat Martineau <martineau@kernel.org>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> Signed-off-by: Mat Martineau <martineau@kernel.org>
>
> ...
>
>> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
>> index df208666fd19..2b43577f952e 100644
>> --- a/net/mptcp/subflow.c
>> +++ b/net/mptcp/subflow.c
>> @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
>>
>>  void __mptcp_set_connected(struct sock *sk)
>>  {
>> +        __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
>
> ->first can be NULL here, according to syzbot.
>

Thanks for reporting this, Eric. We will get a fix to the net tree.

Paolo & Matthieu, I created a github issue to track this:
https://github.com/multipath-tcp/mptcp_net-next/issues/454

- Mat
Hi,

On Thu, 2023-11-02 at 18:19 +0100, Eric Dumazet wrote:
> On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
> >
> > From: Paolo Abeni <pabeni@redhat.com>
> >
> > The MPTCP protocol accounts for the data enqueued on all the subflows
> > to the main socket send buffer, while the send buffer auto-tuning
> > algorithm sets the main socket send buffer size as the max size among
> > the subflows.
> >
> > That causes bad performance when at least one subflow is sndbuf
> > limited, e.g. due to very high latency, as the MPTCP scheduler can't
> > even fill such a buffer.
> >
> > Change the send-buffer auto-tuning algorithm to compute the main socket
> > send buffer size as the sum of all the subflows' buffer sizes.
> >
> > Reviewed-by: Mat Martineau <martineau@kernel.org>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Signed-off-by: Mat Martineau <martineau@kernel.org>
>
> ...
>
> > diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> > index df208666fd19..2b43577f952e 100644
> > --- a/net/mptcp/subflow.c
> > +++ b/net/mptcp/subflow.c
> > @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
> >
> >  void __mptcp_set_connected(struct sock *sk)
> >  {
> > +        __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
>
> ->first can be NULL here, according to syzbot.

I'm sorry for the latency on my side; I had a different kind of crash
to handle here.

Do you have a syzkaller report available? Or the call trace landing
here?

Thanks!

Paolo
On Mon, Nov 6, 2023 at 8:22 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On Thu, 2023-11-02 at 18:19 +0100, Eric Dumazet wrote:
> > On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
> > >
> > > From: Paolo Abeni <pabeni@redhat.com>
> > >
> > > The MPTCP protocol accounts for the data enqueued on all the subflows
> > > to the main socket send buffer, while the send buffer auto-tuning
> > > algorithm sets the main socket send buffer size as the max size among
> > > the subflows.
> > >
> > > That causes bad performance when at least one subflow is sndbuf
> > > limited, e.g. due to very high latency, as the MPTCP scheduler can't
> > > even fill such a buffer.
> > >
> > > Change the send-buffer auto-tuning algorithm to compute the main socket
> > > send buffer size as the sum of all the subflows' buffer sizes.
> > >
> > > Reviewed-by: Mat Martineau <martineau@kernel.org>
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > Signed-off-by: Mat Martineau <martineau@kernel.org>
> >
> > ...
> >
> > > diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> > > index df208666fd19..2b43577f952e 100644
> > > --- a/net/mptcp/subflow.c
> > > +++ b/net/mptcp/subflow.c
> > > @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
> > >
> > >  void __mptcp_set_connected(struct sock *sk)
> > >  {
> > > +        __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
> >
> > ->first can be NULL here, according to syzbot.
>
> I'm sorry for the latency on my side; I had a different kind of crash
> to handle here.
>
> Do you have a syzkaller report available? Or the call trace landing
> here?

Sure, let me release the report.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e44a3da12b96..1dacc072dcca 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -890,6 +890,7 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
 	mptcp_sockopt_sync_locked(msk, ssk);
 	mptcp_subflow_joined(msk, ssk);
 	mptcp_stop_tout_timer(sk);
+	__mptcp_propagate_sndbuf(sk, ssk);
 	return true;
 }
@@ -1076,15 +1077,16 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	bool first = true;

-	sk_stream_moderate_sndbuf(sk);
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);

 		if (first)
 			tcp_enter_memory_pressure(ssk);
 		sk_stream_moderate_sndbuf(ssk);
+
 		first = false;
 	}
+	__mptcp_sync_sndbuf(sk);
 }

 /* ensure we get enough memory for the frag hdr, beyond some minimal amount of
@@ -2458,6 +2460,7 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 		WRITE_ONCE(msk->first, NULL);

 out:
+	__mptcp_sync_sndbuf(sk);
 	if (need_push)
 		__mptcp_push_pending(sk, 0);
@@ -3224,7 +3227,7 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	 * uses the correct data
 	 */
 	mptcp_copy_inaddrs(nsk, ssk);
-	mptcp_propagate_sndbuf(nsk, ssk);
+	__mptcp_propagate_sndbuf(nsk, ssk);
 	mptcp_rcv_space_init(msk, ssk);

 	bh_unlock_sock(nsk);
@@ -3402,6 +3405,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_set_connected(sk);
 		if (__test_and_clear_bit(MPTCP_ERROR_REPORT, &msk->cb_flags))
 			__mptcp_error_report(sk);
+		if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
+			__mptcp_sync_sndbuf(sk);
 	}

 	__mptcp_update_rmem(sk);
@@ -3446,6 +3451,14 @@ void mptcp_subflow_process_delegated(struct sock *ssk, long status)
 			__set_bit(MPTCP_PUSH_PENDING, &mptcp_sk(sk)->cb_flags);
 		mptcp_data_unlock(sk);
 	}
+	if (status & BIT(MPTCP_DELEGATE_SNDBUF)) {
+		mptcp_data_lock(sk);
+		if (!sock_owned_by_user(sk))
+			__mptcp_sync_sndbuf(sk);
+		else
+			__set_bit(MPTCP_SYNC_SNDBUF, &mptcp_sk(sk)->cb_flags);
+		mptcp_data_unlock(sk);
+	}
 	if (status & BIT(MPTCP_DELEGATE_ACK))
 		schedule_3rdack_retransmission(ssk);
 }
@@ -3530,6 +3543,7 @@ bool mptcp_finish_join(struct sock *ssk)
 	/* active subflow, already present inside the conn_list */
 	if (!list_empty(&subflow->node)) {
 		mptcp_subflow_joined(msk, ssk);
+		mptcp_propagate_sndbuf(parent, ssk);
 		return true;
 	}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 620a82cd4c10..296d01965943 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -123,6 +123,7 @@
 #define MPTCP_RETRANSMIT	4
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_CONNECTED		6
+#define MPTCP_SYNC_SNDBUF	7

 struct mptcp_skb_cb {
 	u64 map_seq;
@@ -443,6 +444,7 @@ DECLARE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions);
 #define MPTCP_DELEGATE_SCHEDULED	0
 #define MPTCP_DELEGATE_SEND		1
 #define MPTCP_DELEGATE_ACK		2
+#define MPTCP_DELEGATE_SNDBUF		3

 #define MPTCP_DELEGATE_ACTIONS_MASK	(~BIT(MPTCP_DELEGATE_SCHEDULED))
 /* MPTCP subflow context */
@@ -516,6 +518,9 @@ struct mptcp_subflow_context {
 	u32	setsockopt_seq;
 	u32	stale_rcv_tstamp;
+	int	cached_sndbuf;	    /* sndbuf size when last synced with the msk sndbuf,
+				     * protected by the msk socket lock
+				     */

 	struct	sock *tcp_sock;	    /* tcp sk backpointer */
 	struct	sock *conn;	    /* parent mptcp_sock */
@@ -778,13 +783,52 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
 	       READ_ONCE(msk->write_seq) == READ_ONCE(msk->snd_nxt);
 }

-static inline bool mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
+static inline void __mptcp_sync_sndbuf(struct sock *sk)
 {
-	if ((sk->sk_userlocks & SOCK_SNDBUF_LOCK) || ssk->sk_sndbuf <= READ_ONCE(sk->sk_sndbuf))
-		return false;
+	struct mptcp_subflow_context *subflow;
+	int ssk_sndbuf, new_sndbuf;
+
+	if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
+		return;
+
+	new_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[0];
+	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+		ssk_sndbuf = READ_ONCE(mptcp_subflow_tcp_sock(subflow)->sk_sndbuf);
+
+		subflow->cached_sndbuf = ssk_sndbuf;
+		new_sndbuf += ssk_sndbuf;
+	}
+
+	/* the msk max wmem limit is <nr_subflows> * tcp wmem[2] */
+	WRITE_ONCE(sk->sk_sndbuf, new_sndbuf);
+}
+
+/* The called held both the msk socket and the subflow socket locks,
+ * possibly under BH
+ */
+static inline void __mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+
+	if (READ_ONCE(ssk->sk_sndbuf) != subflow->cached_sndbuf)
+		__mptcp_sync_sndbuf(sk);
+}
+
+/* the caller held only the subflow socket lock, either in process or
+ * BH context. Additionally this can be called under the msk data lock,
+ * so we can't acquire such lock here: let the delegate action acquires
+ * the needed locks in suitable order.
+ */
+static inline void mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+
+	if (likely(READ_ONCE(ssk->sk_sndbuf) == subflow->cached_sndbuf))
+		return;

-	WRITE_ONCE(sk->sk_sndbuf, ssk->sk_sndbuf);
-	return true;
+	local_bh_disable();
+	mptcp_subflow_delegate(subflow, MPTCP_DELEGATE_SNDBUF);
+	local_bh_enable();
 }

 static inline void mptcp_write_space(struct sock *sk)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 72858d7d8974..574e221bb765 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -95,6 +95,7 @@ static void mptcp_sol_socket_sync_intval(struct mptcp_sock *msk, int optname, in
 	case SO_SNDBUFFORCE:
 		ssk->sk_userlocks |= SOCK_SNDBUF_LOCK;
 		WRITE_ONCE(ssk->sk_sndbuf, sk->sk_sndbuf);
+		mptcp_subflow_ctx(ssk)->cached_sndbuf = sk->sk_sndbuf;
 		break;
 	case SO_RCVBUF:
 	case SO_RCVBUFFORCE:
@@ -1415,8 +1416,10 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
 	if (sk->sk_userlocks & tx_rx_locks) {
 		ssk->sk_userlocks |= sk->sk_userlocks & tx_rx_locks;
-		if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
+		if (sk->sk_userlocks & SOCK_SNDBUF_LOCK) {
 			WRITE_ONCE(ssk->sk_sndbuf, sk->sk_sndbuf);
+			mptcp_subflow_ctx(ssk)->cached_sndbuf = sk->sk_sndbuf;
+		}
 		if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 			WRITE_ONCE(ssk->sk_rcvbuf, sk->sk_rcvbuf);
 	}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index df208666fd19..2b43577f952e 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
 void __mptcp_set_connected(struct sock *sk)
 {
+	__mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
 	if (sk->sk_state == TCP_SYN_SENT) {
 		inet_sk_state_store(sk, TCP_ESTABLISHED);
 		sk->sk_state_change(sk);
@@ -472,7 +473,6 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 		return;

 	msk = mptcp_sk(parent);
-	mptcp_propagate_sndbuf(parent, sk);
 	subflow->rel_write_seq = 1;
 	subflow->conn_finished = 1;
 	subflow->ssn_offset = TCP_SKB_CB(skb)->seq;
@@ -1736,7 +1736,6 @@ static void subflow_state_change(struct sock *sk)
 	msk = mptcp_sk(parent);
 	if (subflow_simultaneous_connect(sk)) {
-		mptcp_propagate_sndbuf(parent, sk);
 		mptcp_do_fallback(sk);
 		mptcp_rcv_space_init(msk, sk);
 		pr_fallback(msk);
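As a usage note: the SOCK_SNDBUF_LOCK check kept in __mptcp_sync_sndbuf() means manual tuning still overrides the new sum-based auto-tuning. Once userspace sets SO_SNDBUF on the MPTCP socket, the kernel stops resizing it and, per the sockopt sync code in the diff, copies the locked value to the subflows. A small standalone sketch (not part of the patch; IPPROTO_MPTCP is defined here only in case the system headers lack it):

/* Pin the MPTCP-level send buffer by hand and read back the effective
 * value (the kernel doubles the requested SO_SNDBUF on set).
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* from include/uapi/linux/in.h */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
	int sndbuf = 1 << 20;	/* request 1 MiB */
	socklen_t len = sizeof(sndbuf);

	if (fd < 0) {
		perror("socket(IPPROTO_MPTCP)");
		return 1;
	}

	/* This sets SOCK_SNDBUF_LOCK on the msk, disabling auto-tuning. */
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		perror("setsockopt(SO_SNDBUF)");

	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
		printf("effective SO_SNDBUF: %d bytes\n", sndbuf);

	close(fd);
	return 0;
}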