Message ID | 433320c3a9db77bea53a34fc9c43a3c7e3320399.1709693691.git.tanggeliang@kylinos.cn (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Paolo Abeni |
Headers | show |
Series | [RFC] mptcp: react scheduler when subflow events pop up | expand |
Context | Check | Description |
---|---|---|
matttbe/build | success | Build and static analysis OK |
matttbe/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 76 lines checked |
matttbe/shellcheck | success | MPTCP selftests files have not been modified |
matttbe/KVM_Validation__normal | success | Success! ✅ |
matttbe/KVM_Validation__debug__except_selftest_mptcp_join_ | success | Success! ✅ |
matttbe/KVM_Validation__debug__only_selftest_mptcp_join_ | success | Success! ✅ |
Hi Geliang, Thank you for your modifications, that's great! Our CI (GitHub Action) did some validations and here is its report: - KVM Validation: normal: - Success! ✅: - Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/8166369924 Initiator: Patchew Applier Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/218fdba4f2b5 If there are some issues, you can reproduce them using the same environment as the one used by the CI thanks to a docker image, e.g.: $ cd [kernel source code] $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \ --pull always mptcp/mptcp-upstream-virtme-docker:latest \ auto-normal For more details: https://github.com/multipath-tcp/mptcp-upstream-virtme-docker Please note that despite all the efforts that have been already done to have a stable tests suite when executed on a public CI like here, it is possible some reported issues are not due to your modifications. Still, do not hesitate to help us improve that ;-) Cheers, MPTCP GH Action bot Bot operated by Matthieu Baerts (NGI0 Core)
Hi Geliang, Thank you for your modifications, that's great! Our CI (Cirrus) did some validations with a debug kernel and here is its report: - KVM Validation: debug (except selftest_mptcp_join): - Success! ✅: - Task: https://cirrus-ci.com/task/6251159791337472 - Summary: https://api.cirrus-ci.com/v1/artifact/task/6251159791337472/summary/summary.txt - KVM Validation: debug (only selftest_mptcp_join): - Success! ✅: - Task: https://cirrus-ci.com/task/4843784907784192 - Summary: https://api.cirrus-ci.com/v1/artifact/task/4843784907784192/summary/summary.txt Initiator: Patchew Applier Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/218fdba4f2b5 If there are some issues, you can reproduce them using the same environment as the one used by the CI thanks to a docker image, e.g.: $ cd [kernel source code] $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \ --pull always mptcp/mptcp-upstream-virtme-docker:latest \ auto-debug For more details: https://github.com/multipath-tcp/mptcp-upstream-virtme-docker Please note that despite all the efforts that have been already done to have a stable tests suite when executed on a public CI like here, it is possible some reported issues are not due to your modifications. Still, do not hesitate to help us improve that ;-) Cheers, MPTCP GH Action bot Bot operated by Matthieu Baerts (NGI0 Core)
On Wed, 2024-03-06 at 10:55 +0800, Geliang Tang wrote: > From: Geliang Tang <tanggeliang@kylinos.cn> > > The current packet scheduler will not react by queuing more packets if > some subflows only events are emitted, e.g. new TCP ACKs are received > only acking things at TCP-level but not at MPTCP level. The scheduler > should be called when such events happen. > > ack_update_msk() is invoked when an ACK is received, so it's the right > place to call the scheduler by invoking __mptcp_check_push(). > > mptcp_subflow_timeout() is implemented to call the scheduler when a > subflow timeout happens. But I'm not sure where is the right place to > invoke it. I mean where is the right place when a RTO is fired. So I > temporarily do it in mptcp_worker(). > > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/343 > Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> > --- > include/net/tcp.h | 3 +++ > net/ipv4/tcp_timer.c | 6 +++--- > net/mptcp/options.c | 1 + > net/mptcp/protocol.c | 23 ++++++++++++++++++++++- > net/mptcp/protocol.h | 1 + > 5 files changed, 30 insertions(+), 4 deletions(-) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index 6ae35199d3b3..431a23a8ebfa 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -347,6 +347,9 @@ void tcp_release_cb(struct sock *sk); > void tcp_wfree(struct sk_buff *skb); > void tcp_write_timer_handler(struct sock *sk); > void tcp_delack_timer_handler(struct sock *sk); > +bool retransmits_timed_out(struct sock *sk, > + unsigned int boundary, > + unsigned int timeout); > int tcp_ioctl(struct sock *sk, int cmd, int *karg); > enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb); > void tcp_rcv_established(struct sock *sk, struct sk_buff *skb); > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c > index d1ad20ce1c8c..cfedafc4ca36 100644 > --- a/net/ipv4/tcp_timer.c > +++ b/net/ipv4/tcp_timer.c > @@ -212,9 +212,9 @@ static unsigned int tcp_model_timeout(struct sock *sk, > * after "boundary" unsuccessful, exponentially backed-off > * retransmissions with an initial RTO of TCP_RTO_MIN. > */ > -static bool retransmits_timed_out(struct sock *sk, > - unsigned int boundary, > - unsigned int timeout) > +bool retransmits_timed_out(struct sock *sk, > + unsigned int boundary, > + unsigned int timeout) > { > struct tcp_sock *tp = tcp_sk(sk); > unsigned int start_ts, delta; If we access the function from outside the tcp code, we additionally need to rename it as 'tcp_retransmits_timed_out' or the like. > diff --git a/net/mptcp/options.c b/net/mptcp/options.c > index 5926955625cf..4b091d3bce00 100644 > --- a/net/mptcp/options.c > +++ b/net/mptcp/options.c > @@ -1069,6 +1069,7 @@ static void ack_update_msk(struct mptcp_sock *msk, > __mptcp_snd_una_update(msk, new_snd_una); > __mptcp_data_acked(sk); > } > + __mptcp_check_push(sk, ssk); It we call __mptcp_check_push() unconditionally here, we must remove the previous conditional call, a few lines above. Still I think we should not call it unconditionally: only if msk ack seq or the subflow ack seq changed. Additionally we need some performance figures the ensure this is not impacting [too much] the bulk transfer performances Somewhat related. I *think* there is a possible issue WRT the current (prior to this patch) packet scheduler strategy: When the msk ack sequence increases, we unconditionally push data on the subflow receiving such update: https://elixir.bootlin.com/linux/latest/source/net/mptcp/protocol.c#L1643 That choice could be sup-optimal. I *think* we should always call the scheduler, and I'm not sure (I can't recall) the why of the current implementation. This could also explain some simult-subflow self-tests failures > mptcp_data_unlock(sk); > > trace_ack_update_msk(mp_opt->data_ack, > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c > index cdf9ec67795e..c935f6a7adb0 100644 > --- a/net/mptcp/protocol.c > +++ b/net/mptcp/protocol.c > @@ -486,6 +486,26 @@ static long mptcp_timeout_from_subflow(const struct mptcp_subflow_context *subfl > inet_csk(ssk)->icsk_timeout - jiffies : 0; > } > > +static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout) > +{ > + unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1; > + > + if (retransmits_timed_out(ssk, boundary, tout)) > + __mptcp_check_push(sk, ssk); Both retransmits_timed_out() and __mptcp_check_push() must be called under the ssk socket lock. I think we should look for a better solution then acquiring such lock unconditionally for each subflow. More importantly why this hook is needed? the subflow did make any progress, we can't enqueue more data there. A possible alternative would be use this hook to mark the subflow as stale, but... do we have any issues with the current schema to detect stale subflows? Thanks! Paolo
Hi Paolo, Thanks for your review. On Mon, Mar 11, 2024 at 02:10:25PM +0100, Paolo Abeni wrote: > On Wed, 2024-03-06 at 10:55 +0800, Geliang Tang wrote: > > From: Geliang Tang <tanggeliang@kylinos.cn> > > > > The current packet scheduler will not react by queuing more packets if > > some subflows only events are emitted, e.g. new TCP ACKs are received > > only acking things at TCP-level but not at MPTCP level. The scheduler > > should be called when such events happen. > > > > ack_update_msk() is invoked when an ACK is received, so it's the right > > place to call the scheduler by invoking __mptcp_check_push(). > > > > mptcp_subflow_timeout() is implemented to call the scheduler when a > > subflow timeout happens. But I'm not sure where is the right place to > > invoke it. I mean where is the right place when a RTO is fired. So I > > temporarily do it in mptcp_worker(). > > > > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/343 > > Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> > > --- > > include/net/tcp.h | 3 +++ > > net/ipv4/tcp_timer.c | 6 +++--- > > net/mptcp/options.c | 1 + > > net/mptcp/protocol.c | 23 ++++++++++++++++++++++- > > net/mptcp/protocol.h | 1 + > > 5 files changed, 30 insertions(+), 4 deletions(-) > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index 6ae35199d3b3..431a23a8ebfa 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -347,6 +347,9 @@ void tcp_release_cb(struct sock *sk); > > void tcp_wfree(struct sk_buff *skb); > > void tcp_write_timer_handler(struct sock *sk); > > void tcp_delack_timer_handler(struct sock *sk); > > +bool retransmits_timed_out(struct sock *sk, > > + unsigned int boundary, > > + unsigned int timeout); > > int tcp_ioctl(struct sock *sk, int cmd, int *karg); > > enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb); > > void tcp_rcv_established(struct sock *sk, struct sk_buff *skb); > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c > > index d1ad20ce1c8c..cfedafc4ca36 100644 > > --- a/net/ipv4/tcp_timer.c > > +++ b/net/ipv4/tcp_timer.c > > @@ -212,9 +212,9 @@ static unsigned int tcp_model_timeout(struct sock *sk, > > * after "boundary" unsuccessful, exponentially backed-off > > * retransmissions with an initial RTO of TCP_RTO_MIN. > > */ > > -static bool retransmits_timed_out(struct sock *sk, > > - unsigned int boundary, > > - unsigned int timeout) > > +bool retransmits_timed_out(struct sock *sk, > > + unsigned int boundary, > > + unsigned int timeout) > > { > > struct tcp_sock *tp = tcp_sk(sk); > > unsigned int start_ts, delta; > > If we access the function from outside the tcp code, we additionally > need to rename it as 'tcp_retransmits_timed_out' or the like. > > > diff --git a/net/mptcp/options.c b/net/mptcp/options.c > > index 5926955625cf..4b091d3bce00 100644 > > --- a/net/mptcp/options.c > > +++ b/net/mptcp/options.c > > @@ -1069,6 +1069,7 @@ static void ack_update_msk(struct mptcp_sock *msk, > > __mptcp_snd_una_update(msk, new_snd_una); > > __mptcp_data_acked(sk); > > } > > + __mptcp_check_push(sk, ssk); > > It we call __mptcp_check_push() unconditionally here, we must remove > the previous conditional call, a few lines above. > > Still I think we should not call it unconditionally: only if msk ack > seq or the subflow ack seq changed. I didn't notice that __mptcp_check_push() is just invoked a few lines above. Yes, you're right, no need to add another one. Here addresses first issue in #343, call the scheduler when TCP ACKs are received. It looks like the scheduler has already been called, (the previous conditional call) when TCP ACKs are received. So we don't need to add any more code, do we? > > Additionally we need some performance figures the ensure this is not > impacting [too much] the bulk transfer performances > > Somewhat related. I *think* there is a possible issue WRT the current > (prior to this patch) packet scheduler strategy: > > When the msk ack sequence increases, we unconditionally push data on > the subflow receiving such update: > > https://elixir.bootlin.com/linux/latest/source/net/mptcp/protocol.c#L1643 > > That choice could be sup-optimal. I *think* we should always call the > scheduler, and I'm not sure (I can't recall) the why of the current > implementation. The original code of this part is: /* the caller already invoked the packet scheduler, * check for a different subflow usage only after * spooling the first chunk of data */ xmit_ssk = first ? ssk : mptcp_subflow_get_send(msk); if (!xmit_ssk) goto out; if (xmit_ssk != ssk) { mptcp_subflow_delegate(mptcp_subflow_ctx(xmit_ssk), MPTCP_DELEGATE_SEND); goto out; } I guess the reason is in the comment: "the caller already invoked the packet scheduler". > > This could also explain some simult-subflow self-tests failures > > > mptcp_data_unlock(sk); > > > > trace_ack_update_msk(mp_opt->data_ack, > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c > > index cdf9ec67795e..c935f6a7adb0 100644 > > --- a/net/mptcp/protocol.c > > +++ b/net/mptcp/protocol.c > > @@ -486,6 +486,26 @@ static long mptcp_timeout_from_subflow(const struct mptcp_subflow_context *subfl > > inet_csk(ssk)->icsk_timeout - jiffies : 0; > > } > > > > +static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout) > > +{ > > + unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1; > > + > > + if (retransmits_timed_out(ssk, boundary, tout)) > > + __mptcp_check_push(sk, ssk); > > Both retransmits_timed_out() and __mptcp_check_push() must be called > under the ssk socket lock. > > I think we should look for a better solution then acquiring such lock > unconditionally for each subflow. > > More importantly why this hook is needed? the subflow did make any > progress, we can't enqueue more data there. Here addresses second issue in #343, call the scheduler when a subflow timeout happens. As I described in the commit log: "I'm not sure where is the right place when a RTO is fired. So I temporarily do it in mptcp_worker()." Thanks, -Geliang > > A possible alternative would be use this hook to mark the subflow as > stale, but... do we have any issues with the current schema to detect > stale subflows? > > Thanks! > > Paolo >
diff --git a/include/net/tcp.h b/include/net/tcp.h index 6ae35199d3b3..431a23a8ebfa 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -347,6 +347,9 @@ void tcp_release_cb(struct sock *sk); void tcp_wfree(struct sk_buff *skb); void tcp_write_timer_handler(struct sock *sk); void tcp_delack_timer_handler(struct sock *sk); +bool retransmits_timed_out(struct sock *sk, + unsigned int boundary, + unsigned int timeout); int tcp_ioctl(struct sock *sk, int cmd, int *karg); enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb); void tcp_rcv_established(struct sock *sk, struct sk_buff *skb); diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index d1ad20ce1c8c..cfedafc4ca36 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -212,9 +212,9 @@ static unsigned int tcp_model_timeout(struct sock *sk, * after "boundary" unsuccessful, exponentially backed-off * retransmissions with an initial RTO of TCP_RTO_MIN. */ -static bool retransmits_timed_out(struct sock *sk, - unsigned int boundary, - unsigned int timeout) +bool retransmits_timed_out(struct sock *sk, + unsigned int boundary, + unsigned int timeout) { struct tcp_sock *tp = tcp_sk(sk); unsigned int start_ts, delta; diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 5926955625cf..4b091d3bce00 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -1069,6 +1069,7 @@ static void ack_update_msk(struct mptcp_sock *msk, __mptcp_snd_una_update(msk, new_snd_una); __mptcp_data_acked(sk); } + __mptcp_check_push(sk, ssk); mptcp_data_unlock(sk); trace_ack_update_msk(mp_opt->data_ack, diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index cdf9ec67795e..c935f6a7adb0 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -486,6 +486,26 @@ static long mptcp_timeout_from_subflow(const struct mptcp_subflow_context *subfl inet_csk(ssk)->icsk_timeout - jiffies : 0; } +static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout) +{ + unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1; + + if (retransmits_timed_out(ssk, boundary, tout)) + __mptcp_check_push(sk, ssk); +} + +static void mptcp_subflow_timeout(struct sock *sk) +{ + struct mptcp_subflow_context *subflow; + + mptcp_for_each_subflow(mptcp_sk(sk), subflow) { + struct sock *ssk = mptcp_subflow_tcp_sock(subflow); + long tout = mptcp_timeout_from_subflow(subflow); + + __mptcp_subflow_timeout(sk, ssk, tout); + } +} + void mptcp_set_timeout(struct sock *sk) { struct mptcp_subflow_context *subflow; @@ -1624,7 +1644,7 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags) mptcp_check_send_data_fin(sk); } -static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first) +void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first) { struct mptcp_sock *msk = mptcp_sk(sk); struct mptcp_sendmsg_info info = { @@ -2741,6 +2761,7 @@ static void mptcp_worker(struct work_struct *work) mptcp_check_fastclose(msk); mptcp_pm_nl_work(msk); + mptcp_subflow_timeout(sk); mptcp_check_send_data_fin(sk); mptcp_check_data_fin_ack(sk); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 7905783c95e4..f4b75be49690 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -656,6 +656,7 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk, void __mptcp_subflow_send_ack(struct sock *ssk); void mptcp_subflow_reset(struct sock *ssk); void mptcp_subflow_queue_clean(struct sock *sk, struct sock *ssk); +void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first); void mptcp_sock_graft(struct sock *sk, struct socket *parent); u64 mptcp_wnd_end(const struct mptcp_sock *msk); void mptcp_set_timeout(struct sock *sk);