
[RFC] mptcp: react scheduler when subflow events pop up

Message ID 433320c3a9db77bea53a34fc9c43a3c7e3320399.1709693691.git.tanggeliang@kylinos.cn (mailing list archive)
State Changes Requested
Delegated to: Paolo Abeni
Series [RFC] mptcp: react scheduler when subflow events pop up

Checks

Context Check Description
matttbe/build success Build and static analysis OK
matttbe/checkpatch success total: 0 errors, 0 warnings, 0 checks, 76 lines checked
matttbe/shellcheck success MPTCP selftests files have not been modified
matttbe/KVM_Validation__normal success Success! ✅
matttbe/KVM_Validation__debug__except_selftest_mptcp_join_ success Success! ✅
matttbe/KVM_Validation__debug__only_selftest_mptcp_join_ success Success! ✅

Commit Message

Geliang Tang March 6, 2024, 2:55 a.m. UTC
From: Geliang Tang <tanggeliang@kylinos.cn>

The current packet scheduler does not react by queuing more packets when
only subflow-level events are emitted, e.g. new TCP ACKs that only ack
data at the TCP level but not at the MPTCP level. The scheduler should
be called when such events happen.

ack_update_msk() is invoked when an ACK is received, so it's the right
place to call the scheduler by invoking __mptcp_check_push().

mptcp_subflow_timeout() is implemented to call the scheduler when a
subflow timeout happens. But I'm not sure where the right place to
invoke it is, i.e. where the right place is when an RTO fires. So I
temporarily do it in mptcp_worker().

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/343
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
 include/net/tcp.h    |  3 +++
 net/ipv4/tcp_timer.c |  6 +++---
 net/mptcp/options.c  |  1 +
 net/mptcp/protocol.c | 23 ++++++++++++++++++++++-
 net/mptcp/protocol.h |  1 +
 5 files changed, 30 insertions(+), 4 deletions(-)

Comments

MPTCP CI March 6, 2024, 3:49 a.m. UTC | #1
Hi Geliang,

Thank you for your modifications, that's great!

Our CI (GitHub Action) did some validations and here is its report:

- KVM Validation: normal:
  - Success! ✅:
  - Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/8166369924

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/218fdba4f2b5


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have already been made to have a
stable test suite when executed on a public CI like this one, it is possible that
some reported issues are not due to your modifications. Still, do not hesitate
to help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
MPTCP CI March 6, 2024, 4:06 a.m. UTC | #2
Hi Geliang,

Thank you for your modifications, that's great!

Our CI (Cirrus) did some validations with a debug kernel and here is its report:

- KVM Validation: debug (except selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/6251159791337472
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6251159791337472/summary/summary.txt

- KVM Validation: debug (only selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/4843784907784192
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/4843784907784192/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/218fdba4f2b5


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have already been made to have a
stable test suite when executed on a public CI like this one, it is possible that
some reported issues are not due to your modifications. Still, do not hesitate
to help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
Paolo Abeni March 11, 2024, 1:10 p.m. UTC | #3
On Wed, 2024-03-06 at 10:55 +0800, Geliang Tang wrote:
> From: Geliang Tang <tanggeliang@kylinos.cn>
> 
> The current packet scheduler does not react by queuing more packets when
> only subflow-level events are emitted, e.g. new TCP ACKs that only ack
> data at the TCP level but not at the MPTCP level. The scheduler should
> be called when such events happen.
> 
> ack_update_msk() is invoked when an ACK is received, so it's the right
> place to call the scheduler by invoking __mptcp_check_push().
> 
> mptcp_subflow_timeout() is implemented to call the scheduler when a
> subflow timeout happens. But I'm not sure where the right place to
> invoke it is, i.e. where the right place is when an RTO fires. So I
> temporarily do it in mptcp_worker().
> 
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/343
> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
> ---
>  include/net/tcp.h    |  3 +++
>  net/ipv4/tcp_timer.c |  6 +++---
>  net/mptcp/options.c  |  1 +
>  net/mptcp/protocol.c | 23 ++++++++++++++++++++++-
>  net/mptcp/protocol.h |  1 +
>  5 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 6ae35199d3b3..431a23a8ebfa 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -347,6 +347,9 @@ void tcp_release_cb(struct sock *sk);
>  void tcp_wfree(struct sk_buff *skb);
>  void tcp_write_timer_handler(struct sock *sk);
>  void tcp_delack_timer_handler(struct sock *sk);
> +bool retransmits_timed_out(struct sock *sk,
> +			   unsigned int boundary,
> +			   unsigned int timeout);
>  int tcp_ioctl(struct sock *sk, int cmd, int *karg);
>  enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
>  void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index d1ad20ce1c8c..cfedafc4ca36 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -212,9 +212,9 @@ static unsigned int tcp_model_timeout(struct sock *sk,
>   * after "boundary" unsuccessful, exponentially backed-off
>   * retransmissions with an initial RTO of TCP_RTO_MIN.
>   */
> -static bool retransmits_timed_out(struct sock *sk,
> -				  unsigned int boundary,
> -				  unsigned int timeout)
> +bool retransmits_timed_out(struct sock *sk,
> +			   unsigned int boundary,
> +			   unsigned int timeout)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	unsigned int start_ts, delta;

If we access the function from outside the tcp code, we additionally
need to rename it as 'tcp_retransmits_timed_out' or the like.
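
For illustration only, a minimal sketch of what that rename could look
like (not part of this patch; the body stays exactly as it is in
tcp_timer.c today, and presumably no EXPORT_SYMBOL() is needed since
MPTCP cannot be built as a module):

    /* include/net/tcp.h */
    bool tcp_retransmits_timed_out(struct sock *sk,
                                   unsigned int boundary,
                                   unsigned int timeout);

    /* net/ipv4/tcp_timer.c: drop 'static' and add the tcp_ prefix */
    bool tcp_retransmits_timed_out(struct sock *sk,
                                   unsigned int boundary,
                                   unsigned int timeout)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            unsigned int start_ts, delta;
            ...
    }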

> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
> index 5926955625cf..4b091d3bce00 100644
> --- a/net/mptcp/options.c
> +++ b/net/mptcp/options.c
> @@ -1069,6 +1069,7 @@ static void ack_update_msk(struct mptcp_sock *msk,
>  		__mptcp_snd_una_update(msk, new_snd_una);
>  		__mptcp_data_acked(sk);
>  	}
> +	__mptcp_check_push(sk, ssk);

If we call __mptcp_check_push() unconditionally here, we must remove
the previous conditional call a few lines above.

Still, I think we should not call it unconditionally: only if the msk
ack seq or the subflow ack seq changed.
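
To illustrate the idea, a rough sketch of the msk side of such a check
(here 'old_snd_una' is just a placeholder for whatever snapshot
ack_update_msk() already keeps, and a similar test could cover the
subflow-level ack; this is not the actual patch):

    if (after64(new_snd_una, old_snd_una)) {        /* msk ack seq advanced */
            __mptcp_snd_una_update(msk, new_snd_una);
            __mptcp_data_acked(sk);
            __mptcp_check_push(sk, ssk);            /* push only on real progress */
    }
    mptcp_data_unlock(sk);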

Additionally we need some performance figures to ensure this is not
impacting [too much] the bulk transfer performance.

Somewhat related. I *think* there is a possible issue WRT the current
(prior to this patch) packet scheduler strategy:

When the msk ack sequence increases, we unconditionally push data on
the subflow receiving such update:

https://elixir.bootlin.com/linux/latest/source/net/mptcp/protocol.c#L1643

That choice could be sub-optimal. I *think* we should always call the
scheduler, and I'm not sure (I can't recall) why the current
implementation does it this way.

This could also explain some simult-subflow self-test failures

>  	mptcp_data_unlock(sk);
>  
>  	trace_ack_update_msk(mp_opt->data_ack,
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index cdf9ec67795e..c935f6a7adb0 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -486,6 +486,26 @@ static long mptcp_timeout_from_subflow(const struct mptcp_subflow_context *subfl
>  	       inet_csk(ssk)->icsk_timeout - jiffies : 0;
>  }
>  
> +static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout)
> +{
> +	unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1;
> +
> +	if (retransmits_timed_out(ssk, boundary, tout))
> +		__mptcp_check_push(sk, ssk);

Both retransmits_timed_out() and __mptcp_check_push() must be called
under the ssk socket lock.

I think we should look for a better solution than acquiring such a lock
unconditionally for each subflow.
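
Just to make the locking concern concrete, the per-subflow variant being
questioned here would look roughly like this (a sketch only, reusing the
names from the patch):

    static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout)
    {
            unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1;
            bool slow;

            /* both calls below need the ssk lock; taking it here for every
             * subflow on each worker run is the cost in question
             */
            slow = lock_sock_fast(ssk);
            if (retransmits_timed_out(ssk, boundary, tout))
                    __mptcp_check_push(sk, ssk);
            unlock_sock_fast(ssk, slow);
    }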

More importantly, why is this hook needed? The subflow did not make any
progress, so we can't enqueue more data there.

A possible alternative would be to use this hook to mark the subflow as
stale, but... do we have any issues with the current scheme to detect
stale subflows?

Thanks!

Paolo
Geliang Tang March 12, 2024, 6 a.m. UTC | #4
Hi Paolo,

Thanks for your review.

On Mon, Mar 11, 2024 at 02:10:25PM +0100, Paolo Abeni wrote:
> On Wed, 2024-03-06 at 10:55 +0800, Geliang Tang wrote:
> > From: Geliang Tang <tanggeliang@kylinos.cn>
> > 
> > The current packet scheduler does not react by queuing more packets when
> > only subflow-level events are emitted, e.g. new TCP ACKs that only ack
> > data at the TCP level but not at the MPTCP level. The scheduler should
> > be called when such events happen.
> > 
> > ack_update_msk() is invoked when an ACK is received, so it's the right
> > place to call the scheduler by invoking __mptcp_check_push().
> > 
> > mptcp_subflow_timeout() is implemented to call the scheduler when a
> > subflow timeout happens. But I'm not sure where the right place to
> > invoke it is, i.e. where the right place is when an RTO fires. So I
> > temporarily do it in mptcp_worker().
> > 
> > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/343
> > Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
> > ---
> >  include/net/tcp.h    |  3 +++
> >  net/ipv4/tcp_timer.c |  6 +++---
> >  net/mptcp/options.c  |  1 +
> >  net/mptcp/protocol.c | 23 ++++++++++++++++++++++-
> >  net/mptcp/protocol.h |  1 +
> >  5 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 6ae35199d3b3..431a23a8ebfa 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -347,6 +347,9 @@ void tcp_release_cb(struct sock *sk);
> >  void tcp_wfree(struct sk_buff *skb);
> >  void tcp_write_timer_handler(struct sock *sk);
> >  void tcp_delack_timer_handler(struct sock *sk);
> > +bool retransmits_timed_out(struct sock *sk,
> > +			   unsigned int boundary,
> > +			   unsigned int timeout);
> >  int tcp_ioctl(struct sock *sk, int cmd, int *karg);
> >  enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
> >  void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index d1ad20ce1c8c..cfedafc4ca36 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -212,9 +212,9 @@ static unsigned int tcp_model_timeout(struct sock *sk,
> >   * after "boundary" unsuccessful, exponentially backed-off
> >   * retransmissions with an initial RTO of TCP_RTO_MIN.
> >   */
> > -static bool retransmits_timed_out(struct sock *sk,
> > -				  unsigned int boundary,
> > -				  unsigned int timeout)
> > +bool retransmits_timed_out(struct sock *sk,
> > +			   unsigned int boundary,
> > +			   unsigned int timeout)
> >  {
> >  	struct tcp_sock *tp = tcp_sk(sk);
> >  	unsigned int start_ts, delta;
> 
> If we access the function from outside the tcp code, we additionally
> need to rename it as 'tcp_retransmits_timed_out' or the like.
> 
> > diff --git a/net/mptcp/options.c b/net/mptcp/options.c
> > index 5926955625cf..4b091d3bce00 100644
> > --- a/net/mptcp/options.c
> > +++ b/net/mptcp/options.c
> > @@ -1069,6 +1069,7 @@ static void ack_update_msk(struct mptcp_sock *msk,
> >  		__mptcp_snd_una_update(msk, new_snd_una);
> >  		__mptcp_data_acked(sk);
> >  	}
> > +	__mptcp_check_push(sk, ssk);
> 
> If we call __mptcp_check_push() unconditionally here, we must remove
> the previous conditional call a few lines above.
> 
> Still, I think we should not call it unconditionally: only if the msk
> ack seq or the subflow ack seq changed.

I didn't notice that __mptcp_check_push() is already invoked a few lines
above. Yes, you're right, no need to add another one.

This addresses the first issue in #343: call the scheduler when TCP ACKs
are received.

It looks like the scheduler is already called (the previous conditional
call) when TCP ACKs are received. So we don't need to add any more code,
do we?
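
For reference, the conditional call in question looks roughly like this
in ack_update_msk() (paraphrased from net/mptcp/options.c, so worth
double-checking against the tree):

    if (after64(new_wnd_end, msk->wnd_end))
            msk->wnd_end = new_wnd_end;

    /* this assumes mptcp_data_fin_enabled(msk) is false */
    if (!msk->snd_data_fin_enable && after64(msk->wnd_end, msk->snd_nxt))
            __mptcp_check_push(sk, ssk);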

> 
> Additionally we need some performance figures to ensure this is not
> impacting [too much] the bulk transfer performance.
> 
> Somewhat related. I *think* there is a possible issue WRT the current
> (prior to this patch) packet scheduler strategy:
> 
> When the msk ack sequence increases, we unconditionally push data on
> the subflow receiving such update:
> 
> https://elixir.bootlin.com/linux/latest/source/net/mptcp/protocol.c#L1643
> 
> That choice could be sub-optimal. I *think* we should always call the
> scheduler, and I'm not sure (I can't recall) why the current
> implementation does it this way.

The original code of this part is:

         /* the caller already invoked the packet scheduler,
          * check for a different subflow usage only after
          * spooling the first chunk of data
          */
         xmit_ssk = first ? ssk : mptcp_subflow_get_send(msk);
         if (!xmit_ssk)
                 goto out;
         if (xmit_ssk != ssk) {
                 mptcp_subflow_delegate(mptcp_subflow_ctx(xmit_ssk),
                                        MPTCP_DELEGATE_SEND);
                 goto out;
         }

I guess the reason is in the comment: "the caller already invoked the
packet scheduler".
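
If 'always call the scheduler' were the way to go, the snippet above
would roughly lose the shortcut on 'first' (a sketch against that old
code, not against the current tree):

    /* sketch: consult the scheduler even for the first chunk */
    xmit_ssk = mptcp_subflow_get_send(msk);
    if (!xmit_ssk)
            goto out;
    if (xmit_ssk != ssk) {
            mptcp_subflow_delegate(mptcp_subflow_ctx(xmit_ssk),
                                   MPTCP_DELEGATE_SEND);
            goto out;
    }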

> 
> This could also explain some simult-subflow self-test failures
> 
> >  	mptcp_data_unlock(sk);
> >  
> >  	trace_ack_update_msk(mp_opt->data_ack,
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index cdf9ec67795e..c935f6a7adb0 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -486,6 +486,26 @@ static long mptcp_timeout_from_subflow(const struct mptcp_subflow_context *subfl
> >  	       inet_csk(ssk)->icsk_timeout - jiffies : 0;
> >  }
> >  
> > +static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout)
> > +{
> > +	unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1;
> > +
> > +	if (retransmits_timed_out(ssk, boundary, tout))
> > +		__mptcp_check_push(sk, ssk);
> 
> Both retransmits_timed_out() and __mptcp_check_push() must be called
> under the ssk socket lock.
> 
> I think we should look for a better solution than acquiring such a lock
> unconditionally for each subflow.
> 
> More importantly, why is this hook needed? The subflow did not make any
> progress, so we can't enqueue more data there.

This addresses the second issue in #343: call the scheduler when a
subflow timeout happens. As I described in the commit log:

"I'm not sure where the right place is when an RTO fires. So I temporarily
do it in mptcp_worker()."

Thanks,
-Geliang

> 
> A possible alternative would be to use this hook to mark the subflow as
> stale, but... do we have any issues with the current scheme to detect
> stale subflows?
> 
> Thanks!
> 
> Paolo
>

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6ae35199d3b3..431a23a8ebfa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -347,6 +347,9 @@  void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
 void tcp_delack_timer_handler(struct sock *sk);
+bool retransmits_timed_out(struct sock *sk,
+			   unsigned int boundary,
+			   unsigned int timeout);
 int tcp_ioctl(struct sock *sk, int cmd, int *karg);
 enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
 void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index d1ad20ce1c8c..cfedafc4ca36 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -212,9 +212,9 @@  static unsigned int tcp_model_timeout(struct sock *sk,
  * after "boundary" unsuccessful, exponentially backed-off
  * retransmissions with an initial RTO of TCP_RTO_MIN.
  */
-static bool retransmits_timed_out(struct sock *sk,
-				  unsigned int boundary,
-				  unsigned int timeout)
+bool retransmits_timed_out(struct sock *sk,
+			   unsigned int boundary,
+			   unsigned int timeout)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int start_ts, delta;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 5926955625cf..4b091d3bce00 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1069,6 +1069,7 @@  static void ack_update_msk(struct mptcp_sock *msk,
 		__mptcp_snd_una_update(msk, new_snd_una);
 		__mptcp_data_acked(sk);
 	}
+	__mptcp_check_push(sk, ssk);
 	mptcp_data_unlock(sk);
 
 	trace_ack_update_msk(mp_opt->data_ack,
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index cdf9ec67795e..c935f6a7adb0 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -486,6 +486,26 @@  static long mptcp_timeout_from_subflow(const struct mptcp_subflow_context *subfl
 	       inet_csk(ssk)->icsk_timeout - jiffies : 0;
 }
 
+static void __mptcp_subflow_timeout(struct sock *sk, struct sock *ssk, long tout)
+{
+	unsigned int boundary = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_retries1) + 1;
+
+	if (retransmits_timed_out(ssk, boundary, tout))
+		__mptcp_check_push(sk, ssk);
+}
+
+static void mptcp_subflow_timeout(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow;
+
+	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+		long tout = mptcp_timeout_from_subflow(subflow);
+
+		__mptcp_subflow_timeout(sk, ssk, tout);
+	}
+}
+
 void mptcp_set_timeout(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -1624,7 +1644,7 @@  void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 		mptcp_check_send_data_fin(sk);
 }
 
-static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first)
+void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_sendmsg_info info = {
@@ -2741,6 +2761,7 @@  static void mptcp_worker(struct work_struct *work)
 	mptcp_check_fastclose(msk);
 
 	mptcp_pm_nl_work(msk);
+	mptcp_subflow_timeout(sk);
 
 	mptcp_check_send_data_fin(sk);
 	mptcp_check_data_fin_ack(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 7905783c95e4..f4b75be49690 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -656,6 +656,7 @@  void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 void __mptcp_subflow_send_ack(struct sock *ssk);
 void mptcp_subflow_reset(struct sock *ssk);
 void mptcp_subflow_queue_clean(struct sock *sk, struct sock *ssk);
+void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first);
 void mptcp_sock_graft(struct sock *sk, struct socket *parent);
 u64 mptcp_wnd_end(const struct mptcp_sock *msk);
 void mptcp_set_timeout(struct sock *sk);