Message ID | 20231016125934.1970789-1-vschneid@redhat.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC] tcp/dcpp: Un-pin tw_timer | expand |
On Mon, Oct 16, 2023 at 3:00 PM Valentin Schneider <vschneid@redhat.com> wrote: > > The TCP timewait timer is proving to be problematic for setups where scheduler > CPU isolation is achieved at runtime via cpusets (as opposed to statically via > isolcpus=domains). > > What happens there is a CPU goes through tcp_time_wait(), arming the time_wait > timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing > interference for the now-isolated CPU. This is conceptually similar to the issue > described in > e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") > > Making the timer un-pinned would resolve this, as it would be queued onto > HK_FLAG_TIMER CPUs. It would Unfortunately go against > ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") > as we'd need to arm the timer after the *hashdance() to not have it fire before > we've finished setting up the timewait_socket. > > However, looking into this, I cannot grok what race is fixed by having the timer > *armed* before the hashdance. That was because : 1) the timer could expire before we had a chance to set refcnt to a non zero value. I guess this is fine if we use an extra atomic decrement. OR 2) another cpu could find the TW and delete it (trying to cancel the tw_timer) before we could arm the timer. ( inet_twsk_deschedule_put() is using del_timer_sync() followed by inet_twsk_kill()) Thus the tw timer would be armed for 60 seconds, then we would have to wait for the timer to really get rid of the tw structure. I think you also need to change inet_twsk_deschedule_put() logic ? > > [this next segment is brought to you by Cunningham's Law] > I guess this is not really relevant to the potential issue. > Using [1] as an example, inet_twsk_schedule() only arms the timer and increments > the deathrow refcount, which by itself does not affect > __inet_lookup_established(). AFAICT it only comes in handy if: > 1) A CPU ends up livelocking in __inet_lookup_established() (cf. [1], though per > inet_twsk_alloc() I don't see how a timewait socket can hit the > forever-looping conditions with how the sk_hash and addr/port pairs are copied) > 2) the initialization context can be interrupted by NET_RX (it can, cf. > cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer")) > > In this scenario, we need the timer to fire to go through > inet_twsk_kill() > sk_nulls_del_node_init_rcu() > and break out of the loop. > > Keep softirqs disabled, but make the timer un-pinned and arm it after the > hashdance. Remote CPUs may start using the timewait socket before the timer is > armed, but their execution of __inet_lookup_established() won't prevent the > arming of the timer. OK, I guess we can live with the following race : CPU0 allocates a tw, insert it in hash table CPU1: finds the TW and removes it (timer cancel does nothing) CPU0 arms a TW timer, lasting > > This partially reverts > ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") > and > ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") > > Link: [1] https://lore.kernel.org/lkml/56941035.9040000@fastly.com/ > Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/ > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > --- > net/dccp/minisocks.c | 18 ++++++++++-------- > net/ipv4/inet_timewait_sock.c | 9 ++++----- > net/ipv4/tcp_minisocks.c | 18 ++++++++++-------- > 3 files changed, 24 insertions(+), 21 deletions(-) > > diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c > index 64d805b27adde..188a29a1aef49 100644 > --- a/net/dccp/minisocks.c > +++ b/net/dccp/minisocks.c > @@ -53,16 +53,18 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) > if (state == DCCP_TIME_WAIT) > timeo = DCCP_TIMEWAIT_LEN; > > - /* tw_timer is pinned, so we need to make sure BH are disabled > - * in following section, otherwise timer handler could run before > - * we complete the initialization. > - */ > - local_bh_disable(); > - inet_twsk_schedule(tw, timeo); > - /* Linkage updates. > - * Note that access to tw after this point is illegal. > + /* tw_timer is armed after the hashdance and recount update, so > + * we need to make sure BH are disabled in following section to > + * ensure the timer is armed before we handle any further skb's. > */ > + local_bh_disable(); > + > + // Linkage updates > inet_twsk_hashdance(tw, sk, &dccp_hashinfo); > + inet_twsk_schedule(tw, timeo); > + // Access to tw after this point is illegal. > + inet_twsk_put(tw); > + > local_bh_enable(); > } else { > /* Sorry, if we're out of memory, just CLOSE this > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > index dd37a5bf68811..ba59c2c6ef4a2 100644 > --- a/net/ipv4/inet_timewait_sock.c > +++ b/net/ipv4/inet_timewait_sock.c > @@ -152,16 +152,15 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, > > spin_unlock(lock); > > - /* tw_refcnt is set to 3 because we have : > + /* tw_refcnt is set to 4 because we have : > * - one reference for bhash chain. > * - one reference for ehash chain. > * - one reference for timer. > + * - One reference for ourself (our caller will release it). > * We can use atomic_set() because prior spin_lock()/spin_unlock() > * committed into memory all tw fields. > - * Also note that after this point, we lost our implicit reference > - * so we are not allowed to use tw anymore. > */ > - refcount_set(&tw->tw_refcnt, 3); > + refcount_set(&tw->tw_refcnt, 4); > } > EXPORT_SYMBOL_GPL(inet_twsk_hashdance); > > @@ -207,7 +206,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, > tw->tw_prot = sk->sk_prot_creator; > atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie)); > twsk_net_set(tw, sock_net(sk)); > - timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED); > + timer_setup(&tw->tw_timer, tw_timer_handler, 0); > /* > * Because we use RCU lookups, we should not set tw_refcnt > * to a non null value before everything is setup for this > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index b98d476f1594b..269d4aa14a49e 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -324,16 +324,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo) > if (state == TCP_TIME_WAIT) > timeo = TCP_TIMEWAIT_LEN; > > - /* tw_timer is pinned, so we need to make sure BH are disabled > - * in following section, otherwise timer handler could run before > - * we complete the initialization. > - */ > - local_bh_disable(); > - inet_twsk_schedule(tw, timeo); > - /* Linkage updates. > - * Note that access to tw after this point is illegal. > + /* tw_timer is armed after the hashdance and recount update, so > + * we need to make sure BH are disabled in following section to > + * ensure the timer is armed before we handle any further skb's. > */ > + local_bh_disable(); > + > + // Linkage updates. > inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo); > + inet_twsk_schedule(tw, timeo); > + // Access to tw after this point is illegal. > + inet_twsk_put(tw); > + > local_bh_enable(); > } else { > /* Sorry, if we're out of memory, just CLOSE this > -- > 2.39.3 >
On 16/10/23 17:40, Eric Dumazet wrote: > On Mon, Oct 16, 2023 at 3:00 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> The TCP timewait timer is proving to be problematic for setups where scheduler >> CPU isolation is achieved at runtime via cpusets (as opposed to statically via >> isolcpus=domains). >> >> What happens there is a CPU goes through tcp_time_wait(), arming the time_wait >> timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing >> interference for the now-isolated CPU. This is conceptually similar to the issue >> described in >> e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") >> >> Making the timer un-pinned would resolve this, as it would be queued onto >> HK_FLAG_TIMER CPUs. It would Unfortunately go against >> ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") >> as we'd need to arm the timer after the *hashdance() to not have it fire before >> we've finished setting up the timewait_socket. >> >> However, looking into this, I cannot grok what race is fixed by having the timer >> *armed* before the hashdance. > > That was because : > > 1) the timer could expire before we had a chance to set refcnt to > a non zero value. I guess this is fine if we use an extra atomic decrement. > > OR > > 2) another cpu could find the TW and delete it (trying to cancel the > tw_timer) before > we could arm the timer. ( inet_twsk_deschedule_put() is using > del_timer_sync() followed by inet_twsk_kill()) > > Thus the tw timer would be armed for 60 seconds, then we would have to > wait for the timer to really > get rid of the tw structure. > > I think you also need to change inet_twsk_deschedule_put() logic ? > Gotcha, thank you for pointing it out. >> Keep softirqs disabled, but make the timer un-pinned and arm it after the >> hashdance. Remote CPUs may start using the timewait socket before the timer is >> armed, but their execution of __inet_lookup_established() won't prevent the >> arming of the timer. > > OK, I guess we can live with the following race : > > CPU0 > > allocates a tw, insert it in hash table > > CPU1: finds the TW and removes it (timer > cancel does nothing) > > CPU0 > arms a TW timer, lasting > Looks reasonable to me, I'll go write v2. Thanks for the help!
On Wed, Oct 18, 2023 at 4:57 PM Valentin Schneider <vschneid@redhat.com> wrote: > > Looks reasonable to me, I'll go write v2. > > Thanks for the help! Sure thing ! BTW, we also use TIMER_PINNED for req->rsk_timer, are you working on it too ?
On 18/10/23 17:00, Eric Dumazet wrote: > On Wed, Oct 18, 2023 at 4:57 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> >> Looks reasonable to me, I'll go write v2. >> >> Thanks for the help! > > Sure thing ! > > BTW, we also use TIMER_PINNED for req->rsk_timer, are you working on it too ? Ah, no, that wasn't on my radar. This hasn't shown up on our systems yet. From a cursory look it does look like it could lead to similar issues, I'll add that to my todolist. Thanks!
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c index 64d805b27adde..188a29a1aef49 100644 --- a/net/dccp/minisocks.c +++ b/net/dccp/minisocks.c @@ -53,16 +53,18 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) if (state == DCCP_TIME_WAIT) timeo = DCCP_TIMEWAIT_LEN; - /* tw_timer is pinned, so we need to make sure BH are disabled - * in following section, otherwise timer handler could run before - * we complete the initialization. - */ - local_bh_disable(); - inet_twsk_schedule(tw, timeo); - /* Linkage updates. - * Note that access to tw after this point is illegal. + /* tw_timer is armed after the hashdance and recount update, so + * we need to make sure BH are disabled in following section to + * ensure the timer is armed before we handle any further skb's. */ + local_bh_disable(); + + // Linkage updates inet_twsk_hashdance(tw, sk, &dccp_hashinfo); + inet_twsk_schedule(tw, timeo); + // Access to tw after this point is illegal. + inet_twsk_put(tw); + local_bh_enable(); } else { /* Sorry, if we're out of memory, just CLOSE this diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index dd37a5bf68811..ba59c2c6ef4a2 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -152,16 +152,15 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, spin_unlock(lock); - /* tw_refcnt is set to 3 because we have : + /* tw_refcnt is set to 4 because we have : * - one reference for bhash chain. * - one reference for ehash chain. * - one reference for timer. + * - One reference for ourself (our caller will release it). * We can use atomic_set() because prior spin_lock()/spin_unlock() * committed into memory all tw fields. - * Also note that after this point, we lost our implicit reference - * so we are not allowed to use tw anymore. */ - refcount_set(&tw->tw_refcnt, 3); + refcount_set(&tw->tw_refcnt, 4); } EXPORT_SYMBOL_GPL(inet_twsk_hashdance); @@ -207,7 +206,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, tw->tw_prot = sk->sk_prot_creator; atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie)); twsk_net_set(tw, sock_net(sk)); - timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED); + timer_setup(&tw->tw_timer, tw_timer_handler, 0); /* * Because we use RCU lookups, we should not set tw_refcnt * to a non null value before everything is setup for this diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index b98d476f1594b..269d4aa14a49e 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -324,16 +324,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo) if (state == TCP_TIME_WAIT) timeo = TCP_TIMEWAIT_LEN; - /* tw_timer is pinned, so we need to make sure BH are disabled - * in following section, otherwise timer handler could run before - * we complete the initialization. - */ - local_bh_disable(); - inet_twsk_schedule(tw, timeo); - /* Linkage updates. - * Note that access to tw after this point is illegal. + /* tw_timer is armed after the hashdance and recount update, so + * we need to make sure BH are disabled in following section to + * ensure the timer is armed before we handle any further skb's. */ + local_bh_disable(); + + // Linkage updates. inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo); + inet_twsk_schedule(tw, timeo); + // Access to tw after this point is illegal. + inet_twsk_put(tw); + local_bh_enable(); } else { /* Sorry, if we're out of memory, just CLOSE this
The TCP timewait timer is proving to be problematic for setups where scheduler CPU isolation is achieved at runtime via cpusets (as opposed to statically via isolcpus=domains). What happens there is a CPU goes through tcp_time_wait(), arming the time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing interference for the now-isolated CPU. This is conceptually similar to the issue described in e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") Making the timer un-pinned would resolve this, as it would be queued onto HK_FLAG_TIMER CPUs. It would Unfortunately go against ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") as we'd need to arm the timer after the *hashdance() to not have it fire before we've finished setting up the timewait_socket. However, looking into this, I cannot grok what race is fixed by having the timer *armed* before the hashdance. [this next segment is brought to you by Cunningham's Law] Using [1] as an example, inet_twsk_schedule() only arms the timer and increments the deathrow refcount, which by itself does not affect __inet_lookup_established(). AFAICT it only comes in handy if: 1) A CPU ends up livelocking in __inet_lookup_established() (cf. [1], though per inet_twsk_alloc() I don't see how a timewait socket can hit the forever-looping conditions with how the sk_hash and addr/port pairs are copied) 2) the initialization context can be interrupted by NET_RX (it can, cf. cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer")) In this scenario, we need the timer to fire to go through inet_twsk_kill() sk_nulls_del_node_init_rcu() and break out of the loop. Keep softirqs disabled, but make the timer un-pinned and arm it after the hashdance. Remote CPUs may start using the timewait socket before the timer is armed, but their execution of __inet_lookup_established() won't prevent the arming of the timer. This partially reverts ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") and ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") Link: [1] https://lore.kernel.org/lkml/56941035.9040000@fastly.com/ Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/ Signed-off-by: Valentin Schneider <vschneid@redhat.com> --- net/dccp/minisocks.c | 18 ++++++++++-------- net/ipv4/inet_timewait_sock.c | 9 ++++----- net/ipv4/tcp_minisocks.c | 18 ++++++++++-------- 3 files changed, 24 insertions(+), 21 deletions(-)