Message ID | 20240524193630.2007563-3-edumazet@google.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | tcp: fix tcp_poll() races | expand |
From: Eric Dumazet > Sent: 24 May 2024 20:36 > > I noticed flakes in a packetdrill test, expecting an epoll_wait() > to return EPOLLERR | EPOLLHUP on a failed connect() attempt, > after multiple SYN retransmits. It sometimes return EPOLLERR only. > > The issue is that tcp_write_err(): > 1) writes an error in sk->sk_err, > 2) calls sk_error_report(), > 3) then calls tcp_done(). > > tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown, > among other things. > > Problem is that the awaken user thread (from 2) sk_error_report()) > might call tcp_poll() before tcp_done() has written sk->sk_shutdown. > > tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR. > > This patch fixes the issue by making sure to call sk_error_report() > after tcp_done(). Isn't there still the potential for a program to call poll() at 'just the wrong time' and still see an unexpected status? ... > WRITE_ONCE(sk->sk_err, READ_ONCE(sk->sk_err_soft) ? : ETIMEDOUT); > - sk_error_report(sk); > > - tcp_write_queue_purge(sk); > - tcp_done(sk); > + tcp_done_with_error(sk); Is there scope for moving the write to sk->sk_err inside the function? Looks like it'll need a larger change to tcp_reset(). David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, 2024-05-28 at 09:19 +0000, David Laight wrote: > From: Eric Dumazet > > Sent: 24 May 2024 20:36 > > > > I noticed flakes in a packetdrill test, expecting an epoll_wait() > > to return EPOLLERR | EPOLLHUP on a failed connect() attempt, > > after multiple SYN retransmits. It sometimes return EPOLLERR only. > > > > The issue is that tcp_write_err(): > > 1) writes an error in sk->sk_err, > > 2) calls sk_error_report(), > > 3) then calls tcp_done(). > > > > tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown, > > among other things. > > > > Problem is that the awaken user thread (from 2) sk_error_report()) > > might call tcp_poll() before tcp_done() has written sk->sk_shutdown. > > > > tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR. > > > > This patch fixes the issue by making sure to call sk_error_report() > > after tcp_done(). > > Isn't there still the potential for a program to call poll() at > 'just the wrong time' and still see an unexpected status? > > ... > > WRITE_ONCE(sk->sk_err, READ_ONCE(sk->sk_err_soft) ? : ETIMEDOUT); > > - sk_error_report(sk); > > > > - tcp_write_queue_purge(sk); > > - tcp_done(sk); > > + tcp_done_with_error(sk); > > Is there scope for moving the write to sk->sk_err inside the function? Do you mean that the compiler or the CPU can reorder the WRITE_ONCE wrt tcp_done_with_error()? I think the function call prevents that. Cheers, Paolo
From: Paolo Abeni > Sent: 28 May 2024 11:36 > > On Tue, 2024-05-28 at 09:19 +0000, David Laight wrote: > > From: Eric Dumazet > > > Sent: 24 May 2024 20:36 > > > > > > I noticed flakes in a packetdrill test, expecting an epoll_wait() > > > to return EPOLLERR | EPOLLHUP on a failed connect() attempt, > > > after multiple SYN retransmits. It sometimes return EPOLLERR only. > > > > > > The issue is that tcp_write_err(): > > > 1) writes an error in sk->sk_err, > > > 2) calls sk_error_report(), > > > 3) then calls tcp_done(). > > > > > > tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown, > > > among other things. > > > > > > Problem is that the awaken user thread (from 2) sk_error_report()) > > > might call tcp_poll() before tcp_done() has written sk->sk_shutdown. > > > > > > tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR. > > > > > > This patch fixes the issue by making sure to call sk_error_report() > > > after tcp_done(). > > > > Isn't there still the potential for a program to call poll() at > > 'just the wrong time' and still see an unexpected status? > > > > ... > > > WRITE_ONCE(sk->sk_err, READ_ONCE(sk->sk_err_soft) ? : ETIMEDOUT); > > > - sk_error_report(sk); > > > > > > - tcp_write_queue_purge(sk); > > > - tcp_done(sk); > > > + tcp_done_with_error(sk); > > > > Is there scope for moving the write to sk->sk_err inside the function? > > Do you mean that the compiler or the CPU can reorder the WRITE_ONCE wrt > tcp_done_with_error()? I think the function call prevents that. No, just that the code would be easier to read with (say): tcp_done_with_error(sk, ETIMEDOUT); rather than requiring the caller do a WRITE_ONCE() prior to the call. This might also be needed in order to ensure that both POLLERR and POLLHUP always get reported. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, May 28, 2024 at 11:20 AM David Laight <David.Laight@aculab.com> wrote: > > From: Eric Dumazet > > Sent: 24 May 2024 20:36 > > > > I noticed flakes in a packetdrill test, expecting an epoll_wait() > > to return EPOLLERR | EPOLLHUP on a failed connect() attempt, > > after multiple SYN retransmits. It sometimes return EPOLLERR only. > > > > The issue is that tcp_write_err(): > > 1) writes an error in sk->sk_err, > > 2) calls sk_error_report(), > > 3) then calls tcp_done(). > > > > tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown, > > among other things. > > > > Problem is that the awaken user thread (from 2) sk_error_report()) > > might call tcp_poll() before tcp_done() has written sk->sk_shutdown. > > > > tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR. > > > > This patch fixes the issue by making sure to call sk_error_report() > > after tcp_done(). > > Isn't there still the potential for a program to call poll() at > 'just the wrong time' and still see an unexpected status? This patch does not cope with poll() results being volatile. Only epoll, because epoll logic intercepts sk_error_report() and wakes up a thread, this thread is calling back tcp_poll() shortly after. The 'after' starts really at sk_error_report(). > > ... > > WRITE_ONCE(sk->sk_err, READ_ONCE(sk->sk_err_soft) ? : ETIMEDOUT); > > - sk_error_report(sk); > > > > - tcp_write_queue_purge(sk); > > - tcp_done(sk); > > + tcp_done_with_error(sk); > > Is there scope for moving the write to sk->sk_err inside the function? > Looks like it'll need a larger change to tcp_reset(). This seems feasible, yes.
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 83fe7f62f7f10ab111512a3ef15a97a04c79cb4a..ed614fe67a587dc4238458155dda4c66d58d2687 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -75,10 +75,9 @@ u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when) static void tcp_write_err(struct sock *sk) { WRITE_ONCE(sk->sk_err, READ_ONCE(sk->sk_err_soft) ? : ETIMEDOUT); - sk_error_report(sk); - tcp_write_queue_purge(sk); - tcp_done(sk); + tcp_done_with_error(sk); + __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONTIMEOUT); }
I noticed flakes in a packetdrill test, expecting an epoll_wait() to return EPOLLERR | EPOLLHUP on a failed connect() attempt, after multiple SYN retransmits. It sometimes return EPOLLERR only. The issue is that tcp_write_err(): 1) writes an error in sk->sk_err, 2) calls sk_error_report(), 3) then calls tcp_done(). tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown, among other things. Problem is that the awaken user thread (from 2) sk_error_report()) might call tcp_poll() before tcp_done() has written sk->sk_shutdown. tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR. This patch fixes the issue by making sure to call sk_error_report() after tcp_done(). tcp_write_err() also lacks an smp_wmb(). We can reuse tcp_done_with_error() to factor out the details, as Neal suggested. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> --- net/ipv4/tcp_timer.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)