diff mbox series

[net,v2] Revert "tcp: change pingpong threshold to 3"

Message ID 20220721204404.388396-1-weiwan@google.com (mailing list archive)
State Accepted
Commit 4d8f24eeedc58d5f87b650ddda73c16e8ba56559
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] Revert "tcp: change pingpong threshold to 3" | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2278 this patch: 2278
netdev/cc_maintainers warning 3 maintainers not CCed: dsahern@kernel.org yoshfuji@linux-ipv6.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 599 this patch: 599
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2412 this patch: 2412
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 44 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Wei Wang July 21, 2022, 8:44 p.m. UTC
This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.

This to-be-reverted commit was meant to apply a stricter rule for the
stack to enter pingpong mode. However, the condition used to check for
interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
jiffy based and might be too coarse, which delays the stack entering
pingpong mode.
We revert this patch so that we no longer use the above condition to
determine interactive session, and also reduce pingpong threshold to 1.

Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
Reported-by: LemmyHuang <hlm3280@163.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>

---
v2: added Fixes tag

---
 include/net/inet_connection_sock.h | 10 +---------
 net/ipv4/tcp_output.c              | 15 ++++++---------
 2 files changed, 7 insertions(+), 18 deletions(-)

Comments

Neal Cardwell July 21, 2022, 8:55 p.m. UTC | #1
On Thu, Jul 21, 2022 at 4:44 PM Wei Wang <weiwan@google.com> wrote:
>
> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
>
> This to-be-reverted commit was meant to apply a stricter rule for the
> stack to enter pingpong mode. However, the condition used to check for
> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> jiffy based and might be too coarse, which delays the stack entering
> pingpong mode.
> We revert this patch so that we no longer use the above condition to
> determine interactive session, and also reduce pingpong threshold to 1.
>
> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> Reported-by: LemmyHuang <hlm3280@163.com>
> Suggested-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
>
> ---
> v2: added Fixes tag

Acked-by: Neal Cardwell <ncardwell@google.com>

Looks great to me. Thanks, Wei!

neal
Eric Dumazet July 22, 2022, 9:12 a.m. UTC | #2
On Thu, Jul 21, 2022 at 10:56 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Thu, Jul 21, 2022 at 4:44 PM Wei Wang <weiwan@google.com> wrote:
> >
> > This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> >
> > This to-be-reverted commit was meant to apply a stricter rule for the
> > stack to enter pingpong mode. However, the condition used to check for
> > interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> > jiffy based and might be too coarse, which delays the stack entering
> > pingpong mode.
> > We revert this patch so that we no longer use the above condition to
> > determine interactive session, and also reduce pingpong threshold to 1.
> >
> > Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> > Reported-by: LemmyHuang <hlm3280@163.com>
> > Suggested-by: Neal Cardwell <ncardwell@google.com>
> > Signed-off-by: Wei Wang <weiwan@google.com>
> >
> > ---
> > v2: added Fixes tag
>
> Acked-by: Neal Cardwell <ncardwell@google.com>
>
> Looks great to me. Thanks, Wei!

Reviewed-by: Eric Dumazet <edumazet@google.com>
patchwork-bot+netdevbpf@kernel.org July 22, 2022, 11:40 p.m. UTC | #3
Hello:

This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 21 Jul 2022 20:44:04 +0000 you wrote:
> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> 
> This to-be-reverted commit was meant to apply a stricter rule for the
> stack to enter pingpong mode. However, the condition used to check for
> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> jiffy based and might be too coarse, which delays the stack entering
> pingpong mode.
> We revert this patch so that we no longer use the above condition to
> determine interactive session, and also reduce pingpong threshold to 1.
> 
> [...]

Here is the summary with links:
  - [net,v2] Revert "tcp: change pingpong threshold to 3"
    https://git.kernel.org/netdev/net/c/4d8f24eeedc5

You are awesome, thank you!
Jiri Slaby Aug. 6, 2022, 10:02 a.m. UTC | #4
On 21. 07. 22, 22:44, Wei Wang wrote:
> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> 
> This to-be-reverted commit was meant to apply a stricter rule for the
> stack to enter pingpong mode. However, the condition used to check for
> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> jiffy based and might be too coarse, which delays the stack entering
> pingpong mode.
> We revert this patch so that we no longer use the above condition to
> determine interactive session, and also reduce pingpong threshold to 1.
> 
> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> Reported-by: LemmyHuang <hlm3280@163.com>
> Suggested-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Wei Wang <weiwan@google.com>


This breaks python-eventlet [1] (and was backported to stable trees):
________________ TestHttpd.test_018b_http_10_keepalive_framing
Neal Cardwell Aug. 6, 2022, 11:24 a.m. UTC | #5
On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>
> On 21. 07. 22, 22:44, Wei Wang wrote:
> > This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> >
> > This to-be-reverted commit was meant to apply a stricter rule for the
> > stack to enter pingpong mode. However, the condition used to check for
> > interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> > jiffy based and might be too coarse, which delays the stack entering
> > pingpong mode.
> > We revert this patch so that we no longer use the above condition to
> > determine interactive session, and also reduce pingpong threshold to 1.
> >
> > Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> > Reported-by: LemmyHuang <hlm3280@163.com>
> > Suggested-by: Neal Cardwell <ncardwell@google.com>
> > Signed-off-by: Wei Wang <weiwan@google.com>
>
>
> This breaks python-eventlet [1] (and was backported to stable trees):
> ________________ TestHttpd.test_018b_http_10_keepalive_framing
> _________________
>
> self = <tests.wsgi_test.TestHttpd
> testMethod=test_018b_http_10_keepalive_framing>
>
>      def test_018b_http_10_keepalive_framing(self):
>          # verify that if an http/1.0 client sends connection: keep-alive
>          # that we don't mangle the request framing if the app doesn't
> read the request
>          def app(environ, start_response):
>              resp_body = {
>                  '/1': b'first response',
>                  '/2': b'second response',
>                  '/3': b'third response',
>              }.get(environ['PATH_INFO'])
>              if resp_body is None:
>                  resp_body = 'Unexpected path: ' + environ['PATH_INFO']
>                  if six.PY3:
>                      resp_body = resp_body.encode('latin1')
>              # Never look at wsgi.input!
>              start_response('200 OK', [('Content-type', 'text/plain')])
>              return [resp_body]
>
>          self.site.application = app
>          sock = eventlet.connect(self.server_addr)
>          req_body = b'GET /tricksy HTTP/1.1\r\n'
>          body_len = str(len(req_body)).encode('ascii')
>
>          sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
> localhost\r\nConnection: keep-alive\r\n'
>                       b'Content-Length: ' + body_len + b'\r\n\r\n' +
> req_body)
>          result1 = read_http(sock)
>          self.assertEqual(b'first response', result1.body)
>          self.assertEqual(result1.headers_original.get('Connection'),
> 'keep-alive')
>
>          sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
> localhost\r\nConnection: keep-alive\r\n'
>                       b'Content-Length: ' + body_len + b'\r\nExpect:
> 100-continue\r\n\r\n')
>          # Client may have a short timeout waiting on that 100 Continue
>          # and basically immediately send its body
>          sock.sendall(req_body)
>          result2 = read_http(sock)
>          self.assertEqual(b'second response', result2.body)
>          self.assertEqual(result2.headers_original.get('Connection'),
> 'close')
>
>  >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
> localhost\r\nConnection: close\r\n\r\n')
>
> tests/wsgi_test.py:648:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _
> eventlet/greenio/base.py:407: in sendall
>      tail = self.send(data, flags)
> eventlet/greenio/base.py:401: in send
>      return self._send_loop(self.fd.send, data, flags)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _
>
> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection: close\r\n\r\n'
> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
>
>      def _send_loop(self, send_method, data, *args):
>          if self.act_non_blocking:
>              return send_method(data, *args)
>
>          _timeout_exc = socket_timeout('timed out')
>          while True:
>              try:
>  >               return send_method(data, *args)
> E               BrokenPipeError: [Errno 32] Broken pipe
>
> eventlet/greenio/base.py:388: BrokenPipeError
> ====================
>
> Reverting this revert on the top of 5.19 solves the issue.
>
> Any ideas?

Interesting. This revert should return the kernel back to the delayed
ACK behavior it had for many years before May 2019 and Linux 5.1,
which contains the commit it is reverting:

  4a41f453bedfd tcp: change pingpong threshold to 3

It sounds like perhaps this test you mention has an implicit
dependence on the timing of delayed ACKs.

A few questions:

(1) What are the timeout values in this test? If there is some
implicit or explicit timeout value less than the typical Linux TCP
40ms delayed ACK timer value then this could be the problem. If you
make sure all timeouts are at least, say, 300ms then this should
remove dependencies on delayed ACK behavior (and make the test more
portable).

(2) Does this test use the TCP_NODELAY socket option to disable
Nagle's algorithm? Presumably it should, given that it's a network app
that cares about latency. Omitting the TCP_NODELAY socket option can
cause request/response traffic to depend on delayed ACK behavior.

(3) If (1) and (2) do not fix the test, would you be able to provide
binary .pcap traces of the behavior with the test (a) passing and (b)
failing? For example:
   sudo tcpdump -i any -w /tmp/trace.pcap -s 100 port 80 &
   # run test
   killall tcpdump

thanks,
neal
Jiri Slaby Aug. 6, 2022, 2:41 p.m. UTC | #6
On 06. 08. 22, 13:24, Neal Cardwell wrote:
> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>>
>> On 21. 07. 22, 22:44, Wei Wang wrote:
>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
>>>
>>> This to-be-reverted commit was meant to apply a stricter rule for the
>>> stack to enter pingpong mode. However, the condition used to check for
>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
>>> jiffy based and might be too coarse, which delays the stack entering
>>> pingpong mode.
>>> We revert this patch so that we no longer use the above condition to
>>> determine interactive session, and also reduce pingpong threshold to 1.
>>>
>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
>>> Reported-by: LemmyHuang <hlm3280@163.com>
>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
>>> Signed-off-by: Wei Wang <weiwan@google.com>
>>
>>
>> This breaks python-eventlet [1] (and was backported to stable trees):
>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
>> _________________
>>
>> self = <tests.wsgi_test.TestHttpd
>> testMethod=test_018b_http_10_keepalive_framing>
>>
>>       def test_018b_http_10_keepalive_framing(self):
>>           # verify that if an http/1.0 client sends connection: keep-alive
>>           # that we don't mangle the request framing if the app doesn't
>> read the request
>>           def app(environ, start_response):
>>               resp_body = {
>>                   '/1': b'first response',
>>                   '/2': b'second response',
>>                   '/3': b'third response',
>>               }.get(environ['PATH_INFO'])
>>               if resp_body is None:
>>                   resp_body = 'Unexpected path: ' + environ['PATH_INFO']
>>                   if six.PY3:
>>                       resp_body = resp_body.encode('latin1')
>>               # Never look at wsgi.input!
>>               start_response('200 OK', [('Content-type', 'text/plain')])
>>               return [resp_body]
>>
>>           self.site.application = app
>>           sock = eventlet.connect(self.server_addr)
>>           req_body = b'GET /tricksy HTTP/1.1\r\n'
>>           body_len = str(len(req_body)).encode('ascii')
>>
>>           sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
>> localhost\r\nConnection: keep-alive\r\n'
>>                        b'Content-Length: ' + body_len + b'\r\n\r\n' +
>> req_body)
>>           result1 = read_http(sock)
>>           self.assertEqual(b'first response', result1.body)
>>           self.assertEqual(result1.headers_original.get('Connection'),
>> 'keep-alive')
>>
>>           sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
>> localhost\r\nConnection: keep-alive\r\n'
>>                        b'Content-Length: ' + body_len + b'\r\nExpect:
>> 100-continue\r\n\r\n')
>>           # Client may have a short timeout waiting on that 100 Continue
>>           # and basically immediately send its body
>>           sock.sendall(req_body)
>>           result2 = read_http(sock)
>>           self.assertEqual(b'second response', result2.body)
>>           self.assertEqual(result2.headers_original.get('Connection'),
>> 'close')
>>
>>   >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
>> localhost\r\nConnection: close\r\n\r\n')
>>
>> tests/wsgi_test.py:648:
>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>> _ _ _ _
>> eventlet/greenio/base.py:407: in sendall
>>       tail = self.send(data, flags)
>> eventlet/greenio/base.py:401: in send
>>       return self._send_loop(self.fd.send, data, flags)
>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>> _ _ _ _
>>
>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection: close\r\n\r\n'
>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
>>
>>       def _send_loop(self, send_method, data, *args):
>>           if self.act_non_blocking:
>>               return send_method(data, *args)
>>
>>           _timeout_exc = socket_timeout('timed out')
>>           while True:
>>               try:
>>   >               return send_method(data, *args)
>> E               BrokenPipeError: [Errno 32] Broken pipe
>>
>> eventlet/greenio/base.py:388: BrokenPipeError
>> ====================
>>
>> Reverting this revert on the top of 5.19 solves the issue.
>>
>> Any ideas?
> 
> Interesting. This revert should return the kernel back to the delayed
> ACK behavior it had for many years before May 2019 and Linux 5.1,
> which contains the commit it is reverting:
> 
>    4a41f453bedfd tcp: change pingpong threshold to 3
> 
> It sounds like perhaps this test you mention has an implicit
> dependence on the timing of delayed ACKs.
> 
> A few questions:

Dunno. I am only an openSUSE kernel maintainer and this popped out at 
me. Feel free to dig to eventlet's sources on your own :P.

> (1) What are the timeout values in this test? If there is some
> implicit or explicit timeout value less than the typical Linux TCP
> 40ms delayed ACK timer value then this could be the problem. If you
> make sure all timeouts are at least, say, 300ms then this should
> remove dependencies on delayed ACK behavior (and make the test more
> portable).
> 
> (2) Does this test use the TCP_NODELAY socket option to disable
> Nagle's algorithm? Presumably it should, given that it's a network app
> that cares about latency. Omitting the TCP_NODELAY socket option can
> cause request/response traffic to depend on delayed ACK behavior.
> 
> (3) If (1) and (2) do not fix the test, would you be able to provide
> binary .pcap traces of the behavior with the test (a) passing and (b)
> failing? For example:
>     sudo tcpdump -i any -w /tmp/trace.pcap -s 100 port 80 &
>     # run test
>     killall tcpdump
> 
> thanks,
> neal
Jiri Slaby Aug. 15, 2022, 7:48 a.m. UTC | #7
On 06. 08. 22, 16:41, Jiri Slaby wrote:
> On 06. 08. 22, 13:24, Neal Cardwell wrote:
>> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>>>
>>> On 21. 07. 22, 22:44, Wei Wang wrote:
>>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
>>>>
>>>> This to-be-reverted commit was meant to apply a stricter rule for the
>>>> stack to enter pingpong mode. However, the condition used to check for
>>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
>>>> jiffy based and might be too coarse, which delays the stack entering
>>>> pingpong mode.
>>>> We revert this patch so that we no longer use the above condition to
>>>> determine interactive session, and also reduce pingpong threshold to 1.
>>>>
>>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
>>>> Reported-by: LemmyHuang <hlm3280@163.com>
>>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
>>>> Signed-off-by: Wei Wang <weiwan@google.com>
>>>
>>>
>>> This breaks python-eventlet [1] (and was backported to stable trees):
>>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
>>> _________________
>>>
>>> self = <tests.wsgi_test.TestHttpd
>>> testMethod=test_018b_http_10_keepalive_framing>
>>>
>>>       def test_018b_http_10_keepalive_framing(self):
>>>           # verify that if an http/1.0 client sends connection: 
>>> keep-alive
>>>           # that we don't mangle the request framing if the app doesn't
>>> read the request
>>>           def app(environ, start_response):
>>>               resp_body = {
>>>                   '/1': b'first response',
>>>                   '/2': b'second response',
>>>                   '/3': b'third response',
>>>               }.get(environ['PATH_INFO'])
>>>               if resp_body is None:
>>>                   resp_body = 'Unexpected path: ' + environ['PATH_INFO']
>>>                   if six.PY3:
>>>                       resp_body = resp_body.encode('latin1')
>>>               # Never look at wsgi.input!
>>>               start_response('200 OK', [('Content-type', 'text/plain')])
>>>               return [resp_body]
>>>
>>>           self.site.application = app
>>>           sock = eventlet.connect(self.server_addr)
>>>           req_body = b'GET /tricksy HTTP/1.1\r\n'
>>>           body_len = str(len(req_body)).encode('ascii')
>>>
>>>           sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
>>> localhost\r\nConnection: keep-alive\r\n'
>>>                        b'Content-Length: ' + body_len + b'\r\n\r\n' +
>>> req_body)
>>>           result1 = read_http(sock)
>>>           self.assertEqual(b'first response', result1.body)
>>>           self.assertEqual(result1.headers_original.get('Connection'),
>>> 'keep-alive')
>>>
>>>           sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
>>> localhost\r\nConnection: keep-alive\r\n'
>>>                        b'Content-Length: ' + body_len + b'\r\nExpect:
>>> 100-continue\r\n\r\n')
>>>           # Client may have a short timeout waiting on that 100 Continue
>>>           # and basically immediately send its body
>>>           sock.sendall(req_body)
>>>           result2 = read_http(sock)
>>>           self.assertEqual(b'second response', result2.body)
>>>           self.assertEqual(result2.headers_original.get('Connection'),
>>> 'close')
>>>
>>>   >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
>>> localhost\r\nConnection: close\r\n\r\n')
>>>
>>> tests/wsgi_test.py:648:
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> _ _ _ _
>>> eventlet/greenio/base.py:407: in sendall
>>>       tail = self.send(data, flags)
>>> eventlet/greenio/base.py:401: in send
>>>       return self._send_loop(self.fd.send, data, flags)
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> _ _ _ _
>>>
>>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
>>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
>>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection: 
>>> close\r\n\r\n'
>>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
>>>
>>>       def _send_loop(self, send_method, data, *args):
>>>           if self.act_non_blocking:
>>>               return send_method(data, *args)
>>>
>>>           _timeout_exc = socket_timeout('timed out')
>>>           while True:
>>>               try:
>>>   >               return send_method(data, *args)
>>> E               BrokenPipeError: [Errno 32] Broken pipe
>>>
>>> eventlet/greenio/base.py:388: BrokenPipeError
>>> ====================
>>>
>>> Reverting this revert on the top of 5.19 solves the issue.
>>>
>>> Any ideas?
>>
>> Interesting. This revert should return the kernel back to the delayed
>> ACK behavior it had for many years before May 2019 and Linux 5.1,
>> which contains the commit it is reverting:
>>
>>    4a41f453bedfd tcp: change pingpong threshold to 3
>>
>> It sounds like perhaps this test you mention has an implicit
>> dependence on the timing of delayed ACKs.
>>
>> A few questions:
> 
> Dunno. I am only an openSUSE kernel maintainer and this popped out at 
> me. Feel free to dig to eventlet's sources on your own :P.

Any updates on this or should I send a revert directly?

The "before() &&" part of the patch makes the difference. That is this diff:
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
          * and it is a reply for ato after last received packet,
          * increase pingpong count.
          */
-       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
-           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
+       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u 
ping=%u\n",
+                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
+                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
+                       inet_csk(sk)->icsk_ack.pingpong);
+       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
+           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
                 inet_csk_inc_pingpong_cnt(sk);
+               pr_info("\tINC ping=%u before=%u\n",
+                               inet_csk(sk)->icsk_ack.pingpong,
+                               before(tp->lsndtime, 
icsk->icsk_ack.lrcvtime));
+       }

         tp->lsndtime = now;
  }

makes it work again, and outputs this:

 > TCP: tcp_event_data_sent: sk=00000000fd67cf8d 
(100007f0100007f:e858b18b) now=4294902140 lsndtime=4294902140 
lrcvtime=4294902140 ping=0
 > TCP: tcp_event_data_sent: sk=00000000a4becf82 
(100007f0100007f:8bb158e8) now=4294902143 lsndtime=4294902140 
lrcvtime=4294902142 ping=0
 > TCP:     INC ping=1 before=1
 > TCP: tcp_event_data_sent: sk=00000000fd67cf8d 
(100007f0100007f:e858b18b) now=4294902145 lsndtime=4294902140 
lrcvtime=4294902144 ping=0
 > TCP:     INC ping=1 before=1
 > TCP: tcp_event_data_sent: sk=00000000fd67cf8d 
(100007f0100007f:e858b18b) now=4294902147 lsndtime=4294902145 
lrcvtime=4294902144 ping=1
 > TCP:     INC ping=2 before=0

IMO, this "before=0" is the "source" of the problem. But I have no idea 
what this means at all...

 > TCP: tcp_event_data_sent: sk=00000000a4becf82 
(100007f0100007f:8bb158e8) now=4294902149 lsndtime=4294902143 
lrcvtime=4294902148 ping=1
 > TCP:     INC ping=2 before=1
 > TCP: tcp_event_data_sent: sk=00000000fd67cf8d 
(100007f0100007f:e858b18b) now=4294902151 lsndtime=4294902147 
lrcvtime=4294902150 ping=3
 > TCP:     INC ping=4 before=1
 > TCP: tcp_event_data_sent: sk=00000000c7a417e9 
(100007f0100007f:e85ab18b) now=4294902153 lsndtime=4294902153 
lrcvtime=4294902153 ping=0
 > TCP: tcp_event_data_sent: sk=000000008681183e 
(100007f0100007f:8bb15ae8) now=4294902155 lsndtime=4294902153 
lrcvtime=4294902154 ping=0
 > TCP:     INC ping=1 before=1



>> (1) What are the timeout values in this test? If there is some
>> implicit or explicit timeout value less than the typical Linux TCP
>> 40ms delayed ACK timer value then this could be the problem. If you
>> make sure all timeouts are at least, say, 300ms then this should
>> remove dependencies on delayed ACK behavior (and make the test more
>> portable).
>>
>> (2) Does this test use the TCP_NODELAY socket option to disable
>> Nagle's algorithm? Presumably it should, given that it's a network app
>> that cares about latency. Omitting the TCP_NODELAY socket option can
>> cause request/response traffic to depend on delayed ACK behavior.
>>
>> (3) If (1) and (2) do not fix the test, would you be able to provide
>> binary .pcap traces of the behavior with the test (a) passing and (b)
>> failing? For example:
>>     sudo tcpdump -i any -w /tmp/trace.pcap -s 100 port 80 &
>>     # run test
>>     killall tcpdump
>>
>> thanks,
>> neal
> 
> 

thanks,
Neal Cardwell Aug. 15, 2022, 1:30 p.m. UTC | #8
On Mon, Aug 15, 2022 at 3:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>
> On 06. 08. 22, 16:41, Jiri Slaby wrote:
> > On 06. 08. 22, 13:24, Neal Cardwell wrote:
> >> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> >>>
> >>> On 21. 07. 22, 22:44, Wei Wang wrote:
> >>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> >>>>
> >>>> This to-be-reverted commit was meant to apply a stricter rule for the
> >>>> stack to enter pingpong mode. However, the condition used to check for
> >>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> >>>> jiffy based and might be too coarse, which delays the stack entering
> >>>> pingpong mode.
> >>>> We revert this patch so that we no longer use the above condition to
> >>>> determine interactive session, and also reduce pingpong threshold to 1.
> >>>>
> >>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> >>>> Reported-by: LemmyHuang <hlm3280@163.com>
> >>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
> >>>> Signed-off-by: Wei Wang <weiwan@google.com>
> >>>
> >>>
> >>> This breaks python-eventlet [1] (and was backported to stable trees):
> >>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
> >>> _________________
> >>>
> >>> self = <tests.wsgi_test.TestHttpd
> >>> testMethod=test_018b_http_10_keepalive_framing>
> >>>
> >>>       def test_018b_http_10_keepalive_framing(self):
> >>>           # verify that if an http/1.0 client sends connection:
> >>> keep-alive
> >>>           # that we don't mangle the request framing if the app doesn't
> >>> read the request
> >>>           def app(environ, start_response):
> >>>               resp_body = {
> >>>                   '/1': b'first response',
> >>>                   '/2': b'second response',
> >>>                   '/3': b'third response',
> >>>               }.get(environ['PATH_INFO'])
> >>>               if resp_body is None:
> >>>                   resp_body = 'Unexpected path: ' + environ['PATH_INFO']
> >>>                   if six.PY3:
> >>>                       resp_body = resp_body.encode('latin1')
> >>>               # Never look at wsgi.input!
> >>>               start_response('200 OK', [('Content-type', 'text/plain')])
> >>>               return [resp_body]
> >>>
> >>>           self.site.application = app
> >>>           sock = eventlet.connect(self.server_addr)
> >>>           req_body = b'GET /tricksy HTTP/1.1\r\n'
> >>>           body_len = str(len(req_body)).encode('ascii')
> >>>
> >>>           sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
> >>> localhost\r\nConnection: keep-alive\r\n'
> >>>                        b'Content-Length: ' + body_len + b'\r\n\r\n' +
> >>> req_body)
> >>>           result1 = read_http(sock)
> >>>           self.assertEqual(b'first response', result1.body)
> >>>           self.assertEqual(result1.headers_original.get('Connection'),
> >>> 'keep-alive')
> >>>
> >>>           sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
> >>> localhost\r\nConnection: keep-alive\r\n'
> >>>                        b'Content-Length: ' + body_len + b'\r\nExpect:
> >>> 100-continue\r\n\r\n')
> >>>           # Client may have a short timeout waiting on that 100 Continue
> >>>           # and basically immediately send its body
> >>>           sock.sendall(req_body)
> >>>           result2 = read_http(sock)
> >>>           self.assertEqual(b'second response', result2.body)
> >>>           self.assertEqual(result2.headers_original.get('Connection'),
> >>> 'close')
> >>>
> >>>   >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
> >>> localhost\r\nConnection: close\r\n\r\n')
> >>>
> >>> tests/wsgi_test.py:648:
> >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>> _ _ _ _
> >>> eventlet/greenio/base.py:407: in sendall
> >>>       tail = self.send(data, flags)
> >>> eventlet/greenio/base.py:401: in send
> >>>       return self._send_loop(self.fd.send, data, flags)
> >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>> _ _ _ _
> >>>
> >>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
> >>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
> >>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection:
> >>> close\r\n\r\n'
> >>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
> >>>
> >>>       def _send_loop(self, send_method, data, *args):
> >>>           if self.act_non_blocking:
> >>>               return send_method(data, *args)
> >>>
> >>>           _timeout_exc = socket_timeout('timed out')
> >>>           while True:
> >>>               try:
> >>>   >               return send_method(data, *args)
> >>> E               BrokenPipeError: [Errno 32] Broken pipe
> >>>
> >>> eventlet/greenio/base.py:388: BrokenPipeError
> >>> ====================
> >>>
> >>> Reverting this revert on the top of 5.19 solves the issue.
> >>>
> >>> Any ideas?
> >>
> >> Interesting. This revert should return the kernel back to the delayed
> >> ACK behavior it had for many years before May 2019 and Linux 5.1,
> >> which contains the commit it is reverting:
> >>
> >>    4a41f453bedfd tcp: change pingpong threshold to 3
> >>
> >> It sounds like perhaps this test you mention has an implicit
> >> dependence on the timing of delayed ACKs.
> >>
> >> A few questions:
> >
> > Dunno. I am only an openSUSE kernel maintainer and this popped out at
> > me. Feel free to dig to eventlet's sources on your own :P.
>
> Any updates on this or should I send a revert directly?
>
> The "before() &&" part of the patch makes the difference. That is this diff:
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
>           * and it is a reply for ato after last received packet,
>           * increase pingpong count.
>           */
> -       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> -           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> +       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u
> ping=%u\n",
> +                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
> +                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
> +                       inet_csk(sk)->icsk_ack.pingpong);
> +       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
>                  inet_csk_inc_pingpong_cnt(sk);
> +               pr_info("\tINC ping=%u before=%u\n",
> +                               inet_csk(sk)->icsk_ack.pingpong,
> +                               before(tp->lsndtime,
> icsk->icsk_ack.lrcvtime));
> +       }
>
>          tp->lsndtime = now;
>   }
>
> makes it work again, and outputs this:
>
>  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> (100007f0100007f:e858b18b) now=4294902140 lsndtime=4294902140
> lrcvtime=4294902140 ping=0
>  > TCP: tcp_event_data_sent: sk=00000000a4becf82
> (100007f0100007f:8bb158e8) now=4294902143 lsndtime=4294902140
> lrcvtime=4294902142 ping=0
>  > TCP:     INC ping=1 before=1
>  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> (100007f0100007f:e858b18b) now=4294902145 lsndtime=4294902140
> lrcvtime=4294902144 ping=0
>  > TCP:     INC ping=1 before=1
>  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> (100007f0100007f:e858b18b) now=4294902147 lsndtime=4294902145
> lrcvtime=4294902144 ping=1
>  > TCP:     INC ping=2 before=0
>
> IMO, this "before=0" is the "source" of the problem. But I have no idea
> what this means at all...
>
>  > TCP: tcp_event_data_sent: sk=00000000a4becf82
> (100007f0100007f:8bb158e8) now=4294902149 lsndtime=4294902143
> lrcvtime=4294902148 ping=1
>  > TCP:     INC ping=2 before=1
>  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> (100007f0100007f:e858b18b) now=4294902151 lsndtime=4294902147
> lrcvtime=4294902150 ping=3
>  > TCP:     INC ping=4 before=1
>  > TCP: tcp_event_data_sent: sk=00000000c7a417e9
> (100007f0100007f:e85ab18b) now=4294902153 lsndtime=4294902153
> lrcvtime=4294902153 ping=0
>  > TCP: tcp_event_data_sent: sk=000000008681183e
> (100007f0100007f:8bb15ae8) now=4294902155 lsndtime=4294902153
> lrcvtime=4294902154 ping=0
>  > TCP:     INC ping=1 before=1

It sounds like this test has a very specific dependence on the buggy
delayed ACK timing behavior from the buggy commit
4a41f453bedfd5e9cd040bad509d9da49feb3e2c.

IMHO I don't think we can revert a kernel bug fix based on a test that
decided to depend on the exact timing of delayed ACKs during a time
when that delayed ACK behavior was buggy. :-)

best regards,
neal
Wei Wang Aug. 15, 2022, 6:54 p.m. UTC | #9
On Mon, Aug 15, 2022 at 6:30 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Aug 15, 2022 at 3:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> >
> > On 06. 08. 22, 16:41, Jiri Slaby wrote:
> > > On 06. 08. 22, 13:24, Neal Cardwell wrote:
> > >> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> > >>>
> > >>> On 21. 07. 22, 22:44, Wei Wang wrote:
> > >>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> > >>>>
> > >>>> This to-be-reverted commit was meant to apply a stricter rule for the
> > >>>> stack to enter pingpong mode. However, the condition used to check for
> > >>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> > >>>> jiffy based and might be too coarse, which delays the stack entering
> > >>>> pingpong mode.
> > >>>> We revert this patch so that we no longer use the above condition to
> > >>>> determine interactive session, and also reduce pingpong threshold to 1.
> > >>>>
> > >>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> > >>>> Reported-by: LemmyHuang <hlm3280@163.com>
> > >>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
> > >>>> Signed-off-by: Wei Wang <weiwan@google.com>
> > >>>
> > >>>
> > >>> This breaks python-eventlet [1] (and was backported to stable trees):
> > >>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
> > >>> _________________
> > >>>
> > >>> self = <tests.wsgi_test.TestHttpd
> > >>> testMethod=test_018b_http_10_keepalive_framing>
> > >>>
> > >>>       def test_018b_http_10_keepalive_framing(self):
> > >>>           # verify that if an http/1.0 client sends connection:
> > >>> keep-alive
> > >>>           # that we don't mangle the request framing if the app doesn't
> > >>> read the request
> > >>>           def app(environ, start_response):
> > >>>               resp_body = {
> > >>>                   '/1': b'first response',
> > >>>                   '/2': b'second response',
> > >>>                   '/3': b'third response',
> > >>>               }.get(environ['PATH_INFO'])
> > >>>               if resp_body is None:
> > >>>                   resp_body = 'Unexpected path: ' + environ['PATH_INFO']
> > >>>                   if six.PY3:
> > >>>                       resp_body = resp_body.encode('latin1')
> > >>>               # Never look at wsgi.input!
> > >>>               start_response('200 OK', [('Content-type', 'text/plain')])
> > >>>               return [resp_body]
> > >>>
> > >>>           self.site.application = app
> > >>>           sock = eventlet.connect(self.server_addr)
> > >>>           req_body = b'GET /tricksy HTTP/1.1\r\n'
> > >>>           body_len = str(len(req_body)).encode('ascii')
> > >>>
> > >>>           sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
> > >>> localhost\r\nConnection: keep-alive\r\n'
> > >>>                        b'Content-Length: ' + body_len + b'\r\n\r\n' +
> > >>> req_body)
> > >>>           result1 = read_http(sock)
> > >>>           self.assertEqual(b'first response', result1.body)
> > >>>           self.assertEqual(result1.headers_original.get('Connection'),
> > >>> 'keep-alive')
> > >>>
> > >>>           sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
> > >>> localhost\r\nConnection: keep-alive\r\n'
> > >>>                        b'Content-Length: ' + body_len + b'\r\nExpect:
> > >>> 100-continue\r\n\r\n')
> > >>>           # Client may have a short timeout waiting on that 100 Continue
> > >>>           # and basically immediately send its body
> > >>>           sock.sendall(req_body)
> > >>>           result2 = read_http(sock)
> > >>>           self.assertEqual(b'second response', result2.body)
> > >>>           self.assertEqual(result2.headers_original.get('Connection'),
> > >>> 'close')
> > >>>
> > >>>   >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
> > >>> localhost\r\nConnection: close\r\n\r\n')
> > >>>
> > >>> tests/wsgi_test.py:648:
> > >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > >>> _ _ _ _
> > >>> eventlet/greenio/base.py:407: in sendall
> > >>>       tail = self.send(data, flags)
> > >>> eventlet/greenio/base.py:401: in send
> > >>>       return self._send_loop(self.fd.send, data, flags)
> > >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > >>> _ _ _ _
> > >>>
> > >>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
> > >>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
> > >>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection:
> > >>> close\r\n\r\n'
> > >>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
> > >>>
> > >>>       def _send_loop(self, send_method, data, *args):
> > >>>           if self.act_non_blocking:
> > >>>               return send_method(data, *args)
> > >>>
> > >>>           _timeout_exc = socket_timeout('timed out')
> > >>>           while True:
> > >>>               try:
> > >>>   >               return send_method(data, *args)
> > >>> E               BrokenPipeError: [Errno 32] Broken pipe
> > >>>
> > >>> eventlet/greenio/base.py:388: BrokenPipeError
> > >>> ====================
> > >>>
> > >>> Reverting this revert on the top of 5.19 solves the issue.
> > >>>
> > >>> Any ideas?
> > >>
> > >> Interesting. This revert should return the kernel back to the delayed
> > >> ACK behavior it had for many years before May 2019 and Linux 5.1,
> > >> which contains the commit it is reverting:
> > >>
> > >>    4a41f453bedfd tcp: change pingpong threshold to 3
> > >>
> > >> It sounds like perhaps this test you mention has an implicit
> > >> dependence on the timing of delayed ACKs.
> > >>
> > >> A few questions:
> > >
> > > Dunno. I am only an openSUSE kernel maintainer and this popped out at
> > > me. Feel free to dig to eventlet's sources on your own :P.
> >
> > Any updates on this or should I send a revert directly?
> >
> > The "before() &&" part of the patch makes the difference. That is this diff:
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
> >           * and it is a reply for ato after last received packet,
> >           * increase pingpong count.
> >           */
> > -       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > -           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > +       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u
> > ping=%u\n",
> > +                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
> > +                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
> > +                       inet_csk(sk)->icsk_ack.pingpong);
> > +       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
> >                  inet_csk_inc_pingpong_cnt(sk);
> > +               pr_info("\tINC ping=%u before=%u\n",
> > +                               inet_csk(sk)->icsk_ack.pingpong,
> > +                               before(tp->lsndtime,
> > icsk->icsk_ack.lrcvtime));
> > +       }
> >
> >          tp->lsndtime = now;
> >   }
> >
> > makes it work again, and outputs this:

Is the above patch made on top of my reverted patch? It seems not
according to this part of diff.
Then what is the definition of TCP_PINGPONG_THRESH in the working
case? I think that is the key, regardless of the result of:
    before(tp->lsndtime, icsk->icsk_ack.lrcvtime)

I tried to look into what exactly the test is doing, and can't tell
why it is failing. I don't see any check that is based on the timing
of the reply. :(
I hope someone could explain more about what this test is doing.

> >
> >  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> > (100007f0100007f:e858b18b) now=4294902140 lsndtime=4294902140
> > lrcvtime=4294902140 ping=0
> >  > TCP: tcp_event_data_sent: sk=00000000a4becf82
> > (100007f0100007f:8bb158e8) now=4294902143 lsndtime=4294902140
> > lrcvtime=4294902142 ping=0
> >  > TCP:     INC ping=1 before=1
> >  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> > (100007f0100007f:e858b18b) now=4294902145 lsndtime=4294902140
> > lrcvtime=4294902144 ping=0
> >  > TCP:     INC ping=1 before=1
> >  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> > (100007f0100007f:e858b18b) now=4294902147 lsndtime=4294902145
> > lrcvtime=4294902144 ping=1
> >  > TCP:     INC ping=2 before=0
> >
> > IMO, this "before=0" is the "source" of the problem. But I have no idea
> > what this means at all...
> >
> >  > TCP: tcp_event_data_sent: sk=00000000a4becf82
> > (100007f0100007f:8bb158e8) now=4294902149 lsndtime=4294902143
> > lrcvtime=4294902148 ping=1
> >  > TCP:     INC ping=2 before=1
> >  > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> > (100007f0100007f:e858b18b) now=4294902151 lsndtime=4294902147
> > lrcvtime=4294902150 ping=3
> >  > TCP:     INC ping=4 before=1
> >  > TCP: tcp_event_data_sent: sk=00000000c7a417e9
> > (100007f0100007f:e85ab18b) now=4294902153 lsndtime=4294902153
> > lrcvtime=4294902153 ping=0
> >  > TCP: tcp_event_data_sent: sk=000000008681183e
> > (100007f0100007f:8bb15ae8) now=4294902155 lsndtime=4294902153
> > lrcvtime=4294902154 ping=0
> >  > TCP:     INC ping=1 before=1
>
> It sounds like this test has a very specific dependence on the buggy
> delayed ACK timing behavior from the buggy commit
> 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
>
> IMHO I don't think we can revert a kernel bug fix based on a test that
> decided to depend on the exact timing of delayed ACKs during a time
> when that delayed ACK behavior was buggy. :-)
>
> best regards,
> neal
Neal Cardwell Aug. 15, 2022, 7:19 p.m. UTC | #10
=




On Mon, Aug 15, 2022 at 2:54 PM Wei Wang <weiwan@google.com> wrote:
>
> On Mon, Aug 15, 2022 at 6:30 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Aug 15, 2022 at 3:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> > >
> > > On 06. 08. 22, 16:41, Jiri Slaby wrote:
> > > > On 06. 08. 22, 13:24, Neal Cardwell wrote:
> > > >> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> > > >>>
> > > >>> On 21. 07. 22, 22:44, Wei Wang wrote:
> > > >>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> > > >>>>
> > > >>>> This to-be-reverted commit was meant to apply a stricter rule for the
> > > >>>> stack to enter pingpong mode. However, the condition used to check for
> > > >>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> > > >>>> jiffy based and might be too coarse, which delays the stack entering
> > > >>>> pingpong mode.
> > > >>>> We revert this patch so that we no longer use the above condition to
> > > >>>> determine interactive session, and also reduce pingpong threshold to 1.
> > > >>>>
> > > >>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> > > >>>> Reported-by: LemmyHuang <hlm3280@163.com>
> > > >>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
> > > >>>> Signed-off-by: Wei Wang <weiwan@google.com>
> > > >>>
> > > >>>
> > > >>> This breaks python-eventlet [1] (and was backported to stable trees):
> > > >>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
> > > >>> _________________
> > > >>>
> > > >>> self = <tests.wsgi_test.TestHttpd
> > > >>> testMethod=test_018b_http_10_keepalive_framing>
> > > >>>
> > > >>>       def test_018b_http_10_keepalive_framing(self):
> > > >>>           # verify that if an http/1.0 client sends connection:
> > > >>> keep-alive
> > > >>>           # that we don't mangle the request framing if the app doesn't
> > > >>> read the request
> > > >>>           def app(environ, start_response):
> > > >>>               resp_body = {
> > > >>>                   '/1': b'first response',
> > > >>>                   '/2': b'second response',
> > > >>>                   '/3': b'third response',
> > > >>>               }.get(environ['PATH_INFO'])
> > > >>>               if resp_body is None:
> > > >>>                   resp_body = 'Unexpected path: ' + environ['PATH_INFO']
> > > >>>                   if six.PY3:
> > > >>>                       resp_body = resp_body.encode('latin1')
> > > >>>               # Never look at wsgi.input!
> > > >>>               start_response('200 OK', [('Content-type', 'text/plain')])
> > > >>>               return [resp_body]
> > > >>>
> > > >>>           self.site.application = app
> > > >>>           sock = eventlet.connect(self.server_addr)
> > > >>>           req_body = b'GET /tricksy HTTP/1.1\r\n'
> > > >>>           body_len = str(len(req_body)).encode('ascii')
> > > >>>
> > > >>>           sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
> > > >>> localhost\r\nConnection: keep-alive\r\n'
> > > >>>                        b'Content-Length: ' + body_len + b'\r\n\r\n' +
> > > >>> req_body)
> > > >>>           result1 = read_http(sock)
> > > >>>           self.assertEqual(b'first response', result1.body)
> > > >>>           self.assertEqual(result1.headers_original.get('Connection'),
> > > >>> 'keep-alive')
> > > >>>
> > > >>>           sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
> > > >>> localhost\r\nConnection: keep-alive\r\n'
> > > >>>                        b'Content-Length: ' + body_len + b'\r\nExpect:
> > > >>> 100-continue\r\n\r\n')
> > > >>>           # Client may have a short timeout waiting on that 100 Continue
> > > >>>           # and basically immediately send its body
> > > >>>           sock.sendall(req_body)
> > > >>>           result2 = read_http(sock)
> > > >>>           self.assertEqual(b'second response', result2.body)
> > > >>>           self.assertEqual(result2.headers_original.get('Connection'),
> > > >>> 'close')
> > > >>>
> > > >>>   >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
> > > >>> localhost\r\nConnection: close\r\n\r\n')
> > > >>>
> > > >>> tests/wsgi_test.py:648:
> > > >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > > >>> _ _ _ _
> > > >>> eventlet/greenio/base.py:407: in sendall
> > > >>>       tail = self.send(data, flags)
> > > >>> eventlet/greenio/base.py:401: in send
> > > >>>       return self._send_loop(self.fd.send, data, flags)
> > > >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > > >>> _ _ _ _
> > > >>>
> > > >>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
> > > >>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
> > > >>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection:
> > > >>> close\r\n\r\n'
> > > >>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
> > > >>>
> > > >>>       def _send_loop(self, send_method, data, *args):
> > > >>>           if self.act_non_blocking:
> > > >>>               return send_method(data, *args)
> > > >>>
> > > >>>           _timeout_exc = socket_timeout('timed out')
> > > >>>           while True:
> > > >>>               try:
> > > >>>   >               return send_method(data, *args)
> > > >>> E               BrokenPipeError: [Errno 32] Broken pipe
> > > >>>
> > > >>> eventlet/greenio/base.py:388: BrokenPipeError
> > > >>> ====================
> > > >>>
> > > >>> Reverting this revert on the top of 5.19 solves the issue.
> > > >>>
> > > >>> Any ideas?
> > > >>
> > > >> Interesting. This revert should return the kernel back to the delayed
> > > >> ACK behavior it had for many years before May 2019 and Linux 5.1,
> > > >> which contains the commit it is reverting:
> > > >>
> > > >>    4a41f453bedfd tcp: change pingpong threshold to 3
> > > >>
> > > >> It sounds like perhaps this test you mention has an implicit
> > > >> dependence on the timing of delayed ACKs.
> > > >>
> > > >> A few questions:
> > > >
> > > > Dunno. I am only an openSUSE kernel maintainer and this popped out at
> > > > me. Feel free to dig to eventlet's sources on your own :P.
> > >
> > > Any updates on this or should I send a revert directly?
> > >
> > > The "before() &&" part of the patch makes the difference. That is this diff:
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
> > >           * and it is a reply for ato after last received packet,
> > >           * increase pingpong count.
> > >           */
> > > -       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > > -           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > > +       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u
> > > ping=%u\n",
> > > +                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
> > > +                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
> > > +                       inet_csk(sk)->icsk_ack.pingpong);
> > > +       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > > +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
> > >                  inet_csk_inc_pingpong_cnt(sk);
> > > +               pr_info("\tINC ping=%u before=%u\n",
> > > +                               inet_csk(sk)->icsk_ack.pingpong,
> > > +                               before(tp->lsndtime,
> > > icsk->icsk_ack.lrcvtime));
> > > +       }
> > >
> > >          tp->lsndtime = now;
> > >   }
> > >
> > > makes it work again, and outputs this:
>
> Is the above patch made on top of my reverted patch? It seems not
> according to this part of diff.
> Then what is the definition of TCP_PINGPONG_THRESH in the working
> case? I think that is the key, regardless of the result of:
>     before(tp->lsndtime, icsk->icsk_ack.lrcvtime)
>
> I tried to look into what exactly the test is doing, and can't tell
> why it is failing. I don't see any check that is based on the timing
> of the reply. :(
> I hope someone could explain more about what this test is doing.

Yes, the test case is a bit hard to read.

I have a conjecture about what might be going wrong: the eventlet
client code connect() method in
https://github.com/eventlet/eventlet/blob/master/eventlet/green/http/client.py
uses:
          self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
to disable Nagle's algorithm.

However, I don't see anything on the server side in
https://github.com/eventlet/eventlet/blob/master/eventlet/green/http/server.py
disabling Nagle's algorithm.

AFAICT the server code is using TCPServer, so from skimming
  https://github.com/python/cpython/blob/3.10/Lib/socketserver.py
...it seems that the code should probably be setting
disable_nagle_algorithm to True to disable Nagle's algorithm
(disable_nagle_algorithm defaults to False).

If the server code is indeed leaving Nagle's algorithm enabled (the
TCP default), as I suspect here, that's a classic
userspace/application bug, which  means the server performance can be
stalled by waiting for delayed ACKs from the remote side, which means
changes in delayed ACK timing behavior (such as from this kernel
patch) can cause the test to fail.

The appropriate fix is probably to have the eventlet server code use
disable_nagle_algorithm=True.

Just a conjecture. :-)

cheers,
neal
Jiri Slaby Aug. 16, 2022, 5:48 a.m. UTC | #11
Cc eventlet guys + Linus.

On 15. 08. 22, 15:30, Neal Cardwell wrote:
> On Mon, Aug 15, 2022 at 3:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>>
>> On 06. 08. 22, 16:41, Jiri Slaby wrote:
>>> On 06. 08. 22, 13:24, Neal Cardwell wrote:
>>>> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>>>>>
>>>>> On 21. 07. 22, 22:44, Wei Wang wrote:
>>>>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
>>>>>>
>>>>>> This to-be-reverted commit was meant to apply a stricter rule for the
>>>>>> stack to enter pingpong mode. However, the condition used to check for
>>>>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
>>>>>> jiffy based and might be too coarse, which delays the stack entering
>>>>>> pingpong mode.
>>>>>> We revert this patch so that we no longer use the above condition to
>>>>>> determine interactive session, and also reduce pingpong threshold to 1.
>>>>>>
>>>>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
>>>>>> Reported-by: LemmyHuang <hlm3280@163.com>
>>>>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
>>>>>> Signed-off-by: Wei Wang <weiwan@google.com>
>>>>>
>>>>>
>>>>> This breaks python-eventlet [1] (and was backported to stable trees):
>>>>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
>>>>> _________________
>>>>>
>>>>> self = <tests.wsgi_test.TestHttpd
>>>>> testMethod=test_018b_http_10_keepalive_framing>
>>>>>
>>>>>        def test_018b_http_10_keepalive_framing(self):
>>>>>            # verify that if an http/1.0 client sends connection:
>>>>> keep-alive
>>>>>            # that we don't mangle the request framing if the app doesn't
>>>>> read the request
>>>>>            def app(environ, start_response):
>>>>>                resp_body = {
>>>>>                    '/1': b'first response',
>>>>>                    '/2': b'second response',
>>>>>                    '/3': b'third response',
>>>>>                }.get(environ['PATH_INFO'])
>>>>>                if resp_body is None:
>>>>>                    resp_body = 'Unexpected path: ' + environ['PATH_INFO']
>>>>>                    if six.PY3:
>>>>>                        resp_body = resp_body.encode('latin1')
>>>>>                # Never look at wsgi.input!
>>>>>                start_response('200 OK', [('Content-type', 'text/plain')])
>>>>>                return [resp_body]
>>>>>
>>>>>            self.site.application = app
>>>>>            sock = eventlet.connect(self.server_addr)
>>>>>            req_body = b'GET /tricksy HTTP/1.1\r\n'
>>>>>            body_len = str(len(req_body)).encode('ascii')
>>>>>
>>>>>            sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
>>>>> localhost\r\nConnection: keep-alive\r\n'
>>>>>                         b'Content-Length: ' + body_len + b'\r\n\r\n' +
>>>>> req_body)
>>>>>            result1 = read_http(sock)
>>>>>            self.assertEqual(b'first response', result1.body)
>>>>>            self.assertEqual(result1.headers_original.get('Connection'),
>>>>> 'keep-alive')
>>>>>
>>>>>            sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
>>>>> localhost\r\nConnection: keep-alive\r\n'
>>>>>                         b'Content-Length: ' + body_len + b'\r\nExpect:
>>>>> 100-continue\r\n\r\n')
>>>>>            # Client may have a short timeout waiting on that 100 Continue
>>>>>            # and basically immediately send its body
>>>>>            sock.sendall(req_body)
>>>>>            result2 = read_http(sock)
>>>>>            self.assertEqual(b'second response', result2.body)
>>>>>            self.assertEqual(result2.headers_original.get('Connection'),
>>>>> 'close')
>>>>>
>>>>>    >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
>>>>> localhost\r\nConnection: close\r\n\r\n')
>>>>>
>>>>> tests/wsgi_test.py:648:
>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>> _ _ _ _
>>>>> eventlet/greenio/base.py:407: in sendall
>>>>>        tail = self.send(data, flags)
>>>>> eventlet/greenio/base.py:401: in send
>>>>>        return self._send_loop(self.fd.send, data, flags)
>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>> _ _ _ _
>>>>>
>>>>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
>>>>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
>>>>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection:
>>>>> close\r\n\r\n'
>>>>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
>>>>>
>>>>>        def _send_loop(self, send_method, data, *args):
>>>>>            if self.act_non_blocking:
>>>>>                return send_method(data, *args)
>>>>>
>>>>>            _timeout_exc = socket_timeout('timed out')
>>>>>            while True:
>>>>>                try:
>>>>>    >               return send_method(data, *args)
>>>>> E               BrokenPipeError: [Errno 32] Broken pipe
>>>>>
>>>>> eventlet/greenio/base.py:388: BrokenPipeError
>>>>> ====================
>>>>>
>>>>> Reverting this revert on the top of 5.19 solves the issue.
>>>>>
>>>>> Any ideas?
>>>>
>>>> Interesting. This revert should return the kernel back to the delayed
>>>> ACK behavior it had for many years before May 2019 and Linux 5.1,
>>>> which contains the commit it is reverting:
>>>>
>>>>     4a41f453bedfd tcp: change pingpong threshold to 3
>>>>
>>>> It sounds like perhaps this test you mention has an implicit
>>>> dependence on the timing of delayed ACKs.
>>>>
>>>> A few questions:
>>>
>>> Dunno. I am only an openSUSE kernel maintainer and this popped out at
>>> me. Feel free to dig to eventlet's sources on your own :P.
>>
>> Any updates on this or should I send a revert directly?
>>
>> The "before() &&" part of the patch makes the difference. That is this diff:
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
>>            * and it is a reply for ato after last received packet,
>>            * increase pingpong count.
>>            */
>> -       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
>> -           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
>> +       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u
>> ping=%u\n",
>> +                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
>> +                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
>> +                       inet_csk(sk)->icsk_ack.pingpong);
>> +       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
>> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
>>                   inet_csk_inc_pingpong_cnt(sk);
>> +               pr_info("\tINC ping=%u before=%u\n",
>> +                               inet_csk(sk)->icsk_ack.pingpong,
>> +                               before(tp->lsndtime,
>> icsk->icsk_ack.lrcvtime));
>> +       }
>>
>>           tp->lsndtime = now;
>>    }
>>
>> makes it work again, and outputs this:
>>
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902140 lsndtime=4294902140
>> lrcvtime=4294902140 ping=0
>>   > TCP: tcp_event_data_sent: sk=00000000a4becf82
>> (100007f0100007f:8bb158e8) now=4294902143 lsndtime=4294902140
>> lrcvtime=4294902142 ping=0
>>   > TCP:     INC ping=1 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902145 lsndtime=4294902140
>> lrcvtime=4294902144 ping=0
>>   > TCP:     INC ping=1 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902147 lsndtime=4294902145
>> lrcvtime=4294902144 ping=1
>>   > TCP:     INC ping=2 before=0
>>
>> IMO, this "before=0" is the "source" of the problem. But I have no idea
>> what this means at all...
>>
>>   > TCP: tcp_event_data_sent: sk=00000000a4becf82
>> (100007f0100007f:8bb158e8) now=4294902149 lsndtime=4294902143
>> lrcvtime=4294902148 ping=1
>>   > TCP:     INC ping=2 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902151 lsndtime=4294902147
>> lrcvtime=4294902150 ping=3
>>   > TCP:     INC ping=4 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000c7a417e9
>> (100007f0100007f:e85ab18b) now=4294902153 lsndtime=4294902153
>> lrcvtime=4294902153 ping=0
>>   > TCP: tcp_event_data_sent: sk=000000008681183e
>> (100007f0100007f:8bb15ae8) now=4294902155 lsndtime=4294902153
>> lrcvtime=4294902154 ping=0
>>   > TCP:     INC ping=1 before=1
> 
> It sounds like this test has a very specific dependence on the buggy
> delayed ACK timing behavior from the buggy commit
> 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> 
> IMHO I don't think we can revert a kernel bug fix based on a test that
> decided to depend on the exact timing of delayed ACKs during a time
> when that delayed ACK behavior was buggy. :-)

Unfortunately despite the test is likely bogus (I am unable to say it is 
or not), it does happen and the patch (the revert -- 4d8f24eeedc) breaks 
userspace. I'd say this is exactly the case where we apply "we do not 
break userspace". But I might be wrong as we might not care about silly 
tests.

In anyway, openSUSE has to have the patch (the revert) reverted, so that 
the distro actually builds/works. (Until this is fixed on the eventlet 
side at least. And more importantly _until_ it propagates to distros or 
is fixed otherwise (like disabling the test).) And I suppose other 
distros would have to do the same. That is quite unfortunate :/.

thanks,
Neal Cardwell Aug. 16, 2022, 10:19 p.m. UTC | #12
On Tue, Aug 16, 2022 at 1:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>
> Cc eventlet guys + Linus.
>
> On 15. 08. 22, 15:30, Neal Cardwell wrote:
> > On Mon, Aug 15, 2022 at 3:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> >>
> >> On 06. 08. 22, 16:41, Jiri Slaby wrote:
> >>> On 06. 08. 22, 13:24, Neal Cardwell wrote:
> >>>> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
> >>>>>
> >>>>> On 21. 07. 22, 22:44, Wei Wang wrote:
> >>>>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> >>>>>>
> >>>>>> This to-be-reverted commit was meant to apply a stricter rule for the
> >>>>>> stack to enter pingpong mode. However, the condition used to check for
> >>>>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> >>>>>> jiffy based and might be too coarse, which delays the stack entering
> >>>>>> pingpong mode.
> >>>>>> We revert this patch so that we no longer use the above condition to
> >>>>>> determine interactive session, and also reduce pingpong threshold to 1.
> >>>>>>
> >>>>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
> >>>>>> Reported-by: LemmyHuang <hlm3280@163.com>
> >>>>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
> >>>>>> Signed-off-by: Wei Wang <weiwan@google.com>
> >>>>>
> >>>>>
> >>>>> This breaks python-eventlet [1] (and was backported to stable trees):
> >>>>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
> >>>>> _________________
> >>>>>
> >>>>> self = <tests.wsgi_test.TestHttpd
> >>>>> testMethod=test_018b_http_10_keepalive_framing>
> >>>>>
> >>>>>        def test_018b_http_10_keepalive_framing(self):
> >>>>>            # verify that if an http/1.0 client sends connection:
> >>>>> keep-alive
> >>>>>            # that we don't mangle the request framing if the app doesn't
> >>>>> read the request
> >>>>>            def app(environ, start_response):
> >>>>>                resp_body = {
> >>>>>                    '/1': b'first response',
> >>>>>                    '/2': b'second response',
> >>>>>                    '/3': b'third response',
> >>>>>                }.get(environ['PATH_INFO'])
> >>>>>                if resp_body is None:
> >>>>>                    resp_body = 'Unexpected path: ' + environ['PATH_INFO']
> >>>>>                    if six.PY3:
> >>>>>                        resp_body = resp_body.encode('latin1')
> >>>>>                # Never look at wsgi.input!
> >>>>>                start_response('200 OK', [('Content-type', 'text/plain')])
> >>>>>                return [resp_body]
> >>>>>
> >>>>>            self.site.application = app
> >>>>>            sock = eventlet.connect(self.server_addr)
> >>>>>            req_body = b'GET /tricksy HTTP/1.1\r\n'
> >>>>>            body_len = str(len(req_body)).encode('ascii')
> >>>>>
> >>>>>            sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
> >>>>> localhost\r\nConnection: keep-alive\r\n'
> >>>>>                         b'Content-Length: ' + body_len + b'\r\n\r\n' +
> >>>>> req_body)
> >>>>>            result1 = read_http(sock)
> >>>>>            self.assertEqual(b'first response', result1.body)
> >>>>>            self.assertEqual(result1.headers_original.get('Connection'),
> >>>>> 'keep-alive')
> >>>>>
> >>>>>            sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
> >>>>> localhost\r\nConnection: keep-alive\r\n'
> >>>>>                         b'Content-Length: ' + body_len + b'\r\nExpect:
> >>>>> 100-continue\r\n\r\n')
> >>>>>            # Client may have a short timeout waiting on that 100 Continue
> >>>>>            # and basically immediately send its body
> >>>>>            sock.sendall(req_body)
> >>>>>            result2 = read_http(sock)
> >>>>>            self.assertEqual(b'second response', result2.body)
> >>>>>            self.assertEqual(result2.headers_original.get('Connection'),
> >>>>> 'close')
> >>>>>
> >>>>>    >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
> >>>>> localhost\r\nConnection: close\r\n\r\n')
> >>>>>
> >>>>> tests/wsgi_test.py:648:
> >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>> _ _ _ _
> >>>>> eventlet/greenio/base.py:407: in sendall
> >>>>>        tail = self.send(data, flags)
> >>>>> eventlet/greenio/base.py:401: in send
> >>>>>        return self._send_loop(self.fd.send, data, flags)
> >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>> _ _ _ _
> >>>>>
> >>>>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
> >>>>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
> >>>>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection:
> >>>>> close\r\n\r\n'
> >>>>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
> >>>>>
> >>>>>        def _send_loop(self, send_method, data, *args):
> >>>>>            if self.act_non_blocking:
> >>>>>                return send_method(data, *args)
> >>>>>
> >>>>>            _timeout_exc = socket_timeout('timed out')
> >>>>>            while True:
> >>>>>                try:
> >>>>>    >               return send_method(data, *args)
> >>>>> E               BrokenPipeError: [Errno 32] Broken pipe
> >>>>>
> >>>>> eventlet/greenio/base.py:388: BrokenPipeError
> >>>>> ====================
> >>>>>
> >>>>> Reverting this revert on the top of 5.19 solves the issue.
> >>>>>
> >>>>> Any ideas?
> >>>>
> >>>> Interesting. This revert should return the kernel back to the delayed
> >>>> ACK behavior it had for many years before May 2019 and Linux 5.1,
> >>>> which contains the commit it is reverting:
> >>>>
> >>>>     4a41f453bedfd tcp: change pingpong threshold to 3
> >>>>
> >>>> It sounds like perhaps this test you mention has an implicit
> >>>> dependence on the timing of delayed ACKs.
> >>>>
> >>>> A few questions:
> >>>
> >>> Dunno. I am only an openSUSE kernel maintainer and this popped out at
> >>> me. Feel free to dig to eventlet's sources on your own :P.
> >>
> >> Any updates on this or should I send a revert directly?
> >>
> >> The "before() &&" part of the patch makes the difference. That is this diff:
> >> --- a/net/ipv4/tcp_output.c
> >> +++ b/net/ipv4/tcp_output.c
> >> @@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
> >>            * and it is a reply for ato after last received packet,
> >>            * increase pingpong count.
> >>            */
> >> -       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> >> -           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> >> +       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u
> >> ping=%u\n",
> >> +                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
> >> +                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
> >> +                       inet_csk(sk)->icsk_ack.pingpong);
> >> +       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> >> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
> >>                   inet_csk_inc_pingpong_cnt(sk);
> >> +               pr_info("\tINC ping=%u before=%u\n",
> >> +                               inet_csk(sk)->icsk_ack.pingpong,
> >> +                               before(tp->lsndtime,
> >> icsk->icsk_ack.lrcvtime));
> >> +       }
> >>
> >>           tp->lsndtime = now;
> >>    }
> >>
> >> makes it work again, and outputs this:
> >>
> >>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> >> (100007f0100007f:e858b18b) now=4294902140 lsndtime=4294902140
> >> lrcvtime=4294902140 ping=0
> >>   > TCP: tcp_event_data_sent: sk=00000000a4becf82
> >> (100007f0100007f:8bb158e8) now=4294902143 lsndtime=4294902140
> >> lrcvtime=4294902142 ping=0
> >>   > TCP:     INC ping=1 before=1
> >>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> >> (100007f0100007f:e858b18b) now=4294902145 lsndtime=4294902140
> >> lrcvtime=4294902144 ping=0
> >>   > TCP:     INC ping=1 before=1
> >>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> >> (100007f0100007f:e858b18b) now=4294902147 lsndtime=4294902145
> >> lrcvtime=4294902144 ping=1
> >>   > TCP:     INC ping=2 before=0
> >>
> >> IMO, this "before=0" is the "source" of the problem. But I have no idea
> >> what this means at all...
> >>
> >>   > TCP: tcp_event_data_sent: sk=00000000a4becf82
> >> (100007f0100007f:8bb158e8) now=4294902149 lsndtime=4294902143
> >> lrcvtime=4294902148 ping=1
> >>   > TCP:     INC ping=2 before=1
> >>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
> >> (100007f0100007f:e858b18b) now=4294902151 lsndtime=4294902147
> >> lrcvtime=4294902150 ping=3
> >>   > TCP:     INC ping=4 before=1
> >>   > TCP: tcp_event_data_sent: sk=00000000c7a417e9
> >> (100007f0100007f:e85ab18b) now=4294902153 lsndtime=4294902153
> >> lrcvtime=4294902153 ping=0
> >>   > TCP: tcp_event_data_sent: sk=000000008681183e
> >> (100007f0100007f:8bb15ae8) now=4294902155 lsndtime=4294902153
> >> lrcvtime=4294902154 ping=0
> >>   > TCP:     INC ping=1 before=1
> >
> > It sounds like this test has a very specific dependence on the buggy
> > delayed ACK timing behavior from the buggy commit
> > 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> >
> > IMHO I don't think we can revert a kernel bug fix based on a test that
> > decided to depend on the exact timing of delayed ACKs during a time
> > when that delayed ACK behavior was buggy. :-)
>
> Unfortunately despite the test is likely bogus (I am unable to say it is
> or not), it does happen and the patch (the revert -- 4d8f24eeedc) breaks
> userspace. I'd say this is exactly the case where we apply "we do not
> break userspace". But I might be wrong as we might not care about silly
> tests.
>
> In anyway, openSUSE has to have the patch (the revert) reverted, so that
> the distro actually builds/works. (Until this is fixed on the eventlet
> side at least. And more importantly _until_ it propagates to distros or
> is fixed otherwise (like disabling the test).) And I suppose other
> distros would have to do the same. That is quite unfortunate :/.
>
> thanks,
> --
> js
> suse labs

Thanks, Jiri.

Here are some notes from an analysis of this test and its interaction
with the kernel, along with a suggested next step:

A summary:
-------------------

The python test:

  https://github.com/eventlet/eventlet/blob/890f320bbb56f915f128e5ed914be5e2a34b26d9/tests/wsgi_test.py

is a test of a python networking library:

  https://github.com/eventlet/eventlet

This user-space test has a race, and the change in the timing of TCP
ACK packets in v5.19 has uncovered this race and caused the test to
fail. But the change in the timing of the ACK is valid, and restores
ACK behavior that was in the kernel for a long time: at least between
the beginning of kernel git history in 2005 and 4d8f24eeedc5 (Revert
"tcp: change pingpong threshold to 3") in v5.1 in May 2019.

Slightly more details: The race in the test causes the pass/fail
outcome of the test to depend on the exact timing of TCP ACKs. The ACK
of the PUT /2 being delayed several hundred microseconds (delayed ACK
behavior for kernel v5.19 and kernels up through v5.0), in combination
with the client's Nagle behavior, delays the sending of the client PUT
/2 payload data, causing the payload data to be unread in the socket
when the server closes the connection in this test case, causing the
server to send a RST sooner than the client code is expecting (after
the PUT /2 payload data rather than after the PUT /3 request). This
causes the test to receive an unexpected "BrokenPipeError: [Errno 32]
Broken pipe" exception here, in line 648:

https://github.com/eventlet/eventlet/blob/890f320bbb56f915f128e5ed914be5e2a34b26d9/tests/wsgi_test.py#L648

IMHO the best solution here is to tweak the test code so that it does
not race and depend on the exact timing of TCP ACKs. One possible way
to achieve this would be to have the client TCP connection use
setsockopt(TCP_NODELAY) so that the body for the PUT /2 request is
transmitted immediately, whether or not the server delays the ACK of
the PUT /2 headers.

Detailed analysis:
-------------------

The client has apparently left the default Nagle algorithm behavior
enabled (did not use setsockopt(TCP_NODELAY), and so allows at most
one sub-MSS packet outstanding/un-ACKed, and so is sensitive to
delayed ACK behavior.

(1) With the behavior from 4a41f453bedf ("tcp: change pingpong
threshold to 3") in kernel 5.1 to 5.18 (passing test):

- The ACK of 'PUT /2' is immediate, so the "GET /tricksy" (HTTP
  content payload for the PUT /2") goes out quickly.

- Because the 'GET /tricksy' goes out quickly, the server probably
  reads it out of its receive buffer while reading the PUT itself.

- So when the server process closes the socket there is no data in the
  receive buffer, so the server socket does not sent a RST

- By the time the client test code has called sock.sendall(b'PUT
  /3...)  the client socket has received the FIN, and no RST has
  arrived yet, so the client gets the expected ConnectionClosed
  exception

(2) With the 4d8f24eeedc5 (Revert "tcp: change pingpong threshold to
  3") commit and the behavior from kernels up through v5.0 and now
  also 5.19 (failing test):

- ACKs are delayed, so the server's response to the PUT carries the
  ACK that triggers the client to sent the GET /tricksy.

- The server app code triggers the server to close the socket when the
  'GET /tricksy' is still in the server's receive buffer, so when the
  server closes the socket the server connection generates a RST.

- By the time the client test code has called sock.sendall(b'PUT /3')
  the client socket has received the server's FIN *and* the RST, so
  the client gets an unexpected:
         BrokenPipeError: [Errno 32] Broken pipe

tcpdump traces:
---------------

(1) passing test (kernel based on v5.10.31):

19:38:11.443592 IP 127.0.0.1.49786 > 127.0.0.1.39751: Flags [P.], seq
104:206, ack 141, win 512, options [nop,nop,TS val 981850760 ecr
981850760], length 102
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 009a 69c1 4000 4006 d29a 7f00 0001  E...i.@.@.......
        0x0020:  7f00 0001 c27a 9b47 142a 4bcf 389c f6cf  .....z.G.*K.8...
        0x0030:  8018 0200 fe8e 0000 0101 080a 3a85 da88  ............:...
        0x0040:  3a85 da88 5055 5420 2f32 2048 5454 502f  :...PUT./2.HTTP/
        0x0050:  312e 300d 0a48 6f73 743a 206c 6f63 616c  1.0..Host:.local
        0x0060:  686f 7374 0d0a 436f 6e6e 6563 7469 6f6e  host..Connection
        0x0070:  3a20 6b65 6570 2d61 6c69 7665 0d0a 436f  :.keep-alive..Co
        0x0080:  6e74 656e 742d 4c65 6e67 7468 3a20 3233  ntent-Length:.23
        0x0090:  0d0a 4578 7065 6374 3a20 3130 302d 636f  ..Expect:.100-co
        0x00a0:  6e74 696e 7565 0d0a 0d0a                 ntinue....
19:38:11.443597 IP 127.0.0.1.39751 > 127.0.0.1.49786: Flags [.], ack
206, win 512, options [nop,nop,TS val 981850760 ecr 981850760], length
0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0034 050f 4000 4006 37b3 7f00 0001  E..4..@.@.7.....
        0x0020:  7f00 0001 9b47 c27a 389c f6cf 142a 4c35  .....G.z8....*L5
        0x0030:  8010 0200 fe28 0000 0101 080a 3a85 da88  .....(......:...
        0x0040:  3a85 da88                                :...
19:38:11.443610 IP 127.0.0.1.49786 > 127.0.0.1.39751: Flags [P.], seq
206:229, ack 141, win 512, options [nop,nop,TS val 981850760 ecr
981850760], length 23
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 004b 69c2 4000 4006 d2e8 7f00 0001  E..Ki.@.@.......
        0x0020:  7f00 0001 c27a 9b47 142a 4c35 389c f6cf  .....z.G.*L58...
        0x0030:  8018 0200 fe3f 0000 0101 080a 3a85 da88  .....?......:...
        0x0040:  3a85 da88 4745 5420 2f74 7269 636b 7379  :...GET./tricksy
        0x0050:  2048 5454 502f 312e 310d 0a              .HTTP/1.1..
19:38:11.443613 IP 127.0.0.1.39751 > 127.0.0.1.49786: Flags [.], ack
229, win 512, options [nop,nop,TS val 981850760 ecr 981850760], length
0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0034 0510 4000 4006 37b2 7f00 0001  E..4..@.@.7.....
        0x0020:  7f00 0001 9b47 c27a 389c f6cf 142a 4c4c  .....G.z8....*LL
        0x0030:  8010 0200 fe28 0000 0101 080a 3a85 da88  .....(......:...
        0x0040:  3a85 da88                                :...
19:38:11.443921 IP 127.0.0.1.39751 > 127.0.0.1.49786: Flags [P.], seq
141:277, ack 229, win 512, options [nop,nop,TS val 981850760 ecr
981850760], length 136
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 00bc 0511 4000 4006 3729 7f00 0001  E.....@.@.7)....
        0x0020:  7f00 0001 9b47 c27a 389c f6cf 142a 4c4c  .....G.z8....*LL
        0x0030:  8018 0200 feb0 0000 0101 080a 3a85 da88  ............:...
        0x0040:  3a85 da88 4854 5450 2f31 2e31 2032 3030  :...HTTP/1.1.200
        0x0050:  204f 4b0d 0a43 6f6e 7465 6e74 2d54 7970  .OK..Content-Typ
        0x0060:  653a 2074 6578 742f 706c 6169 6e0d 0a43  e:.text/plain..C
        0x0070:  6f6e 7465 6e74 2d4c 656e 6774 683a 2031  ontent-Length:.1
        0x0080:  350d 0a44 6174 653a 2054 7565 2c20 3136  5..Date:.Tue,.16
        0x0090:  2041 7567 2032 3032 3220 3139 3a33 383a  .Aug.2022.19:38:
        0x00a0:  3131 2047 4d54 0d0a 436f 6e6e 6563 7469  11.GMT..Connecti
        0x00b0:  6f6e 3a20 636c 6f73 650d 0a0d 0a73 6563  on:.close....sec
        0x00c0:  6f6e 6420 7265 7370 6f6e 7365            ond.response
19:38:11.443926 IP 127.0.0.1.49786 > 127.0.0.1.39751: Flags [.], ack
277, win 511, options [nop,nop,TS val 981850760 ecr 981850760], length
0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0034 69c3 4000 4006 d2fe 7f00 0001  E..4i.@.@.......
        0x0020:  7f00 0001 c27a 9b47 142a 4c4c 389c f757  .....z.G.*LL8..W
        0x0030:  8010 01ff fe28 0000 0101 080a 3a85 da88  .....(......:...
        0x0040:  3a85 da88                                :...
19:38:11.443990 IP 127.0.0.1.39751 > 127.0.0.1.49786: Flags [F.], seq
277, ack 229, win 512, options [nop,nop,TS val 981850760 ecr
981850760], length 0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0034 0512 4000 4006 37b0 7f00 0001  E..4..@.@.7.....
        0x0020:  7f00 0001 9b47 c27a 389c f757 142a 4c4c  .....G.z8..W.*LL
        0x0030:  8011 0200 fe28 0000 0101 080a 3a85 da88  .....(......:...
19:38:11.444132 IP 127.0.0.1.49786 > 127.0.0.1.39751: Flags [P.], seq
229:284, ack 278, win 512, options [nop,nop,TS val 981850760 ecr
981850760], length 55
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 006b 69c4 4000 4006 d2c6 7f00 0001  E..ki.@.@.......
        0x0020:  7f00 0001 c27a 9b47 142a 4c4c 389c f758  .....z.G.*LL8..X
        0x0030:  8018 0200 fe5f 0000 0101 080a 3a85 da88  ....._......:...
        0x0040:  3a85 da88 5055 5420 2f33 2048 5454 502f  :...PUT./3.HTTP/
        0x0050:  312e 300d 0a48 6f73 743a 206c 6f63 616c  1.0..Host:.local
        0x0060:  686f 7374 0d0a 436f 6e6e 6563 7469 6f6e  host..Connection
        0x0070:  3a20 636c 6f73 650d 0a0d 0a              :.close....
19:38:11.444144 IP 127.0.0.1.39751 > 127.0.0.1.49786: Flags [R], seq
949811032, win 0, length 0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0028 0000 4000 4006 3cce 7f00 0001  E..(..@.@.<.....
        0x0020:  7f00 0001 9b47 c27a 389c f758 0000 0000  .....G.z8..X....
        0x0030:  5004 0000 2427 0000                      P...$'..


(2) failing test (kernel based on v5.10.31 with a cherry-pick of 4d8f24eeedc5
(Revert "tcp: change pingpong threshold to 3"):

19:13:26.543081 IP 127.0.0.1.33382 > 127.0.0.1.45645: Flags [P.], seq
104:206, ack 141, win 512, options [nop,nop,TS val 1506020510 ecr
1506020510], length 102
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 009a 5f26 4000 4006 dd35 7f00 0001  E..._&@.@..5....
        0x0020:  7f00 0001 8266 b24d 1931 9034 880c 382e  .....f.M.1.4..8.
        0x0030:  8018 0200 fe8e 0000 0101 080a 59c4 0c9e  ............Y...
        0x0040:  59c4 0c9e 5055 5420 2f32 2048 5454 502f  Y...PUT./2.HTTP/
        0x0050:  312e 300d 0a48 6f73 743a 206c 6f63 616c  1.0..Host:.local
        0x0060:  686f 7374 0d0a 436f 6e6e 6563 7469 6f6e  host..Connection
        0x0070:  3a20 6b65 6570 2d61 6c69 7665 0d0a 436f  :.keep-alive..Co
        0x0080:  6e74 656e 742d 4c65 6e67 7468 3a20 3233  ntent-Length:.23
        0x0090:  0d0a 4578 7065 6374 3a20 3130 302d 636f  ..Expect:.100-co
        0x00a0:  6e74 696e 7565 0d0a 0d0a                 ntinue....
19:13:26.543361 IP 127.0.0.1.45645 > 127.0.0.1.33382: Flags [P.], seq
141:277, ack 206, win 512, options [nop,nop,TS val 1506020510 ecr
1506020510], length 136
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 00bc cbe9 4000 4006 7050 7f00 0001  E.....@.@.pP....
        0x0020:  7f00 0001 b24d 8266 880c 382e 1931 909a  .....M.f..8..1..
        0x0030:  8018 0200 feb0 0000 0101 080a 59c4 0c9e  ............Y...
        0x0040:  59c4 0c9e 4854 5450 2f31 2e31 2032 3030  Y...HTTP/1.1.200
        0x0050:  204f 4b0d 0a43 6f6e 7465 6e74 2d54 7970  .OK..Content-Typ
        0x0060:  653a 2074 6578 742f 706c 6169 6e0d 0a43  e:.text/plain..C
        0x0070:  6f6e 7465 6e74 2d4c 656e 6774 683a 2031  ontent-Length:.1
        0x0080:  350d 0a44 6174 653a 2054 7565 2c20 3136  5..Date:.Tue,.16
        0x0090:  2041 7567 2032 3032 3220 3139 3a31 333a  .Aug.2022.19:13:
        0x00a0:  3236 2047 4d54 0d0a 436f 6e6e 6563 7469  26.GMT..Connecti
        0x00b0:  6f6e 3a20 636c 6f73 650d 0a0d 0a73 6563  on:.close....sec
        0x00c0:  6f6e 6420 7265 7370 6f6e 7365            ond.response
19:13:26.543368 IP 127.0.0.1.33382 > 127.0.0.1.45645: Flags [P.], seq
206:229, ack 277, win 511, options [nop,nop,TS val 1506020510 ecr
1506020510], length 23
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 004b 5f27 4000 4006 dd83 7f00 0001  E..K_'@.@.......
        0x0020:  7f00 0001 8266 b24d 1931 909a 880c 38b6  .....f.M.1....8.
        0x0030:  8018 01ff fe3f 0000 0101 080a 59c4 0c9e  .....?......Y...
        0x0040:  59c4 0c9e 4745 5420 2f74 7269 636b 7379  Y...GET./tricksy
        0x0050:  2048 5454 502f 312e 310d 0a              .HTTP/1.1..
19:13:26.543418 IP 127.0.0.1.45645 > 127.0.0.1.33382: Flags [F.], seq
277, ack 229, win 512, options [nop,nop,TS val 1506020510 ecr
1506020510], length 0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0034 cbea 4000 4006 70d7 7f00 0001  E..4..@.@.p.....
        0x0020:  7f00 0001 b24d 8266 880c 38b6 1931 90b1  .....M.f..8..1..
        0x0030:  8011 0200 fe28 0000 0101 080a 59c4 0c9e  .....(......Y...
        0x0040:  59c4 0c9e                                Y...
19:13:26.543459 IP 127.0.0.1.45645 > 127.0.0.1.33382: Flags [R.], seq
278, ack 229, win 512, options [nop,nop,TS val 1506020510 ecr
1506020510], length 0
        0x0000:  0000 0304 0006 0000 0000 0000 0000 0800  ................
        0x0010:  4500 0034 cbeb 4000 4006 70d6 7f00 0001  E..4..@.@.p.....
        0x0020:  7f00 0001 b24d 8266 880c 38b7 1931 90b1  .....M.f..8..1..
        0x0030:  8014 0200 fe28 0000 0101 080a 59c4 0c9e  .....(......Y...
        0x0040:  59c4 0c9e                                Y...

best regards,
neal
Jiri Slaby Aug. 17, 2022, 6:44 a.m. UTC | #13
On 17. 08. 22, 0:19, Neal Cardwell wrote:
> IMHO the best solution here is to tweak the test code so that it does
> not race and depend on the exact timing of TCP ACKs. One possible way
> to achieve this would be to have the client TCP connection use
> setsockopt(TCP_NODELAY) so that the body for the PUT /2 request is
> transmitted immediately, whether or not the server delays the ACK of
> the PUT /2 headers.

Thanks a lot, Neal! So for the time being, until this is resolved in 
eventlet, I pushed a change to disable the test in openSUSE.

thanks,
Linus Torvalds Aug. 18, 2022, 5:50 p.m. UTC | #14
On Tue, Aug 16, 2022 at 11:44 PM Jiri Slaby <jirislaby@kernel.org> wrote:
>
> Thanks a lot, Neal! So for the time being, until this is resolved in
> eventlet, I pushed a change to disable the test in openSUSE.

Yes, that is the right approach in this case, since it really does
seem to be purely about timing.

Note that "breaks user space" is not about random test suites. For all
we know, a test suite can exist literally to check for known bugs etc.

No user space breakage is about real loads by real people, ie "I used
to do this, and now that no longer works".

The test suites showing behavioral differences are certainly always a
cause of worry - because the test may exist for a very particular
reason, and may be indicative that yes, there are actual real programs
depending on this behavior - but isn't in itself necessarily a problem
in itself.

            Linus
diff mbox series

Patch

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 85cd695e7fd1..ee88f0f1350f 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -321,7 +321,7 @@  void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
 
 struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
 
-#define TCP_PINGPONG_THRESH	3
+#define TCP_PINGPONG_THRESH	1
 
 static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
 {
@@ -338,14 +338,6 @@  static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
 	return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
 }
 
-static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
-{
-	struct inet_connection_sock *icsk = inet_csk(sk);
-
-	if (icsk->icsk_ack.pingpong < U8_MAX)
-		icsk->icsk_ack.pingpong++;
-}
-
 static inline bool inet_csk_has_ulp(struct sock *sk)
 {
 	return inet_sk(sk)->is_icsk && !!inet_csk(sk)->icsk_ulp_ops;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c38e07b50639..d06e72e141ac 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -167,16 +167,13 @@  static void tcp_event_data_sent(struct tcp_sock *tp,
 	if (tcp_packets_in_flight(tp) == 0)
 		tcp_ca_event(sk, CA_EVENT_TX_START);
 
-	/* If this is the first data packet sent in response to the
-	 * previous received data,
-	 * and it is a reply for ato after last received packet,
-	 * increase pingpong count.
-	 */
-	if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
-	    (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
-		inet_csk_inc_pingpong_cnt(sk);
-
 	tp->lsndtime = now;
+
+	/* If it is a reply for ato after last received
+	 * packet, enter pingpong mode.
+	 */
+	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
+		inet_csk_enter_pingpong_mode(sk);
 }
 
 /* Account for an ACK we sent. */