Message ID | 14459261ea9f9c7d7dfb28eb004ce8734fa83ade.1704185904.git.leonro@nvidia.com (mailing list archive) |
---|---|
State | Accepted |
Commit | b59db45d7eba9894e7834d768ec4236fca39bd7d |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] tcp: Revert no longer abort SYN_SENT when receiving some ICMP | expand |
On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > From: Shachar Kagan <skagan@nvidia.com> > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > very popular tool to manage fleet of VMs stopped to work after commit > citied in Fixes line. > > The issue appears while using Vagrant to manage nested VMs. > The steps are: > * create vagrant file > * vagrant up > * vagrant halt (VM is created but shut down) > * vagrant up - fail > I would rather have an explanation, instead of reverting a valid patch. I have been on vacation for some time. I may have missed a detailed explanation, please repost if needed.
On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > very popular tool to manage fleet of VMs stopped to work after commit > > citied in Fixes line. > > > > The issue appears while using Vagrant to manage nested VMs. > > The steps are: > > * create vagrant file > > * vagrant up > > * vagrant halt (VM is created but shut down) > > * vagrant up - fail > > > > I would rather have an explanation, instead of reverting a valid patch. > > I have been on vacation for some time. I may have missed a detailed > explanation, please repost if needed. Our detailed explanation that revert worked. You provided the patch that broke, so please let's not require from users to debug it. If you need a help to reproduce and/or test some hypothesis, Shachar will be happy to help you, just ask. Thanks
On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > very popular tool to manage fleet of VMs stopped to work after commit > > > citied in Fixes line. > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > The steps are: > > > * create vagrant file > > > * vagrant up > > > * vagrant halt (VM is created but shut down) > > > * vagrant up - fail > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > I have been on vacation for some time. I may have missed a detailed > > explanation, please repost if needed. > > Our detailed explanation that revert worked. You provided the patch that > broke, so please let's not require from users to debug it. > > If you need a help to reproduce and/or test some hypothesis, Shachar > will be happy to help you, just ask. I have asked already, and received files that showed no ICMP relevant interactions. Can someone from your team help Shachar to get a packet capture of both TCP _and_ ICMP packets ? Otherwise there is little I can do. I can not blindly trust someone that a valid patch broke something, just because 'something broke'
On Tue, Jan 2, 2024 at 11:03 AM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > > very popular tool to manage fleet of VMs stopped to work after commit > > > > citied in Fixes line. > > > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > > The steps are: > > > > * create vagrant file > > > > * vagrant up > > > > * vagrant halt (VM is created but shut down) > > > > * vagrant up - fail > > > > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > > > I have been on vacation for some time. I may have missed a detailed > > > explanation, please repost if needed. > > > > Our detailed explanation that revert worked. You provided the patch that > > broke, so please let's not require from users to debug it. > > > > If you need a help to reproduce and/or test some hypothesis, Shachar > > will be happy to help you, just ask. > > I have asked already, and received files that showed no ICMP relevant > interactions. > > Can someone from your team help Shachar to get a packet capture of > both TCP _and_ ICMP packets ? > > Otherwise there is little I can do. I can not blindly trust someone > that a valid patch broke something, just because 'something broke' Alternative to a packet capture would be a packetdrill test. Following test passes with the new kernel, and breaks with the old ones. // Set up config. `../common/defaults.sh ../common/set_sysctls.py /proc/sys/net/ipv4/tcp_syn_retries=12 \ /proc/sys/net/ipv4/tcp_syn_linear_timeouts=4 ` // Establish a connection. 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0 > S 0:0(0) win 65535 <mss 1460,sackOK,TS val 80 ecr 0,nop,wscale 8> +1 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 0, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 1, tcpi_total_rto}% +1 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 0, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 2, tcpi_total_rto}% +1 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 0, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 3, tcpi_total_rto}% // RFC 6069 : this ICMP message is ignored while in linear timeout mode +.5 < icmp unreachable host_unreachable [0:0(0)] +0.5 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 0, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 4, tcpi_total_rto}% +1 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 1, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 5, tcpi_total_rto}% +2 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 2, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 6, tcpi_total_rto}% +4 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 3, tcpi_backoff}% +0 %{ assert tcpi_total_rto == 7, tcpi_total_rto}% // RFC 6069 : this ICMP message should undo one retransmission timer backoff +.5 < icmp unreachable host_unreachable [0:0(0)] +0 %{ assert tcpi_backoff == 2, tcpi_backoff }% // RFC 6069 : this ICMP message should undo one retransmission timer backoff +.4 < icmp unreachable host_unreachable [0:0(0)] +0 %{ assert tcpi_backoff == 1, tcpi_backoff }% // RFC 6069 : this ICMP message should undo one retransmission timer backoff +0 < icmp unreachable host_unreachable [0:0(0)] +0 %{ assert tcpi_backoff == 0, tcpi_backoff }% +.1 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 1, tcpi_backoff }% // RFC 6069 : this ICMP message should undo one retransmission timer backoff +.6 < icmp unreachable net_unreachable [0:0(0)] +0 %{ assert tcpi_backoff == 0, tcpi_backoff }% +.4 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 1, tcpi_backoff }% //RFC 6069 : other ICMP messages should not undo one timer backoff +0 < icmp unreachable source_route_failed [1:1(0)] +0 < icmp unreachable net_unreachable_for_tos [1:1(0)] +0 < icmp unreachable host_unreachable_for_tos [1:1(0)] +0 %{ assert tcpi_backoff == 1, tcpi_backoff }% +2 > S 0:0(0) win 65535 <...> +0 %{ assert tcpi_backoff == 2, tcpi_backoff }%
On Tue, Jan 02, 2024 at 11:03:55AM +0100, Eric Dumazet wrote: > On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > > very popular tool to manage fleet of VMs stopped to work after commit > > > > citied in Fixes line. > > > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > > The steps are: > > > > * create vagrant file > > > > * vagrant up > > > > * vagrant halt (VM is created but shut down) > > > > * vagrant up - fail > > > > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > > > I have been on vacation for some time. I may have missed a detailed > > > explanation, please repost if needed. > > > > Our detailed explanation that revert worked. You provided the patch that > > broke, so please let's not require from users to debug it. > > > > If you need a help to reproduce and/or test some hypothesis, Shachar > > will be happy to help you, just ask. > > I have asked already, and received files that showed no ICMP relevant > interactions. > > Can someone from your team help Shachar to get a packet capture of > both TCP _and_ ICMP packets ? I or Gal will help her, but for now let's revert it, before we will see this breakage in merge window and later in all other branches which will be based on -rc1. > > Otherwise there is little I can do. I can not blindly trust someone > that a valid patch broke something, just because 'something broke' We use standard Vagrant, you can try to reproduce the issue locally. Thanks
On Tue, Jan 2, 2024 at 12:41 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Tue, Jan 02, 2024 at 11:03:55AM +0100, Eric Dumazet wrote: > > On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > > > very popular tool to manage fleet of VMs stopped to work after commit > > > > > citied in Fixes line. > > > > > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > > > The steps are: > > > > > * create vagrant file > > > > > * vagrant up > > > > > * vagrant halt (VM is created but shut down) > > > > > * vagrant up - fail > > > > > > > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > > > > > I have been on vacation for some time. I may have missed a detailed > > > > explanation, please repost if needed. > > > > > > Our detailed explanation that revert worked. You provided the patch that > > > broke, so please let's not require from users to debug it. > > > > > > If you need a help to reproduce and/or test some hypothesis, Shachar > > > will be happy to help you, just ask. > > > > I have asked already, and received files that showed no ICMP relevant > > interactions. > > > > Can someone from your team help Shachar to get a packet capture of > > both TCP _and_ ICMP packets ? > > I or Gal will help her, but for now let's revert it, before we will see > this breakage in merge window and later in all other branches which will > be based on -rc1. Patch is in net-next, we have at least four weeks to find the root cause. I am a TCP maintainer, I will ask you to respect my choice, we have tests and reverting the patch is breaking one of them.
On Tue, Jan 02, 2024 at 04:31:15PM +0100, Eric Dumazet wrote: > On Tue, Jan 2, 2024 at 12:41 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Tue, Jan 02, 2024 at 11:03:55AM +0100, Eric Dumazet wrote: > > > On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > > > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > > > > very popular tool to manage fleet of VMs stopped to work after commit > > > > > > citied in Fixes line. > > > > > > > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > > > > The steps are: > > > > > > * create vagrant file > > > > > > * vagrant up > > > > > > * vagrant halt (VM is created but shut down) > > > > > > * vagrant up - fail > > > > > > > > > > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > > > > > > > I have been on vacation for some time. I may have missed a detailed > > > > > explanation, please repost if needed. > > > > > > > > Our detailed explanation that revert worked. You provided the patch that > > > > broke, so please let's not require from users to debug it. > > > > > > > > If you need a help to reproduce and/or test some hypothesis, Shachar > > > > will be happy to help you, just ask. > > > > > > I have asked already, and received files that showed no ICMP relevant > > > interactions. > > > > > > Can someone from your team help Shachar to get a packet capture of > > > both TCP _and_ ICMP packets ? > > > > I or Gal will help her, but for now let's revert it, before we will see > > this breakage in merge window and later in all other branches which will > > be based on -rc1. > > Patch is in net-next, we have at least four weeks to find the root cause. I saw more than once claims that netdev is fast to take patches but also fast in reverts. There is no need to keep patch with known regression, while we are in -rc8. > > I am a TCP maintainer, I will ask you to respect my choice, we have > tests and reverting the patch is breaking one of them. At least for ipv6, you changed code from 2016 and the patch which I'm asking to revert is not even marked as a fix. So I don't understand the urgency to keep the patch. There are two things to consider: 1. Linux rule number one is "do not break userspace". 2. Linux is a community project and people can have different opinions, which can be different from your/mine. Thanks
On Tue, Jan 2, 2024 at 7:01 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Tue, Jan 02, 2024 at 04:31:15PM +0100, Eric Dumazet wrote: > > On Tue, Jan 2, 2024 at 12:41 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > On Tue, Jan 02, 2024 at 11:03:55AM +0100, Eric Dumazet wrote: > > > > On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > > > > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > > > > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > > > > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > > > > > very popular tool to manage fleet of VMs stopped to work after commit > > > > > > > citied in Fixes line. > > > > > > > > > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > > > > > The steps are: > > > > > > > * create vagrant file > > > > > > > * vagrant up > > > > > > > * vagrant halt (VM is created but shut down) > > > > > > > * vagrant up - fail > > > > > > > > > > > > > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > > > > > > > > > I have been on vacation for some time. I may have missed a detailed > > > > > > explanation, please repost if needed. > > > > > > > > > > Our detailed explanation that revert worked. You provided the patch that > > > > > broke, so please let's not require from users to debug it. > > > > > > > > > > If you need a help to reproduce and/or test some hypothesis, Shachar > > > > > will be happy to help you, just ask. > > > > > > > > I have asked already, and received files that showed no ICMP relevant > > > > interactions. > > > > > > > > Can someone from your team help Shachar to get a packet capture of > > > > both TCP _and_ ICMP packets ? > > > > > > I or Gal will help her, but for now let's revert it, before we will see > > > this breakage in merge window and later in all other branches which will > > > be based on -rc1. > > > > Patch is in net-next, we have at least four weeks to find the root cause. > > I saw more than once claims that netdev is fast to take patches but also > fast in reverts. There is no need to keep patch with known regression, > while we are in -rc8. This patch is not in rc8, unless I am mistaken ? > > > > > I am a TCP maintainer, I will ask you to respect my choice, we have > > tests and reverting the patch is breaking one of them. > > At least for ipv6, you changed code from 2016 and the patch which I'm asking > to revert is not even marked as a fix. So I don't understand the urgency to keep > the patch. Do you have an issue with IPv4 code or IPv6 ? It would help to have details. > > There are two things to consider: > 1. Linux rule number one is "do not break userspace". No released kernel contains the issue yet. Nothing broke yet. net-next is for developers. > 2. Linux is a community project and people can have different opinions, > which can be different from your/mine. > > Thanks I think we have time, and getting this patch with potential users on it will help to debug the issue.
On Tue, Jan 02, 2024 at 07:33:23PM +0100, Eric Dumazet wrote: > On Tue, Jan 2, 2024 at 7:01 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Tue, Jan 02, 2024 at 04:31:15PM +0100, Eric Dumazet wrote: > > > On Tue, Jan 2, 2024 at 12:41 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > On Tue, Jan 02, 2024 at 11:03:55AM +0100, Eric Dumazet wrote: > > > > > On Tue, Jan 2, 2024 at 10:58 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > On Tue, Jan 02, 2024 at 10:46:13AM +0100, Eric Dumazet wrote: > > > > > > > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > > > > > > > > > > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > > > > > > > > > > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > > > > > > > very popular tool to manage fleet of VMs stopped to work after commit > > > > > > > > citied in Fixes line. > > > > > > > > > > > > > > > > The issue appears while using Vagrant to manage nested VMs. > > > > > > > > The steps are: > > > > > > > > * create vagrant file > > > > > > > > * vagrant up > > > > > > > > * vagrant halt (VM is created but shut down) > > > > > > > > * vagrant up - fail > > > > > > > > > > > > > > > > > > > > > > I would rather have an explanation, instead of reverting a valid patch. > > > > > > > > > > > > > > I have been on vacation for some time. I may have missed a detailed > > > > > > > explanation, please repost if needed. > > > > > > > > > > > > Our detailed explanation that revert worked. You provided the patch that > > > > > > broke, so please let's not require from users to debug it. > > > > > > > > > > > > If you need a help to reproduce and/or test some hypothesis, Shachar > > > > > > will be happy to help you, just ask. > > > > > > > > > > I have asked already, and received files that showed no ICMP relevant > > > > > interactions. > > > > > > > > > > Can someone from your team help Shachar to get a packet capture of > > > > > both TCP _and_ ICMP packets ? > > > > > > > > I or Gal will help her, but for now let's revert it, before we will see > > > > this breakage in merge window and later in all other branches which will > > > > be based on -rc1. > > > > > > Patch is in net-next, we have at least four weeks to find the root cause. > > > > I saw more than once claims that netdev is fast to take patches but also > > fast in reverts. There is no need to keep patch with known regression, > > while we are in -rc8. > > This patch is not in rc8, unless I am mistaken ? Patch is not, but the timeline is. Right now, we are in -rc8 and in next week, the merge window will start. > > > > > > > > > I am a TCP maintainer, I will ask you to respect my choice, we have > > > tests and reverting the patch is breaking one of them. > > > > At least for ipv6, you changed code from 2016 and the patch which I'm asking > > to revert is not even marked as a fix. So I don't understand the urgency to keep > > the patch. > > Do you have an issue with IPv4 code or IPv6 ? I think that IPv4, I looked on IPv6 dates just because it was easy to do. It looks like IPv4 code which you are changing existed in its current form even before git era. > > It would help to have details. We prepared raw PCAP files and I'm uploading them. It takes time. > > > > There are two things to consider: > > 1. Linux rule number one is "do not break userspace". > > No released kernel contains the issue yet. Nothing broke yet. After merge window, it will and I want to avoid it. > > net-next is for developers. It is encouraged to test net-next, so we did it and reported. > > > 2. Linux is a community project and people can have different opinions, > > which can be different from your/mine. > > > > Thanks > > I think we have time, and getting this patch with potential users on > it will help to debug the issue. Like I said, Shachar can test any debug patches, just ask her. Thanks
On Tue, 2 Jan 2024 10:46:13 +0100 Eric Dumazet wrote: > > The issue appears while using Vagrant to manage nested VMs. > > The steps are: > > * create vagrant file > > * vagrant up > > * vagrant halt (VM is created but shut down) > > * vagrant up - fail > > I would rather have an explanation, instead of reverting a valid patch. +1 obviously. Your refusal to debug this any further does not put nVidia's TCP / NVMe offload in a good light. On one hand you claim to have TCP experts in house and are pushing TCP offloads and on the other you can't debug a TCP issue for which you reportedly have an easy repro? Does not add up.
On Tue, Jan 02, 2024 at 02:02:32PM -0800, Jakub Kicinski wrote: > On Tue, 2 Jan 2024 10:46:13 +0100 Eric Dumazet wrote: > > > The issue appears while using Vagrant to manage nested VMs. > > > The steps are: > > > * create vagrant file > > > * vagrant up > > > * vagrant halt (VM is created but shut down) > > > * vagrant up - fail > > > > I would rather have an explanation, instead of reverting a valid patch. > > +1 obviously. Your refusal to debug this any further does not put > nVidia's TCP / NVMe offload in a good light. On one hand you > claim to have TCP experts in house and are pushing TCP offloads and > on the other you can't debug a TCP issue for which you reportedly > have an easy repro? Does not add up. Did I claim about TCP experts? No. Did we cause to Vagrant to stop working? No. Did we write problematic patch? No. So let's not ask from the people who by chance tested the code to debug it. Thanks
On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > From: Shachar Kagan <skagan@nvidia.com> > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > very popular tool to manage fleet of VMs stopped to work after commit > citied in Fixes line. > > The issue appears while using Vagrant to manage nested VMs. > The steps are: > * create vagrant file > * vagrant up > * vagrant halt (VM is created but shut down) > * vagrant up - fail > > Vagrant up stdout: > Bringing machine 'player1' up with 'libvirt' provider... > ==> player1: Creating shared folders metadata... > ==> player1: Starting domain. > ==> player1: Domain launching with graphics connection settings... > ==> player1: -- Graphics Port: 5900 > ==> player1: -- Graphics IP: 127.0.0.1 > ==> player1: -- Graphics Password: Not defined > ==> player1: -- Graphics Websocket: 5700 > ==> player1: Waiting for domain to get an IP address... > ==> player1: Waiting for machine to boot. This may take a few minutes... > player1: SSH address: 192.168.123.61:22 > player1: SSH username: vagrant > player1: SSH auth method: private key > ==> player1: Attempting graceful shutdown of VM... > ==> player1: Attempting graceful shutdown of VM... > ==> player1: Attempting graceful shutdown of VM... > player1: Guest communication could not be established! This is usually because > player1: SSH is not running, the authentication information was changed, > player1: or some other networking issue. Vagrant will force halt, if > player1: capable. > ==> player1: Attempting direct shutdown of domain... > > Fixes: 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving some ICMP") > Closes: https://lore.kernel.org/all/MN2PR12MB44863139E562A59329E89DBEB982A@MN2PR12MB4486.namprd12.prod.outlook.com > Signed-off-by: Shachar Kagan <skagan@nvidia.com> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> > --- While IPv6 code has an issue (not calling tcp_ld_RTO_revert() helper for TCP_SYN_SENT as intended, I could not find the root cause for your case. We will submit the patch again for 6.9, once we get to the root cause. Reviewed-by: Eric Dumazet <edumazet@google.com>
Hello: This patch was applied to netdev/net-next.git (main) by Jakub Kicinski <kuba@kernel.org>: On Tue, 2 Jan 2024 11:00:57 +0200 you wrote: > From: Shachar Kagan <skagan@nvidia.com> > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > very popular tool to manage fleet of VMs stopped to work after commit > citied in Fixes line. > > [...] Here is the summary with links: - [net-next] tcp: Revert no longer abort SYN_SENT when receiving some ICMP https://git.kernel.org/netdev/net-next/c/b59db45d7eba You are awesome, thank you!
On Mon, Jan 08, 2024 at 04:52:39PM +0100, Eric Dumazet wrote: > On Tue, Jan 2, 2024 at 10:01 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > From: Shachar Kagan <skagan@nvidia.com> > > > > This reverts commit 0a8de364ff7a14558e9676f424283148110384d6. > > > > Shachar reported that Vagrant (https://www.vagrantup.com/), which is > > very popular tool to manage fleet of VMs stopped to work after commit > > citied in Fixes line. > > > > The issue appears while using Vagrant to manage nested VMs. > > The steps are: > > * create vagrant file > > * vagrant up > > * vagrant halt (VM is created but shut down) > > * vagrant up - fail > > > > Vagrant up stdout: > > Bringing machine 'player1' up with 'libvirt' provider... > > ==> player1: Creating shared folders metadata... > > ==> player1: Starting domain. > > ==> player1: Domain launching with graphics connection settings... > > ==> player1: -- Graphics Port: 5900 > > ==> player1: -- Graphics IP: 127.0.0.1 > > ==> player1: -- Graphics Password: Not defined > > ==> player1: -- Graphics Websocket: 5700 > > ==> player1: Waiting for domain to get an IP address... > > ==> player1: Waiting for machine to boot. This may take a few minutes... > > player1: SSH address: 192.168.123.61:22 > > player1: SSH username: vagrant > > player1: SSH auth method: private key > > ==> player1: Attempting graceful shutdown of VM... > > ==> player1: Attempting graceful shutdown of VM... > > ==> player1: Attempting graceful shutdown of VM... > > player1: Guest communication could not be established! This is usually because > > player1: SSH is not running, the authentication information was changed, > > player1: or some other networking issue. Vagrant will force halt, if > > player1: capable. > > ==> player1: Attempting direct shutdown of domain... > > > > Fixes: 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving some ICMP") > > Closes: https://lore.kernel.org/all/MN2PR12MB44863139E562A59329E89DBEB982A@MN2PR12MB4486.namprd12.prod.outlook.com > > Signed-off-by: Shachar Kagan <skagan@nvidia.com> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> > > --- > > While IPv6 code has an issue (not calling tcp_ld_RTO_revert() helper > for TCP_SYN_SENT as intended, > I could not find the root cause for your case. > > We will submit the patch again for 6.9, once we get to the root cause. Feel free to send to Shachar with CC me and/or Gal any patches which you would like to test in advance and we will be happy to do it. > > Reviewed-by: Eric Dumazet <edumazet@google.com> Thanks
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 4bac6e319aca..0c50c5a32b84 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -482,7 +482,6 @@ int tcp_v4_err(struct sk_buff *skb, u32 info) const int code = icmp_hdr(skb)->code; struct sock *sk; struct request_sock *fastopen; - bool harderr = false; u32 seq, snd_una; int err; struct net *net = dev_net(skb->dev); @@ -556,7 +555,6 @@ int tcp_v4_err(struct sk_buff *skb, u32 info) goto out; case ICMP_PARAMETERPROB: err = EPROTO; - harderr = true; break; case ICMP_DEST_UNREACH: if (code > NR_ICMP_UNREACH) @@ -581,7 +579,6 @@ int tcp_v4_err(struct sk_buff *skb, u32 info) } err = icmp_err_convert[code].errno; - harderr = icmp_err_convert[code].fatal; /* check if this ICMP message allows revert of backoff. * (see RFC 6069) */ @@ -607,9 +604,6 @@ int tcp_v4_err(struct sk_buff *skb, u32 info) ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th); - if (!harderr) - break; - if (!sock_owned_by_user(sk)) { WRITE_ONCE(sk->sk_err, err); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index d1307d77a6f0..57b25b1fc9d9 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -381,7 +381,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, struct tcp_sock *tp; __u32 seq, snd_una; struct sock *sk; - bool harderr; + bool fatal; int err; sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row.hashinfo, @@ -402,9 +402,9 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, return 0; } seq = ntohl(th->seq); - harderr = icmpv6_err_convert(type, code, &err); + fatal = icmpv6_err_convert(type, code, &err); if (sk->sk_state == TCP_NEW_SYN_RECV) { - tcp_req_err(sk, seq, harderr); + tcp_req_err(sk, seq, fatal); return 0; } @@ -489,9 +489,6 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, ipv6_icmp_error(sk, skb, err, th->dest, ntohl(info), (u8 *)th); - if (!harderr) - break; - if (!sock_owned_by_user(sk)) { WRITE_ONCE(sk->sk_err, err); sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */