Message ID | de91843a7f59feb065475ca82be22c275bede3df.1673666803.git.lucien.xin@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: support ipv4 big tcp | expand |
On 1/13/23 8:31 PM, Xin Long wrote: > For IPv6 jumbogram packets, the packet size is bigger than 65535, > it's not right to get it from payload_len and save it to an u16 > variable. > > This patch only fixes it for IPv6 BIG TCP packets, so instead of > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only > gets the pktlen via 'skb->len - skb_network_offset(skb)' when > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 > BIG TCP packets. > > This fix will also help us add selftest for IPv6 BIG TCP in the > following patch. > If this is a bug fix for the existing IPv6 support, send it outside of this set for -net.
On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > On 1/13/23 8:31 PM, Xin Long wrote: > > For IPv6 jumbogram packets, the packet size is bigger than 65535, > > it's not right to get it from payload_len and save it to an u16 > > variable. > > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 > > BIG TCP packets. > > > > This fix will also help us add selftest for IPv6 BIG TCP in the > > following patch. > > > > If this is a bug fix for the existing IPv6 support, send it outside of > this set for -net. > Sure, I was thinking of adding it here to be able to support selftest for IPv6 too in the next patch. But it seems to make more sense to get it into -net first, then add this selftest after it goes to net-next. I will post it and all other fixes I mentioned in the cover-letter for IPv6 BIG TCP for -net. But before that, I hope Eric can confirm it is okay to read the length of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch, instead of parsing JUMBO exthdr? Thanks.
On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote: > > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > On 1/13/23 8:31 PM, Xin Long wrote: > > > For IPv6 jumbogram packets, the packet size is bigger than 65535, > > > it's not right to get it from payload_len and save it to an u16 > > > variable. > > > > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 > > > BIG TCP packets. > > > > > > This fix will also help us add selftest for IPv6 BIG TCP in the > > > following patch. > > > > > > > If this is a bug fix for the existing IPv6 support, send it outside of > > this set for -net. > > > Sure, > I was thinking of adding it here to be able to support selftest for > IPv6 too in the next patch. But it seems to make more sense to > get it into -net first, then add this selftest after it goes to net-next. > > I will post it and all other fixes I mentioned in the cover-letter for > IPv6 BIG TCP for -net. > > But before that, I hope Eric can confirm it is okay to read the length > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch, > instead of parsing JUMBO exthdr? > I do not think it is ok, but I will leave the question to netfilter maintainers. Guessing things in tcpdump or other tools is up to user space implementations, trying to work around some (kernel ?) deficiencies. Yes, IPv6 extensions headers are a pain, we all agree. Look at how ip6_rcv_core() properly dissects extension headers _and_ trim skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core() or ipv6_hop_jumbo()) So skb->len is not the root of trust. Some transport mediums might add paddings. Ipv4 has a similar logic in ip_rcv_core(). len = ntohs(iph->tot_len); if (skb->len < len) { drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL; __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); goto drop; } else if (len < (iph->ihl*4)) goto inhdr_error; /* Our transport medium may have padded the buffer out. Now we know it * is IP we can trim to the true length of the frame. * Note this now means skb->len holds ntohs(iph->tot_len). */ if (pskb_trim_rcsum(skb, len)) { __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); goto drop; } After your changes, we might accept illegal packets that were properly dropped before.
On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote: > > On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > > > On 1/13/23 8:31 PM, Xin Long wrote: > > > > For IPv6 jumbogram packets, the packet size is bigger than 65535, > > > > it's not right to get it from payload_len and save it to an u16 > > > > variable. > > > > > > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of > > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only > > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when > > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 > > > > BIG TCP packets. > > > > > > > > This fix will also help us add selftest for IPv6 BIG TCP in the > > > > following patch. > > > > > > > > > > If this is a bug fix for the existing IPv6 support, send it outside of > > > this set for -net. > > > > > Sure, > > I was thinking of adding it here to be able to support selftest for > > IPv6 too in the next patch. But it seems to make more sense to > > get it into -net first, then add this selftest after it goes to net-next. > > > > I will post it and all other fixes I mentioned in the cover-letter for > > IPv6 BIG TCP for -net. > > > > But before that, I hope Eric can confirm it is okay to read the length > > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch, > > instead of parsing JUMBO exthdr? > > > > I do not think it is ok, but I will leave the question to netfilter maintainers. Just note that the issue doesn't only exist in netfilter. All the changes in Patch 2-7 from this patchset are also needed for IPv6 BIG TCP packets. > > Guessing things in tcpdump or other tools is up to user space implementations, > trying to work around some (kernel ?) deficiencies. > > Yes, IPv6 extensions headers are a pain, we all agree. > > Look at how ip6_rcv_core() properly dissects extension headers _and_ trim > skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core() > or ipv6_hop_jumbo()) > > So skb->len is not the root of trust. Some transport mediums might add paddings. > > Ipv4 has a similar logic in ip_rcv_core(). > > len = ntohs(iph->tot_len); > if (skb->len < len) { > drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL; > __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); > goto drop; > } else if (len < (iph->ihl*4)) > goto inhdr_error; > > /* Our transport medium may have padded the buffer out. Now we know it > * is IP we can trim to the true length of the frame. > * Note this now means skb->len holds ntohs(iph->tot_len). > */ > if (pskb_trim_rcsum(skb, len)) { > __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); > goto drop; > } > > After your changes, we might accept illegal packets that were properly > dropped before. I think skb->len is trustable for GSO/GRO packets. In ipv6_gro_complete/inet_gro_complete(): The new length for payload_len or iph->tot_len are all calculated from skb->len. As I said in the cover-letter, "there is no padding in GSO/GRO packets". Or am I missing something? Thanks.
On 1/15/23 1:14 PM, Xin Long wrote: >> >> So skb->len is not the root of trust. Some transport mediums might add paddings. >> That padding is to meet minimum packet sizes AIUI. Ethernet for example has a minimum packet length of 60B. Some protocols running over ethernet can generate packets less than 60B. For example, ARP is 52B as I recall so something in the hardware pipeline must pad that out to 60B. IPv4 with UDP and small L4 payloads is another example. So the trim is to remove the padding added to meet a minimum size. That should not be relevant for large packets. >> Ipv4 has a similar logic in ip_rcv_core(). >> >> len = ntohs(iph->tot_len); >> if (skb->len < len) { >> drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL; >> __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); >> goto drop; >> } else if (len < (iph->ihl*4)) >> goto inhdr_error; >> >> /* Our transport medium may have padded the buffer out. Now we know it >> * is IP we can trim to the true length of the frame. >> * Note this now means skb->len holds ntohs(iph->tot_len). >> */ >> if (pskb_trim_rcsum(skb, len)) { >> __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); >> goto drop; >> } >> >> After your changes, we might accept illegal packets that were properly >> dropped before. > I think skb->len is trustable for GSO/GRO packets. That is my understanding as well. > In ipv6_gro_complete/inet_gro_complete(): > The new length for payload_len or iph->tot_len are all calculated from skb->len. > As I said in the cover-letter, "there is no padding in GSO/GRO packets". > Or am I missing something? > > Thanks.
On 1/15/23 12:40 PM, Eric Dumazet wrote: > On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote: >> >> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: >>> >>> On 1/13/23 8:31 PM, Xin Long wrote: >>>> For IPv6 jumbogram packets, the packet size is bigger than 65535, >>>> it's not right to get it from payload_len and save it to an u16 >>>> variable. >>>> >>>> This patch only fixes it for IPv6 BIG TCP packets, so instead of >>>> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only >>>> gets the pktlen via 'skb->len - skb_network_offset(skb)' when >>>> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 >>>> BIG TCP packets. >>>> >>>> This fix will also help us add selftest for IPv6 BIG TCP in the >>>> following patch. >>>> >>> >>> If this is a bug fix for the existing IPv6 support, send it outside of >>> this set for -net. >>> >> Sure, >> I was thinking of adding it here to be able to support selftest for >> IPv6 too in the next patch. But it seems to make more sense to >> get it into -net first, then add this selftest after it goes to net-next. >> >> I will post it and all other fixes I mentioned in the cover-letter for >> IPv6 BIG TCP for -net. >> >> But before that, I hope Eric can confirm it is okay to read the length >> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch, >> instead of parsing JUMBO exthdr? >> > > I do not think it is ok, but I will leave the question to netfilter maintainers. > > Guessing things in tcpdump or other tools is up to user space implementations, > trying to work around some (kernel ?) deficiencies. > > Yes, IPv6 extensions headers are a pain, we all agree. > > Look at how ip6_rcv_core() properly dissects extension headers _and_ trim > skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core() > or ipv6_hop_jumbo()) > > So skb->len is not the root of trust. Some transport mediums might add paddings. > > Ipv4 has a similar logic in ip_rcv_core(). > > len = ntohs(iph->tot_len); > if (skb->len < len) { > drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL; > __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); > goto drop; > } else if (len < (iph->ihl*4)) > goto inhdr_error; > > /* Our transport medium may have padded the buffer out. Now we know it > * is IP we can trim to the true length of the frame. > * Note this now means skb->len holds ntohs(iph->tot_len). > */ > if (pskb_trim_rcsum(skb, len)) { > __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); > goto drop; > } > > After your changes, we might accept illegal packets that were properly > dropped before.
On Sun, Jan 15, 2023 at 9:15 PM Xin Long <lucien.xin@gmail.com> wrote: > > On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > > > > > On 1/13/23 8:31 PM, Xin Long wrote: > > > > > For IPv6 jumbogram packets, the packet size is bigger than 65535, > > > > > it's not right to get it from payload_len and save it to an u16 > > > > > variable. > > > > > > > > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of > > > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only > > > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when > > > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 > > > > > BIG TCP packets. > > > > > > > > > > This fix will also help us add selftest for IPv6 BIG TCP in the > > > > > following patch. > > > > > > > > > > > > > If this is a bug fix for the existing IPv6 support, send it outside of > > > > this set for -net. > > > > > > > Sure, > > > I was thinking of adding it here to be able to support selftest for > > > IPv6 too in the next patch. But it seems to make more sense to > > > get it into -net first, then add this selftest after it goes to net-next. > > > > > > I will post it and all other fixes I mentioned in the cover-letter for > > > IPv6 BIG TCP for -net. > > > > > > But before that, I hope Eric can confirm it is okay to read the length > > > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch, > > > instead of parsing JUMBO exthdr? > > > > > > > I do not think it is ok, but I will leave the question to netfilter maintainers. > Just note that the issue doesn't only exist in netfilter. > All the changes in Patch 2-7 from this patchset are also needed for IPv6 > BIG TCP packets. > > > > > Guessing things in tcpdump or other tools is up to user space implementations, > > trying to work around some (kernel ?) deficiencies. > > > > Yes, IPv6 extensions headers are a pain, we all agree. > > > > Look at how ip6_rcv_core() properly dissects extension headers _and_ trim > > skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core() > > or ipv6_hop_jumbo()) > > > > So skb->len is not the root of trust. Some transport mediums might add paddings. > > > > Ipv4 has a similar logic in ip_rcv_core(). > > > > len = ntohs(iph->tot_len); > > if (skb->len < len) { > > drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL; > > __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); > > goto drop; > > } else if (len < (iph->ihl*4)) > > goto inhdr_error; > > > > /* Our transport medium may have padded the buffer out. Now we know it > > * is IP we can trim to the true length of the frame. > > * Note this now means skb->len holds ntohs(iph->tot_len). > > */ > > if (pskb_trim_rcsum(skb, len)) { > > __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); > > goto drop; > > } > > > > After your changes, we might accept illegal packets that were properly > > dropped before. > I think skb->len is trustable for GSO/GRO packets. > In ipv6_gro_complete/inet_gro_complete(): > The new length for payload_len or iph->tot_len are all calculated from skb->len. > As I said in the cover-letter, "there is no padding in GSO/GRO packets". > Or am I missing something? This seems to be a contract violation with user space providing GSO packets. In our changes we added some sanity checks, inherent to JUMBO specs. Here, a GSO packet can now have a zero ip length, no matter if it is BIG TCP or not. It seems we lower the bar for consistency, and allow bugs (say changing skb->len) to not be detected. As you said, user space sniffing packets now have to guess what is the intent, instead of headers carrying all the needed information that can be fully validated by parsers.
On 1/16/23 2:24 AM, Eric Dumazet wrote: > On Sun, Jan 15, 2023 at 9:15 PM Xin Long <lucien.xin@gmail.com> wrote: >> >> On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote: >>> >>> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote: >>>> >>>> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: >>>>> >>>>> On 1/13/23 8:31 PM, Xin Long wrote: >>>>>> For IPv6 jumbogram packets, the packet size is bigger than 65535, >>>>>> it's not right to get it from payload_len and save it to an u16 >>>>>> variable. >>>>>> >>>>>> This patch only fixes it for IPv6 BIG TCP packets, so instead of >>>>>> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only >>>>>> gets the pktlen via 'skb->len - skb_network_offset(skb)' when >>>>>> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 >>>>>> BIG TCP packets. >>>>>> >>>>>> This fix will also help us add selftest for IPv6 BIG TCP in the >>>>>> following patch. >>>>>> >>>>> >>>>> If this is a bug fix for the existing IPv6 support, send it outside of >>>>> this set for -net. >>>>> >>>> Sure, >>>> I was thinking of adding it here to be able to support selftest for >>>> IPv6 too in the next patch. But it seems to make more sense to >>>> get it into -net first, then add this selftest after it goes to net-next. >>>> >>>> I will post it and all other fixes I mentioned in the cover-letter for >>>> IPv6 BIG TCP for -net. >>>> >>>> But before that, I hope Eric can confirm it is okay to read the length >>>> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch, >>>> instead of parsing JUMBO exthdr? >>>> >>> >>> I do not think it is ok, but I will leave the question to netfilter maintainers. >> Just note that the issue doesn't only exist in netfilter. >> All the changes in Patch 2-7 from this patchset are also needed for IPv6 >> BIG TCP packets. >> >>> >>> Guessing things in tcpdump or other tools is up to user space implementations, >>> trying to work around some (kernel ?) deficiencies. >>> >>> Yes, IPv6 extensions headers are a pain, we all agree. >>> >>> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim >>> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core() >>> or ipv6_hop_jumbo()) >>> >>> So skb->len is not the root of trust. Some transport mediums might add paddings. >>> >>> Ipv4 has a similar logic in ip_rcv_core(). >>> >>> len = ntohs(iph->tot_len); >>> if (skb->len < len) { >>> drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL; >>> __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); >>> goto drop; >>> } else if (len < (iph->ihl*4)) >>> goto inhdr_error; >>> >>> /* Our transport medium may have padded the buffer out. Now we know it >>> * is IP we can trim to the true length of the frame. >>> * Note this now means skb->len holds ntohs(iph->tot_len). >>> */ >>> if (pskb_trim_rcsum(skb, len)) { >>> __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); >>> goto drop; >>> } >>> >>> After your changes, we might accept illegal packets that were properly >>> dropped before. >> I think skb->len is trustable for GSO/GRO packets. >> In ipv6_gro_complete/inet_gro_complete(): >> The new length for payload_len or iph->tot_len are all calculated from skb->len. >> As I said in the cover-letter, "there is no padding in GSO/GRO packets". >> Or am I missing something? > > This seems to be a contract violation with user space providing GSO packets. > > In our changes we added some sanity checks, inherent to JUMBO specs. > > Here, a GSO packet can now have a zero ip length, no matter if it is > BIG TCP or not. Meaning your preference is to set tot_len anytime it is <= 64kB so the only time tot_len == 0 is for large GRO/TSO packets? That is doable. > > It seems we lower the bar for consistency, and allow bugs (say > changing skb->len) to not be detected. not sure why you think it would not be detected. Today's model for gro sets tot_len based on skb->len. There is an inherent trust that the user's of the gro API set the length correctly. If it is not, the payload to userspace would ultimately be non-sense and hence detectable. I tend to use ssh to test changes like this for this reason - L4 payload must make sense. For the Tx path, there is a similar line of trust that the skb->len passed to the L3 layer is correct. IPv4/IPv6 blindly trust what it is told for length. > > As you said, user space sniffing packets now have to guess what is the > intent, instead of headers carrying all the needed information > that can be fully validated by parsers. This is a solveable problem within the packet socket API, and the entire thing is opt-in. If a user's tcpdump / packet capture program is out of date and does not support the new API for large packets, then that user does not have to enable large GRO/TSO.
On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote: > > not sure why you think it would not be detected. Today's model for gro > sets tot_len based on skb->len. There is an inherent trust that the > user's of the gro API set the length correctly. If it is not, the > payload to userspace would ultimately be non-sense and hence detectable. Only if you use some kind of upper protocol adding message integrity verification. > > > > As you said, user space sniffing packets now have to guess what is the > > intent, instead of headers carrying all the needed information > > that can be fully validated by parsers. > > This is a solveable problem within the packet socket API, and the entire > thing is opt-in. If a user's tcpdump / packet capture program is out of > date and does not support the new API for large packets, then that user > does not have to enable large GRO/TSO. I do not see this being solved yet. We have enabled BIG TCP only for IPv6, we do not want the same to magically be enabled for ipv4 when a new kernel is deployed. Make sure that IPV4 BIG TCP is guarded by a separate tunable. Note that our initial patches were adding IPv6 tunables for a reason. We agreed to change them to IFLA_GRO_MAX_SIZE, IFLA_TSO_MAX_SIZE, as long as they would not enable unwanted/untested behavior.
On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote: > > > > > not sure why you think it would not be detected. Today's model for gro > > sets tot_len based on skb->len. There is an inherent trust that the > > user's of the gro API set the length correctly. If it is not, the > > payload to userspace would ultimately be non-sense and hence detectable. > > Only if you use some kind of upper protocol adding message integrity > verification. > > > > > > > As you said, user space sniffing packets now have to guess what is the > > > intent, instead of headers carrying all the needed information > > > that can be fully validated by parsers. > > > > This is a solveable problem within the packet socket API, and the entire > > thing is opt-in. If a user's tcpdump / packet capture program is out of > > date and does not support the new API for large packets, then that user > > does not have to enable large GRO/TSO. > > I do not see this being solved yet. I think it's common that we add a feature that is disabled by default in the kernel if the userspace might not support it. > > We have enabled BIG TCP only for IPv6, we do not want the same to > magically be enabled for ipv4 > when a new kernel is deployed. > > Make sure that IPV4 BIG TCP is guarded by a separate tunable. I can understand this. Thanks.
On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote: > > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote: > > > > > > > > not sure why you think it would not be detected. Today's model for gro > > > sets tot_len based on skb->len. There is an inherent trust that the > > > user's of the gro API set the length correctly. If it is not, the > > > payload to userspace would ultimately be non-sense and hence detectable. > > > > Only if you use some kind of upper protocol adding message integrity > > verification. > > > > > > > > > > As you said, user space sniffing packets now have to guess what is the > > > > intent, instead of headers carrying all the needed information > > > > that can be fully validated by parsers. > > > > > > This is a solveable problem within the packet socket API, and the entire > > > thing is opt-in. If a user's tcpdump / packet capture program is out of > > > date and does not support the new API for large packets, then that user > > > does not have to enable large GRO/TSO. > > > > I do not see this being solved yet. > I think it's common that we add a feature that is disabled by > default in the kernel if the userspace might not support it. One critical feature for us is the ability to restrict packet capture to the headers only. Privacy is a key requirement. https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436 So please make sure that packet captures truncated to headers will still be understood by tcpdump > > > > > We have enabled BIG TCP only for IPv6, we do not want the same to > > magically be enabled for ipv4 > > when a new kernel is deployed. > > > > Make sure that IPV4 BIG TCP is guarded by a separate tunable. > I can understand this. > > Thanks.
On Mon, Jan 16, 2023 at 3:37 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote: > > > > > > > > > > > not sure why you think it would not be detected. Today's model for gro > > > > sets tot_len based on skb->len. There is an inherent trust that the > > > > user's of the gro API set the length correctly. If it is not, the > > > > payload to userspace would ultimately be non-sense and hence detectable. > > > > > > Only if you use some kind of upper protocol adding message integrity > > > verification. > > > > > > > > > > > > > As you said, user space sniffing packets now have to guess what is the > > > > > intent, instead of headers carrying all the needed information > > > > > that can be fully validated by parsers. > > > > > > > > This is a solveable problem within the packet socket API, and the entire > > > > thing is opt-in. If a user's tcpdump / packet capture program is out of > > > > date and does not support the new API for large packets, then that user > > > > does not have to enable large GRO/TSO. > > > > > > I do not see this being solved yet. > > I think it's common that we add a feature that is disabled by > > default in the kernel if the userspace might not support it. > > One critical feature for us is the ability to restrict packet capture > to the headers only. > > Privacy is a key requirement. > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436 > > So please make sure that packet captures truncated to headers will > still be understood by tcpdump IIUC we can try to adjust iph->tot_len to 65535 or at least the length of all headers for IPv4 BIG TCP packets on the PACKET socket rcv path? (or even truncate the ipv4 BIG TCP packets there.). This way tcpdump will be able to parse all the headers. But what if some applications want all the raw data via PACKET sockets? Maybe adjust iph->tot_len only, not truncate the packet? Thanks.
On Tue, Jan 17, 2023 at 10:47 AM Xin Long <lucien.xin@gmail.com> wrote: > > On Mon, Jan 16, 2023 at 3:37 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote: > > > > > > > > > > > > > > not sure why you think it would not be detected. Today's model for gro > > > > > sets tot_len based on skb->len. There is an inherent trust that the > > > > > user's of the gro API set the length correctly. If it is not, the > > > > > payload to userspace would ultimately be non-sense and hence detectable. > > > > > > > > Only if you use some kind of upper protocol adding message integrity > > > > verification. > > > > > > > > > > > > > > > > As you said, user space sniffing packets now have to guess what is the > > > > > > intent, instead of headers carrying all the needed information > > > > > > that can be fully validated by parsers. > > > > > > > > > > This is a solveable problem within the packet socket API, and the entire > > > > > thing is opt-in. If a user's tcpdump / packet capture program is out of > > > > > date and does not support the new API for large packets, then that user > > > > > does not have to enable large GRO/TSO. > > > > > > > > I do not see this being solved yet. > > > I think it's common that we add a feature that is disabled by > > > default in the kernel if the userspace might not support it. > > > > One critical feature for us is the ability to restrict packet capture > > to the headers only. > > > > Privacy is a key requirement. > > > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436 > > > > So please make sure that packet captures truncated to headers will > > still be understood by tcpdump > IIUC we can try to adjust iph->tot_len to 65535 or at least the length > of all headers for IPv4 BIG TCP packets on the PACKET socket rcv > path? (or even truncate the ipv4 BIG TCP packets there.). This way > tcpdump will be able to parse all the headers. > > But what if some applications want all the raw data via PACKET sockets? > Maybe adjust iph->tot_len only, not truncate the packet? > I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in my env (RHEL-8), and it breaks too: 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] it doesn't show what we want from the TCP header either. For the latest tcpdump on upstream, it can display headers well for IPv6 BIG TCP. But we can't expect all systems to use the tcpdump that supports HBH parsing. For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," and 'tshark' even supports it by default. I think we should NOT go with "adjust tot_len" or "truncate packets" way, and it makes more sense to make it supported in "tcpdump" by default, just like in "tshark". I believe commit [1] was added for some problems they've met, we should enable it for both. [1] https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18 Thanks.
On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in > my env (RHEL-8), and it breaks too: > > 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > Please make sure to use a not too old tcpdump. > it doesn't show what we want from the TCP header either. > > For the latest tcpdump on upstream, it can display headers well for > IPv6 BIG TCP. But we can't expect all systems to use the tcpdump > that supports HBH parsing. User error. If an admin wants to diagnose TCP potential issues, it should use a correct version. > > For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," > and 'tshark' even supports it by default. Not with privacy _requirements_, where only the headers are captured. I am keeping a NACK, until you make sure you do not break this important feature. > > I think we should NOT go with "adjust tot_len" or "truncate packets" way, > and it makes more sense to make it supported in "tcpdump" by default, > just like in "tshark". I believe commit [1] was added for some problems > they've met, we should enable it for both. > > [1] https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18 > > Thanks.
On 1/18/23 8:13 PM, Eric Dumazet wrote: > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in >> my env (RHEL-8), and it breaks too: >> >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] >> > > Please make sure to use a not too old tcpdump. > >> it doesn't show what we want from the TCP header either. >> >> For the latest tcpdump on upstream, it can display headers well for >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump >> that supports HBH parsing. > > User error. If an admin wants to diagnose TCP potential issues, it should use > a correct version. Both of those just fall under "if you want a new feature, update your tools." > >> >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," >> and 'tshark' even supports it by default. > > Not with privacy _requirements_, where only the headers are captured. > > I am keeping a NACK, until you make sure you do not break this > important feature. I think the request here is to keep the snaplen in place (e.g., to make only headers visible to userspace) while also returning the >64kB packet length as meta data. My last pass on the packet socket code suggests this is possible; someone (Xin) needs to work through the details.
On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > On 1/18/23 8:13 PM, Eric Dumazet wrote: > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in > >> my env (RHEL-8), and it breaks too: > >> > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > >> > > > > Please make sure to use a not too old tcpdump. > > > >> it doesn't show what we want from the TCP header either. > >> > >> For the latest tcpdump on upstream, it can display headers well for > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump > >> that supports HBH parsing. > > > > User error. If an admin wants to diagnose TCP potential issues, it should use > > a correct version. > > Both of those just fall under "if you want a new feature, update your > tools." > > > > > >> > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," > >> and 'tshark' even supports it by default. > > > > Not with privacy _requirements_, where only the headers are captured. > > > > I am keeping a NACK, until you make sure you do not break this > > important feature. > > I think the request here is to keep the snaplen in place (e.g., to make > only headers visible to userspace) while also returning the >64kB packet > length as meta data. > > My last pass on the packet socket code suggests this is possible; > someone (Xin) needs to work through the details. > To be honest, I don't really like such a change in a packet socket, I tried, and the code doesn't look nice. I'm thinking since skb->len is trustable, why don't we use IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP? namely, only change these 2 helpers to: static inline unsigned int iph_totlen(const struct sk_buff *skb, const struct iphdr *iph) { u16 len = ntohs(iph->tot_len); return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len : skb->len - skb_network_offset(skb); } static inline void iph_set_totlen(struct iphdr *iph, unsigned int len) { iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU); } What do you think?
On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote: > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > On 1/18/23 8:13 PM, Eric Dumazet wrote: > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > > > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in > > >> my env (RHEL-8), and it breaks too: > > >> > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > >> > > > > > > Please make sure to use a not too old tcpdump. > > > > > >> it doesn't show what we want from the TCP header either. > > >> > > >> For the latest tcpdump on upstream, it can display headers well for > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump > > >> that supports HBH parsing. > > > > > > User error. If an admin wants to diagnose TCP potential issues, it should use > > > a correct version. > > > > Both of those just fall under "if you want a new feature, update your > > tools." > > > > > > > > > >> > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," > > >> and 'tshark' even supports it by default. > > > > > > Not with privacy _requirements_, where only the headers are captured. > > > > > > I am keeping a NACK, until you make sure you do not break this > > > important feature. > > > > I think the request here is to keep the snaplen in place (e.g., to make > > only headers visible to userspace) while also returning the >64kB packet > > length as meta data. > > > > My last pass on the packet socket code suggests this is possible; > > someone (Xin) needs to work through the details. > > > To be honest, I don't really like such a change in a packet socket, > I tried, and the code doesn't look nice. > > I'm thinking since skb->len is trustable, why don't we use > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP? > namely, only change these 2 helpers to: > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const > struct iphdr *iph) > { > u16 len = ntohs(iph->tot_len); > > return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len : > skb->len - skb_network_offset(skb); > } > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len) > { > iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU); > } > > What do you think? I think this is a no go for me. I think I stated clearly what was the problem. If you care about TCP diagnostics, you want the truth, not truncated sequence ranges, making it impossible to know if a packet was sent. Without headers describing precisely payload length (solution taken in IPv6 BI TCP), you have to augment AF_PACKET to provide this information in additional meta-data.
On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > > > On 1/18/23 8:13 PM, Eric Dumazet wrote: > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in > > > >> my env (RHEL-8), and it breaks too: > > > >> > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > >> > > > > > > > > Please make sure to use a not too old tcpdump. > > > > > > > >> it doesn't show what we want from the TCP header either. > > > >> > > > >> For the latest tcpdump on upstream, it can display headers well for > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump > > > >> that supports HBH parsing. > > > > > > > > User error. If an admin wants to diagnose TCP potential issues, it should use > > > > a correct version. > > > > > > Both of those just fall under "if you want a new feature, update your > > > tools." > > > > > > > > > > > > > >> > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," > > > >> and 'tshark' even supports it by default. > > > > > > > > Not with privacy _requirements_, where only the headers are captured. > > > > > > > > I am keeping a NACK, until you make sure you do not break this > > > > important feature. > > > > > > I think the request here is to keep the snaplen in place (e.g., to make > > > only headers visible to userspace) while also returning the >64kB packet > > > length as meta data. > > > > > > My last pass on the packet socket code suggests this is possible; > > > someone (Xin) needs to work through the details. > > > > > To be honest, I don't really like such a change in a packet socket, > > I tried, and the code doesn't look nice. > > > > I'm thinking since skb->len is trustable, why don't we use > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP? > > namely, only change these 2 helpers to: > > > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const > > struct iphdr *iph) > > { > > u16 len = ntohs(iph->tot_len); > > > > return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len : > > skb->len - skb_network_offset(skb); > > } > > > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len) > > { > > iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU); > > } > > > > What do you think? > > I think this is a no go for me. > > I think I stated clearly what was the problem. > If you care about TCP diagnostics, you want the truth, not truncated > sequence ranges, > making it impossible to know if a packet was sent. Sorry Eric if I didn't get you well. With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535), all TCP headers will display well, no truncated sequence ranges: # ip net exec router tcpdump -i link1 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val 2975547125 ecr 2379476018], length 65483 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val 2975547125 ecr 2379476018], length 65483 13:36:46.675542 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], seq 1532895495:1532960978, ack 1, win 504, options [nop,nop,TS val 2975547125 ecr 2379476018], length 65483 13:36:46.675550 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], seq 1533021985:1533087468, ack 1, win 504, options [nop,nop,TS val 2975547125 ecr 2379476018], length 65483 I just don't want to modify the iph tot_len in IPv4 header from the raw data in the packet socket. We're trying to avoid iph->tot_len too small for IPv4 BIG TCP to display tcphdr in tcpdump, aren't we? That's why I think using IP_MAX_MTU will avoid this. > > Without headers describing precisely payload length (solution taken in > IPv6 BI TCP), > you have to augment AF_PACKET to provide this information in > additional meta-data. For this, I agree to provide the >64kB packet into additional meta-data. Thanks.
On Thu, Jan 19, 2023 at 7:59 PM Xin Long <lucien.xin@gmail.com> wrote: > > On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > > > > > On 1/18/23 8:13 PM, Eric Dumazet wrote: > > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in > > > > >> my env (RHEL-8), and it breaks too: > > > > >> > > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > >> > > > > > > > > > > Please make sure to use a not too old tcpdump. > > > > > > > > > >> it doesn't show what we want from the TCP header either. > > > > >> > > > > >> For the latest tcpdump on upstream, it can display headers well for > > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump > > > > >> that supports HBH parsing. > > > > > > > > > > User error. If an admin wants to diagnose TCP potential issues, it should use > > > > > a correct version. > > > > > > > > Both of those just fall under "if you want a new feature, update your > > > > tools." > > > > > > > > > > > > > > > > > >> > > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," > > > > >> and 'tshark' even supports it by default. > > > > > > > > > > Not with privacy _requirements_, where only the headers are captured. > > > > > > > > > > I am keeping a NACK, until you make sure you do not break this > > > > > important feature. > > > > > > > > I think the request here is to keep the snaplen in place (e.g., to make > > > > only headers visible to userspace) while also returning the >64kB packet > > > > length as meta data. > > > > > > > > My last pass on the packet socket code suggests this is possible; > > > > someone (Xin) needs to work through the details. > > > > > > > To be honest, I don't really like such a change in a packet socket, > > > I tried, and the code doesn't look nice. > > > > > > I'm thinking since skb->len is trustable, why don't we use > > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP? > > > namely, only change these 2 helpers to: > > > > > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const > > > struct iphdr *iph) > > > { > > > u16 len = ntohs(iph->tot_len); > > > > > > return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len : > > > skb->len - skb_network_offset(skb); > > > } > > > > > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len) > > > { > > > iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU); > > > } > > > > > > What do you think? > > > > I think this is a no go for me. > > > > I think I stated clearly what was the problem. > > If you care about TCP diagnostics, you want the truth, not truncated > > sequence ranges, > > making it impossible to know if a packet was sent. > Sorry Eric if I didn't get you well. > > With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535), > all TCP headers will display well, no truncated sequence ranges: > > # ip net exec router tcpdump -i link1 > 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], > seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val > 2975547125 ecr 2379476018], length 65483 > 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], > seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val > 2975547125 ecr 2379476018], length 65483 This is completely truncated, don't you see this ? According to tcpdump, we sent sequences 1532642515:1532707998 and 1532769005:1532834488 And payload was of 65483 bytes per packet (this is not true) What happened for 1532707998 -> 1532769005 ??? How network engineers will know "oh wait, data was sent/received after all", and not dropped somewhere in the network or in netfilter or ... in a kernel bug. > 13:36:46.675542 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], > seq 1532895495:1532960978, ack 1, win 504, options [nop,nop,TS val > 2975547125 ecr 2379476018], length 65483 > 13:36:46.675550 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], > seq 1533021985:1533087468, ack 1, win 504, options [nop,nop,TS val > 2975547125 ecr 2379476018], length 65483
On Thu, Jan 19, 2023 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Jan 19, 2023 at 7:59 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote: > > > > > > > > > > On 1/18/23 8:13 PM, Eric Dumazet wrote: > > > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote: > > > > > > > > > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in > > > > > >> my env (RHEL-8), and it breaks too: > > > > > >> > > > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH] > > > > > >> > > > > > > > > > > > > Please make sure to use a not too old tcpdump. > > > > > > > > > > > >> it doesn't show what we want from the TCP header either. > > > > > >> > > > > > >> For the latest tcpdump on upstream, it can display headers well for > > > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump > > > > > >> that supports HBH parsing. > > > > > > > > > > > > User error. If an admin wants to diagnose TCP potential issues, it should use > > > > > > a correct version. > > > > > > > > > > Both of those just fall under "if you want a new feature, update your > > > > > tools." > > > > > > > > > > > > > > > > > > > > > >> > > > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump," > > > > > >> and 'tshark' even supports it by default. > > > > > > > > > > > > Not with privacy _requirements_, where only the headers are captured. > > > > > > > > > > > > I am keeping a NACK, until you make sure you do not break this > > > > > > important feature. > > > > > > > > > > I think the request here is to keep the snaplen in place (e.g., to make > > > > > only headers visible to userspace) while also returning the >64kB packet > > > > > length as meta data. > > > > > > > > > > My last pass on the packet socket code suggests this is possible; > > > > > someone (Xin) needs to work through the details. > > > > > > > > > To be honest, I don't really like such a change in a packet socket, > > > > I tried, and the code doesn't look nice. > > > > > > > > I'm thinking since skb->len is trustable, why don't we use > > > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP? > > > > namely, only change these 2 helpers to: > > > > > > > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const > > > > struct iphdr *iph) > > > > { > > > > u16 len = ntohs(iph->tot_len); > > > > > > > > return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len : > > > > skb->len - skb_network_offset(skb); > > > > } > > > > > > > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len) > > > > { > > > > iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU); > > > > } > > > > > > > > What do you think? > > > > > > I think this is a no go for me. > > > > > > I think I stated clearly what was the problem. > > > If you care about TCP diagnostics, you want the truth, not truncated > > > sequence ranges, > > > making it impossible to know if a packet was sent. > > Sorry Eric if I didn't get you well. > > > > With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535), > > all TCP headers will display well, no truncated sequence ranges: > > > > # ip net exec router tcpdump -i link1 > > 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], > > seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val > > 2975547125 ecr 2379476018], length 65483 > > 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.], > > seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val > > 2975547125 ecr 2379476018], length 65483 > > This is completely truncated, don't you see this ? OK, got you now. Thanks for the explanation. > > According to tcpdump, we sent sequences 1532642515:1532707998 and > 1532769005:1532834488 > > And payload was of 65483 bytes per packet (this is not true) > > What happened for 1532707998 -> 1532769005 ??? > > How network engineers will know "oh wait, data was sent/received after all", > and not dropped somewhere in the network or in netfilter or ... in a kernel bug. > I will work on providing the >64kB packet length in meta-data instead. Thanks.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 37dfdcfcdd54..b8edd6c599eb 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -175,6 +175,15 @@ static inline bool inet6_is_jumbogram(const struct sk_buff *skb) return !!(IP6CB(skb)->flags & IP6SKB_JUMBOGRAM); } +static inline unsigned int skb_ipv6_totlen(const struct sk_buff *skb) +{ + u32 pl = ntohs(ipv6_hdr(skb)->payload_len); + + return pl ? pl + sizeof(struct ipv6hdr) + : (skb_is_gso_v6(skb) ? skb->len - skb_network_offset(skb) + : pl + sizeof(struct ipv6hdr)); +} + /* can not be used in TCP layer after tcp_v6_fill_cb */ static inline int inet6_sdif(const struct sk_buff *skb) { diff --git a/net/netfilter/xt_length.c b/net/netfilter/xt_length.c index b3d623a52885..61518ec05c6e 100644 --- a/net/netfilter/xt_length.c +++ b/net/netfilter/xt_length.c @@ -30,8 +30,7 @@ static bool length_mt6(const struct sk_buff *skb, struct xt_action_param *par) { const struct xt_length_info *info = par->matchinfo; - const u_int16_t pktlen = ntohs(ipv6_hdr(skb)->payload_len) + - sizeof(struct ipv6hdr); + u32 pktlen = skb_ipv6_totlen(skb); return (pktlen >= info->min && pktlen <= info->max) ^ info->invert; }
For IPv6 jumbogram packets, the packet size is bigger than 65535, it's not right to get it from payload_len and save it to an u16 variable. This patch only fixes it for IPv6 BIG TCP packets, so instead of parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only gets the pktlen via 'skb->len - skb_network_offset(skb)' when skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4 BIG TCP packets. This fix will also help us add selftest for IPv6 BIG TCP in the following patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> --- include/linux/ipv6.h | 9 +++++++++ net/netfilter/xt_length.c | 3 +-- 2 files changed, 10 insertions(+), 2 deletions(-)