diff mbox series

[net-next,09/10] netfilter: get ipv6 pktlen properly in length_mt6

Message ID de91843a7f59feb065475ca82be22c275bede3df.1673666803.git.lucien.xin@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net: support ipv4 big tcp | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1926 this patch: 1926
netdev/cc_maintainers warning 4 maintainers not CCed: dsahern@kernel.org kadlec@netfilter.org coreteam@netfilter.org netfilter-devel@vger.kernel.org
netdev/build_clang success Errors and warnings before: 497 this patch: 497
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2056 this patch: 2056
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 24 lines checked
netdev/kdoc success Errors and warnings before: 37 this patch: 37
netdev/source_inline success Was 0 now: 0

Commit Message

Xin Long Jan. 14, 2023, 3:31 a.m. UTC
For IPv6 jumbogram packets, the packet size is bigger than 65535,
it's not right to get it from payload_len and save it to an u16
variable.

This patch only fixes it for IPv6 BIG TCP packets, so instead of
parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
gets the pktlen via 'skb->len - skb_network_offset(skb)' when
skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
BIG TCP packets.

This fix will also help us add selftest for IPv6 BIG TCP in the
following patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/ipv6.h      | 9 +++++++++
 net/netfilter/xt_length.c | 3 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

David Ahern Jan. 15, 2023, 3:41 p.m. UTC | #1
On 1/13/23 8:31 PM, Xin Long wrote:
> For IPv6 jumbogram packets, the packet size is bigger than 65535,
> it's not right to get it from payload_len and save it to an u16
> variable.
> 
> This patch only fixes it for IPv6 BIG TCP packets, so instead of
> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> BIG TCP packets.
> 
> This fix will also help us add selftest for IPv6 BIG TCP in the
> following patch.
> 

If this is a bug fix for the existing IPv6 support, send it outside of
this set for -net.
Xin Long Jan. 15, 2023, 5:42 p.m. UTC | #2
On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 1/13/23 8:31 PM, Xin Long wrote:
> > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > it's not right to get it from payload_len and save it to an u16
> > variable.
> >
> > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > BIG TCP packets.
> >
> > This fix will also help us add selftest for IPv6 BIG TCP in the
> > following patch.
> >
>
> If this is a bug fix for the existing IPv6 support, send it outside of
> this set for -net.
>
Sure,
I was thinking of adding it here to be able to support selftest for
IPv6 too in the next patch. But it seems to make more sense to
get it into -net first, then add this selftest after it goes to net-next.

I will post it and all other fixes I mentioned in the cover-letter for
IPv6 BIG TCP for -net.

But before that, I hope Eric can confirm it is okay to read the length
of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
instead of parsing JUMBO exthdr?

Thanks.
Eric Dumazet Jan. 15, 2023, 7:40 p.m. UTC | #3
On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 1/13/23 8:31 PM, Xin Long wrote:
> > > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > > it's not right to get it from payload_len and save it to an u16
> > > variable.
> > >
> > > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > > BIG TCP packets.
> > >
> > > This fix will also help us add selftest for IPv6 BIG TCP in the
> > > following patch.
> > >
> >
> > If this is a bug fix for the existing IPv6 support, send it outside of
> > this set for -net.
> >
> Sure,
> I was thinking of adding it here to be able to support selftest for
> IPv6 too in the next patch. But it seems to make more sense to
> get it into -net first, then add this selftest after it goes to net-next.
>
> I will post it and all other fixes I mentioned in the cover-letter for
> IPv6 BIG TCP for -net.
>
> But before that, I hope Eric can confirm it is okay to read the length
> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
> instead of parsing JUMBO exthdr?
>

I do not think it is ok, but I will leave the question to netfilter maintainers.

Guessing things in tcpdump or other tools is up to user space implementations,
trying to work around some (kernel ?) deficiencies.

Yes, IPv6 extensions headers are a pain, we all agree.

Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
or ipv6_hop_jumbo())

So skb->len is not the root of trust. Some transport mediums might add paddings.

Ipv4 has a similar logic in ip_rcv_core().

len = ntohs(iph->tot_len);
if (skb->len < len) {
     drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
     __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
    goto drop;
} else if (len < (iph->ihl*4))
     goto inhdr_error;

/* Our transport medium may have padded the buffer out. Now we know it
* is IP we can trim to the true length of the frame.
* Note this now means skb->len holds ntohs(iph->tot_len).
*/
if (pskb_trim_rcsum(skb, len)) {
      __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
      goto drop;
}

After your changes, we might accept illegal packets that were properly
dropped before.
Xin Long Jan. 15, 2023, 8:14 p.m. UTC | #4
On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 1/13/23 8:31 PM, Xin Long wrote:
> > > > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > > > it's not right to get it from payload_len and save it to an u16
> > > > variable.
> > > >
> > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > > > BIG TCP packets.
> > > >
> > > > This fix will also help us add selftest for IPv6 BIG TCP in the
> > > > following patch.
> > > >
> > >
> > > If this is a bug fix for the existing IPv6 support, send it outside of
> > > this set for -net.
> > >
> > Sure,
> > I was thinking of adding it here to be able to support selftest for
> > IPv6 too in the next patch. But it seems to make more sense to
> > get it into -net first, then add this selftest after it goes to net-next.
> >
> > I will post it and all other fixes I mentioned in the cover-letter for
> > IPv6 BIG TCP for -net.
> >
> > But before that, I hope Eric can confirm it is okay to read the length
> > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
> > instead of parsing JUMBO exthdr?
> >
>
> I do not think it is ok, but I will leave the question to netfilter maintainers.
Just note that the issue doesn't only exist in netfilter.
All the changes in Patch 2-7 from this patchset are also needed for IPv6
BIG TCP packets.

>
> Guessing things in tcpdump or other tools is up to user space implementations,
> trying to work around some (kernel ?) deficiencies.
>
> Yes, IPv6 extensions headers are a pain, we all agree.
>
> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
> or ipv6_hop_jumbo())
>
> So skb->len is not the root of trust. Some transport mediums might add paddings.
>
> Ipv4 has a similar logic in ip_rcv_core().
>
> len = ntohs(iph->tot_len);
> if (skb->len < len) {
>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>     goto drop;
> } else if (len < (iph->ihl*4))
>      goto inhdr_error;
>
> /* Our transport medium may have padded the buffer out. Now we know it
> * is IP we can trim to the true length of the frame.
> * Note this now means skb->len holds ntohs(iph->tot_len).
> */
> if (pskb_trim_rcsum(skb, len)) {
>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>       goto drop;
> }
>
> After your changes, we might accept illegal packets that were properly
> dropped before.
I think skb->len is trustable for GSO/GRO packets.
In ipv6_gro_complete/inet_gro_complete():
The new length for payload_len or iph->tot_len are all calculated from skb->len.
As I said in the cover-letter, "there is no padding in GSO/GRO packets".
Or am I missing something?

Thanks.
David Ahern Jan. 15, 2023, 11:57 p.m. UTC | #5
On 1/15/23 1:14 PM, Xin Long wrote:
>>
>> So skb->len is not the root of trust. Some transport mediums might add paddings.
>>

That padding is to meet minimum packet sizes AIUI.

Ethernet for example has a minimum packet length of 60B. Some protocols
running over ethernet can generate packets less than 60B. For example,
ARP is 52B as I recall so something in the hardware pipeline must pad
that out to 60B. IPv4 with UDP and small L4 payloads is another example.
So the trim is to remove the padding added to meet a minimum size. That
should not be relevant for large packets.

>> Ipv4 has a similar logic in ip_rcv_core().
>>
>> len = ntohs(iph->tot_len);
>> if (skb->len < len) {
>>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>>     goto drop;
>> } else if (len < (iph->ihl*4))
>>      goto inhdr_error;
>>
>> /* Our transport medium may have padded the buffer out. Now we know it
>> * is IP we can trim to the true length of the frame.
>> * Note this now means skb->len holds ntohs(iph->tot_len).
>> */
>> if (pskb_trim_rcsum(skb, len)) {
>>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>>       goto drop;
>> }
>>
>> After your changes, we might accept illegal packets that were properly
>> dropped before.
> I think skb->len is trustable for GSO/GRO packets.

That is my understanding as well.

> In ipv6_gro_complete/inet_gro_complete():
> The new length for payload_len or iph->tot_len are all calculated from skb->len.
> As I said in the cover-letter, "there is no padding in GSO/GRO packets".
> Or am I missing something?
> 
> Thanks.
David Ahern Jan. 15, 2023, 11:58 p.m. UTC | #6
On 1/15/23 12:40 PM, Eric Dumazet wrote:
> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
>>
>> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>>>
>>> On 1/13/23 8:31 PM, Xin Long wrote:
>>>> For IPv6 jumbogram packets, the packet size is bigger than 65535,
>>>> it's not right to get it from payload_len and save it to an u16
>>>> variable.
>>>>
>>>> This patch only fixes it for IPv6 BIG TCP packets, so instead of
>>>> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
>>>> gets the pktlen via 'skb->len - skb_network_offset(skb)' when
>>>> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
>>>> BIG TCP packets.
>>>>
>>>> This fix will also help us add selftest for IPv6 BIG TCP in the
>>>> following patch.
>>>>
>>>
>>> If this is a bug fix for the existing IPv6 support, send it outside of
>>> this set for -net.
>>>
>> Sure,
>> I was thinking of adding it here to be able to support selftest for
>> IPv6 too in the next patch. But it seems to make more sense to
>> get it into -net first, then add this selftest after it goes to net-next.
>>
>> I will post it and all other fixes I mentioned in the cover-letter for
>> IPv6 BIG TCP for -net.
>>
>> But before that, I hope Eric can confirm it is okay to read the length
>> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
>> instead of parsing JUMBO exthdr?
>>
> 
> I do not think it is ok, but I will leave the question to netfilter maintainers.
> 
> Guessing things in tcpdump or other tools is up to user space implementations,
> trying to work around some (kernel ?) deficiencies.
> 
> Yes, IPv6 extensions headers are a pain, we all agree.
> 
> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
> or ipv6_hop_jumbo())
> 
> So skb->len is not the root of trust. Some transport mediums might add paddings.
> 
> Ipv4 has a similar logic in ip_rcv_core().
> 
> len = ntohs(iph->tot_len);
> if (skb->len < len) {
>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>     goto drop;
> } else if (len < (iph->ihl*4))
>      goto inhdr_error;
> 
> /* Our transport medium may have padded the buffer out. Now we know it
> * is IP we can trim to the true length of the frame.
> * Note this now means skb->len holds ntohs(iph->tot_len).
> */
> if (pskb_trim_rcsum(skb, len)) {
>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>       goto drop;
> }
> 
> After your changes, we might accept illegal packets that were properly
> dropped before.
Eric Dumazet Jan. 16, 2023, 9:24 a.m. UTC | #7
On Sun, Jan 15, 2023 at 9:15 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > > >
> > > > On 1/13/23 8:31 PM, Xin Long wrote:
> > > > > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > > > > it's not right to get it from payload_len and save it to an u16
> > > > > variable.
> > > > >
> > > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > > > > BIG TCP packets.
> > > > >
> > > > > This fix will also help us add selftest for IPv6 BIG TCP in the
> > > > > following patch.
> > > > >
> > > >
> > > > If this is a bug fix for the existing IPv6 support, send it outside of
> > > > this set for -net.
> > > >
> > > Sure,
> > > I was thinking of adding it here to be able to support selftest for
> > > IPv6 too in the next patch. But it seems to make more sense to
> > > get it into -net first, then add this selftest after it goes to net-next.
> > >
> > > I will post it and all other fixes I mentioned in the cover-letter for
> > > IPv6 BIG TCP for -net.
> > >
> > > But before that, I hope Eric can confirm it is okay to read the length
> > > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
> > > instead of parsing JUMBO exthdr?
> > >
> >
> > I do not think it is ok, but I will leave the question to netfilter maintainers.
> Just note that the issue doesn't only exist in netfilter.
> All the changes in Patch 2-7 from this patchset are also needed for IPv6
> BIG TCP packets.
>
> >
> > Guessing things in tcpdump or other tools is up to user space implementations,
> > trying to work around some (kernel ?) deficiencies.
> >
> > Yes, IPv6 extensions headers are a pain, we all agree.
> >
> > Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
> > skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
> > or ipv6_hop_jumbo())
> >
> > So skb->len is not the root of trust. Some transport mediums might add paddings.
> >
> > Ipv4 has a similar logic in ip_rcv_core().
> >
> > len = ntohs(iph->tot_len);
> > if (skb->len < len) {
> >      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
> >      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
> >     goto drop;
> > } else if (len < (iph->ihl*4))
> >      goto inhdr_error;
> >
> > /* Our transport medium may have padded the buffer out. Now we know it
> > * is IP we can trim to the true length of the frame.
> > * Note this now means skb->len holds ntohs(iph->tot_len).
> > */
> > if (pskb_trim_rcsum(skb, len)) {
> >       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
> >       goto drop;
> > }
> >
> > After your changes, we might accept illegal packets that were properly
> > dropped before.
> I think skb->len is trustable for GSO/GRO packets.
> In ipv6_gro_complete/inet_gro_complete():
> The new length for payload_len or iph->tot_len are all calculated from skb->len.
> As I said in the cover-letter, "there is no padding in GSO/GRO packets".
> Or am I missing something?

This seems to be a contract violation with user space providing GSO packets.

In our changes we added some sanity checks, inherent to JUMBO specs.

Here, a GSO packet can now have a zero ip length, no matter if it is
BIG TCP or not.

It seems we lower the bar for consistency, and allow bugs (say
changing skb->len) to not be detected.

As you said, user space sniffing packets now have to guess what is the
intent, instead of headers carrying all the needed information
that can be fully validated by parsers.
David Ahern Jan. 16, 2023, 3:07 p.m. UTC | #8
On 1/16/23 2:24 AM, Eric Dumazet wrote:
> On Sun, Jan 15, 2023 at 9:15 PM Xin Long <lucien.xin@gmail.com> wrote:
>>
>> On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote:
>>>
>>> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
>>>>
>>>> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>>>>>
>>>>> On 1/13/23 8:31 PM, Xin Long wrote:
>>>>>> For IPv6 jumbogram packets, the packet size is bigger than 65535,
>>>>>> it's not right to get it from payload_len and save it to an u16
>>>>>> variable.
>>>>>>
>>>>>> This patch only fixes it for IPv6 BIG TCP packets, so instead of
>>>>>> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
>>>>>> gets the pktlen via 'skb->len - skb_network_offset(skb)' when
>>>>>> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
>>>>>> BIG TCP packets.
>>>>>>
>>>>>> This fix will also help us add selftest for IPv6 BIG TCP in the
>>>>>> following patch.
>>>>>>
>>>>>
>>>>> If this is a bug fix for the existing IPv6 support, send it outside of
>>>>> this set for -net.
>>>>>
>>>> Sure,
>>>> I was thinking of adding it here to be able to support selftest for
>>>> IPv6 too in the next patch. But it seems to make more sense to
>>>> get it into -net first, then add this selftest after it goes to net-next.
>>>>
>>>> I will post it and all other fixes I mentioned in the cover-letter for
>>>> IPv6 BIG TCP for -net.
>>>>
>>>> But before that, I hope Eric can confirm it is okay to read the length
>>>> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
>>>> instead of parsing JUMBO exthdr?
>>>>
>>>
>>> I do not think it is ok, but I will leave the question to netfilter maintainers.
>> Just note that the issue doesn't only exist in netfilter.
>> All the changes in Patch 2-7 from this patchset are also needed for IPv6
>> BIG TCP packets.
>>
>>>
>>> Guessing things in tcpdump or other tools is up to user space implementations,
>>> trying to work around some (kernel ?) deficiencies.
>>>
>>> Yes, IPv6 extensions headers are a pain, we all agree.
>>>
>>> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
>>> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
>>> or ipv6_hop_jumbo())
>>>
>>> So skb->len is not the root of trust. Some transport mediums might add paddings.
>>>
>>> Ipv4 has a similar logic in ip_rcv_core().
>>>
>>> len = ntohs(iph->tot_len);
>>> if (skb->len < len) {
>>>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>>>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>>>     goto drop;
>>> } else if (len < (iph->ihl*4))
>>>      goto inhdr_error;
>>>
>>> /* Our transport medium may have padded the buffer out. Now we know it
>>> * is IP we can trim to the true length of the frame.
>>> * Note this now means skb->len holds ntohs(iph->tot_len).
>>> */
>>> if (pskb_trim_rcsum(skb, len)) {
>>>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>>>       goto drop;
>>> }
>>>
>>> After your changes, we might accept illegal packets that were properly
>>> dropped before.
>> I think skb->len is trustable for GSO/GRO packets.
>> In ipv6_gro_complete/inet_gro_complete():
>> The new length for payload_len or iph->tot_len are all calculated from skb->len.
>> As I said in the cover-letter, "there is no padding in GSO/GRO packets".
>> Or am I missing something?
> 
> This seems to be a contract violation with user space providing GSO packets.
> 
> In our changes we added some sanity checks, inherent to JUMBO specs.
> 
> Here, a GSO packet can now have a zero ip length, no matter if it is
> BIG TCP or not.

Meaning your preference is to set tot_len anytime it is <= 64kB so the
only time tot_len == 0 is for large GRO/TSO packets? That is doable.

> 
> It seems we lower the bar for consistency, and allow bugs (say
> changing skb->len) to not be detected.

not sure why you think it would not be detected. Today's model for gro
sets tot_len based on skb->len. There is an inherent trust that the
user's of the gro API set the length correctly. If it is not, the
payload to userspace would ultimately be non-sense and hence detectable.
I tend to use ssh to test changes like this for this reason - L4 payload
must make sense.

For the Tx path, there is a similar line of trust that the skb->len
passed to the L3 layer is correct. IPv4/IPv6 blindly trust what it is
told for length.


> 
> As you said, user space sniffing packets now have to guess what is the
> intent, instead of headers carrying all the needed information
> that can be fully validated by parsers.

This is a solveable problem within the packet socket API, and the entire
thing is opt-in. If a user's tcpdump / packet capture program is out of
date and does not support the new API for large packets, then that user
does not have to enable large GRO/TSO.
Eric Dumazet Jan. 16, 2023, 4:02 p.m. UTC | #9
On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
>

> not sure why you think it would not be detected. Today's model for gro
> sets tot_len based on skb->len. There is an inherent trust that the
> user's of the gro API set the length correctly. If it is not, the
> payload to userspace would ultimately be non-sense and hence detectable.

Only if you use some kind of upper protocol adding message integrity
verification.

> >
> > As you said, user space sniffing packets now have to guess what is the
> > intent, instead of headers carrying all the needed information
> > that can be fully validated by parsers.
>
> This is a solveable problem within the packet socket API, and the entire
> thing is opt-in. If a user's tcpdump / packet capture program is out of
> date and does not support the new API for large packets, then that user
> does not have to enable large GRO/TSO.

I do not see this being solved yet.

We have enabled BIG TCP only for IPv6, we do not want the same to
magically be enabled for ipv4
when a new kernel is deployed.

Make sure that IPV4 BIG TCP is guarded by a separate tunable.

Note that our initial patches were adding IPv6 tunables for a reason.

We agreed to change them to IFLA_GRO_MAX_SIZE, IFLA_TSO_MAX_SIZE,
as long as they would not enable unwanted/untested behavior.
Xin Long Jan. 16, 2023, 7:09 p.m. UTC | #10
On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> >
>
> > not sure why you think it would not be detected. Today's model for gro
> > sets tot_len based on skb->len. There is an inherent trust that the
> > user's of the gro API set the length correctly. If it is not, the
> > payload to userspace would ultimately be non-sense and hence detectable.
>
> Only if you use some kind of upper protocol adding message integrity
> verification.
>
> > >
> > > As you said, user space sniffing packets now have to guess what is the
> > > intent, instead of headers carrying all the needed information
> > > that can be fully validated by parsers.
> >
> > This is a solveable problem within the packet socket API, and the entire
> > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > date and does not support the new API for large packets, then that user
> > does not have to enable large GRO/TSO.
>
> I do not see this being solved yet.
I think it's common that we add a feature that is disabled by
default in the kernel if the userspace might not support it.

>
> We have enabled BIG TCP only for IPv6, we do not want the same to
> magically be enabled for ipv4
> when a new kernel is deployed.
>
> Make sure that IPV4 BIG TCP is guarded by a separate tunable.
I can understand this.

Thanks.
Eric Dumazet Jan. 16, 2023, 8:37 p.m. UTC | #11
On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> > >
> >
> > > not sure why you think it would not be detected. Today's model for gro
> > > sets tot_len based on skb->len. There is an inherent trust that the
> > > user's of the gro API set the length correctly. If it is not, the
> > > payload to userspace would ultimately be non-sense and hence detectable.
> >
> > Only if you use some kind of upper protocol adding message integrity
> > verification.
> >
> > > >
> > > > As you said, user space sniffing packets now have to guess what is the
> > > > intent, instead of headers carrying all the needed information
> > > > that can be fully validated by parsers.
> > >
> > > This is a solveable problem within the packet socket API, and the entire
> > > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > > date and does not support the new API for large packets, then that user
> > > does not have to enable large GRO/TSO.
> >
> > I do not see this being solved yet.
> I think it's common that we add a feature that is disabled by
> default in the kernel if the userspace might not support it.

One critical feature for us is the ability to restrict packet capture
to the headers only.

Privacy is a key requirement.

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436

So please make sure that packet captures truncated to headers will
still be understood by tcpdump


>
> >
> > We have enabled BIG TCP only for IPv6, we do not want the same to
> > magically be enabled for ipv4
> > when a new kernel is deployed.
> >
> > Make sure that IPV4 BIG TCP is guarded by a separate tunable.
> I can understand this.
>
> Thanks.
Xin Long Jan. 17, 2023, 3:47 p.m. UTC | #12
On Mon, Jan 16, 2023 at 3:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> > > >
> > >
> > > > not sure why you think it would not be detected. Today's model for gro
> > > > sets tot_len based on skb->len. There is an inherent trust that the
> > > > user's of the gro API set the length correctly. If it is not, the
> > > > payload to userspace would ultimately be non-sense and hence detectable.
> > >
> > > Only if you use some kind of upper protocol adding message integrity
> > > verification.
> > >
> > > > >
> > > > > As you said, user space sniffing packets now have to guess what is the
> > > > > intent, instead of headers carrying all the needed information
> > > > > that can be fully validated by parsers.
> > > >
> > > > This is a solveable problem within the packet socket API, and the entire
> > > > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > > > date and does not support the new API for large packets, then that user
> > > > does not have to enable large GRO/TSO.
> > >
> > > I do not see this being solved yet.
> > I think it's common that we add a feature that is disabled by
> > default in the kernel if the userspace might not support it.
>
> One critical feature for us is the ability to restrict packet capture
> to the headers only.
>
> Privacy is a key requirement.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436
>
> So please make sure that packet captures truncated to headers will
> still be understood by tcpdump
IIUC we can try to adjust iph->tot_len to 65535 or at least the length
of all headers for IPv4 BIG TCP packets on the PACKET socket rcv
path? (or even truncate the ipv4 BIG TCP packets there.). This way
tcpdump will be able to parse all the headers.

But what if some applications want all the raw data via PACKET sockets?
Maybe adjust iph->tot_len only, not truncate the packet?

Thanks.
Xin Long Jan. 19, 2023, 1:18 a.m. UTC | #13
On Tue, Jan 17, 2023 at 10:47 AM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:37 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> > > > >
> > > >
> > > > > not sure why you think it would not be detected. Today's model for gro
> > > > > sets tot_len based on skb->len. There is an inherent trust that the
> > > > > user's of the gro API set the length correctly. If it is not, the
> > > > > payload to userspace would ultimately be non-sense and hence detectable.
> > > >
> > > > Only if you use some kind of upper protocol adding message integrity
> > > > verification.
> > > >
> > > > > >
> > > > > > As you said, user space sniffing packets now have to guess what is the
> > > > > > intent, instead of headers carrying all the needed information
> > > > > > that can be fully validated by parsers.
> > > > >
> > > > > This is a solveable problem within the packet socket API, and the entire
> > > > > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > > > > date and does not support the new API for large packets, then that user
> > > > > does not have to enable large GRO/TSO.
> > > >
> > > > I do not see this being solved yet.
> > > I think it's common that we add a feature that is disabled by
> > > default in the kernel if the userspace might not support it.
> >
> > One critical feature for us is the ability to restrict packet capture
> > to the headers only.
> >
> > Privacy is a key requirement.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436
> >
> > So please make sure that packet captures truncated to headers will
> > still be understood by tcpdump
> IIUC we can try to adjust iph->tot_len to 65535 or at least the length
> of all headers for IPv4 BIG TCP packets on the PACKET socket rcv
> path? (or even truncate the ipv4 BIG TCP packets there.). This way
> tcpdump will be able to parse all the headers.
>
> But what if some applications want all the raw data via PACKET sockets?
> Maybe adjust iph->tot_len only, not truncate the packet?
>
I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
my env (RHEL-8), and it breaks too:

19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]

it doesn't show what we want from the TCP header either.

For the latest tcpdump on upstream, it can display headers well for
IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
that supports HBH parsing.

For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
and 'tshark' even supports it by default.

I think we should NOT go with "adjust tot_len" or "truncate packets" way,
and it makes more sense to make it supported in "tcpdump" by default,
just like in "tshark". I believe commit [1] was added for some problems
they've met, we should enable it for both.

[1] https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18

Thanks.
Eric Dumazet Jan. 19, 2023, 3:13 a.m. UTC | #14
On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:

> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> my env (RHEL-8), and it breaks too:
>
> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>

Please make sure to use a not too old tcpdump.

> it doesn't show what we want from the TCP header either.
>
> For the latest tcpdump on upstream, it can display headers well for
> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> that supports HBH parsing.

User error. If an admin wants to diagnose TCP potential issues, it should use
a correct version.

>
> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> and 'tshark' even supports it by default.

Not with privacy _requirements_, where only the headers are captured.

I am keeping a NACK, until you make sure you do not break this
important feature.

>
> I think we should NOT go with "adjust tot_len" or "truncate packets" way,
> and it makes more sense to make it supported in "tcpdump" by default,
> just like in "tshark". I believe commit [1] was added for some problems
> they've met, we should enable it for both.
>
> [1] https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18
>
> Thanks.
David Ahern Jan. 19, 2023, 3:41 p.m. UTC | #15
On 1/18/23 8:13 PM, Eric Dumazet wrote:
> On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> 
>> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
>> my env (RHEL-8), and it breaks too:
>>
>> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>>
> 
> Please make sure to use a not too old tcpdump.
> 
>> it doesn't show what we want from the TCP header either.
>>
>> For the latest tcpdump on upstream, it can display headers well for
>> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
>> that supports HBH parsing.
> 
> User error. If an admin wants to diagnose TCP potential issues, it should use
> a correct version.

Both of those just fall under "if you want a new feature, update your
tools."


> 
>>
>> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
>> and 'tshark' even supports it by default.
> 
> Not with privacy _requirements_, where only the headers are captured.
> 
> I am keeping a NACK, until you make sure you do not break this
> important feature.

I think the request here is to keep the snaplen in place (e.g., to make
only headers visible to userspace) while also returning the >64kB packet
length as meta data.

My last pass on the packet socket code suggests this is possible;
someone (Xin) needs to work through the details.
Xin Long Jan. 19, 2023, 4:49 p.m. UTC | #16
On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> >
> >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> >> my env (RHEL-8), and it breaks too:
> >>
> >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >>
> >
> > Please make sure to use a not too old tcpdump.
> >
> >> it doesn't show what we want from the TCP header either.
> >>
> >> For the latest tcpdump on upstream, it can display headers well for
> >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> >> that supports HBH parsing.
> >
> > User error. If an admin wants to diagnose TCP potential issues, it should use
> > a correct version.
>
> Both of those just fall under "if you want a new feature, update your
> tools."
>
>
> >
> >>
> >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> >> and 'tshark' even supports it by default.
> >
> > Not with privacy _requirements_, where only the headers are captured.
> >
> > I am keeping a NACK, until you make sure you do not break this
> > important feature.
>
> I think the request here is to keep the snaplen in place (e.g., to make
> only headers visible to userspace) while also returning the >64kB packet
> length as meta data.
>
> My last pass on the packet socket code suggests this is possible;
> someone (Xin) needs to work through the details.
>
To be honest, I don't really like such a change in a packet socket,
I tried, and the code doesn't look nice.

I'm thinking since skb->len is trustable, why don't we use
IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
namely, only change these 2 helpers to:

static inline unsigned int iph_totlen(const struct sk_buff *skb, const
struct iphdr *iph)
{
        u16 len = ntohs(iph->tot_len);

        return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
                skb->len - skb_network_offset(skb);
}

static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
{
        iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
}

What do you think?
Eric Dumazet Jan. 19, 2023, 6:10 p.m. UTC | #17
On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > >> my env (RHEL-8), and it breaks too:
> > >>
> > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >>
> > >
> > > Please make sure to use a not too old tcpdump.
> > >
> > >> it doesn't show what we want from the TCP header either.
> > >>
> > >> For the latest tcpdump on upstream, it can display headers well for
> > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > >> that supports HBH parsing.
> > >
> > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > a correct version.
> >
> > Both of those just fall under "if you want a new feature, update your
> > tools."
> >
> >
> > >
> > >>
> > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > >> and 'tshark' even supports it by default.
> > >
> > > Not with privacy _requirements_, where only the headers are captured.
> > >
> > > I am keeping a NACK, until you make sure you do not break this
> > > important feature.
> >
> > I think the request here is to keep the snaplen in place (e.g., to make
> > only headers visible to userspace) while also returning the >64kB packet
> > length as meta data.
> >
> > My last pass on the packet socket code suggests this is possible;
> > someone (Xin) needs to work through the details.
> >
> To be honest, I don't really like such a change in a packet socket,
> I tried, and the code doesn't look nice.
>
> I'm thinking since skb->len is trustable, why don't we use
> IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> namely, only change these 2 helpers to:
>
> static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> struct iphdr *iph)
> {
>         u16 len = ntohs(iph->tot_len);
>
>         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
>                 skb->len - skb_network_offset(skb);
> }
>
> static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> {
>         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> }
>
> What do you think?

I think this is a no go for me.

I think I stated clearly what was the problem.
If you care about TCP diagnostics, you want the truth, not truncated
sequence ranges,
making it impossible to know if a packet was sent.

Without headers describing precisely payload length (solution taken in
IPv6 BI TCP),
you have to augment AF_PACKET to provide this information in
additional meta-data.
Xin Long Jan. 19, 2023, 6:57 p.m. UTC | #18
On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > > >
> > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > > >> my env (RHEL-8), and it breaks too:
> > > >>
> > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >>
> > > >
> > > > Please make sure to use a not too old tcpdump.
> > > >
> > > >> it doesn't show what we want from the TCP header either.
> > > >>
> > > >> For the latest tcpdump on upstream, it can display headers well for
> > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > > >> that supports HBH parsing.
> > > >
> > > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > > a correct version.
> > >
> > > Both of those just fall under "if you want a new feature, update your
> > > tools."
> > >
> > >
> > > >
> > > >>
> > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > > >> and 'tshark' even supports it by default.
> > > >
> > > > Not with privacy _requirements_, where only the headers are captured.
> > > >
> > > > I am keeping a NACK, until you make sure you do not break this
> > > > important feature.
> > >
> > > I think the request here is to keep the snaplen in place (e.g., to make
> > > only headers visible to userspace) while also returning the >64kB packet
> > > length as meta data.
> > >
> > > My last pass on the packet socket code suggests this is possible;
> > > someone (Xin) needs to work through the details.
> > >
> > To be honest, I don't really like such a change in a packet socket,
> > I tried, and the code doesn't look nice.
> >
> > I'm thinking since skb->len is trustable, why don't we use
> > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> > namely, only change these 2 helpers to:
> >
> > static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> > struct iphdr *iph)
> > {
> >         u16 len = ntohs(iph->tot_len);
> >
> >         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
> >                 skb->len - skb_network_offset(skb);
> > }
> >
> > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> > {
> >         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> > }
> >
> > What do you think?
>
> I think this is a no go for me.
>
> I think I stated clearly what was the problem.
> If you care about TCP diagnostics, you want the truth, not truncated
> sequence ranges,
> making it impossible to know if a packet was sent.
Sorry Eric if I didn't get you well.

With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535),
all TCP headers will display well, no truncated sequence ranges:

#  ip net exec router tcpdump -i link1
13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483
13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483
13:36:46.675542 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1532895495:1532960978, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483
13:36:46.675550 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1533021985:1533087468, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483

I just don't want to modify the iph tot_len in IPv4 header from the
raw data in the packet socket.
We're trying to avoid iph->tot_len too small for IPv4 BIG TCP to
display tcphdr in tcpdump, aren't we?
That's why I think using IP_MAX_MTU will avoid this.

>
> Without headers describing precisely payload length (solution taken in
> IPv6 BI TCP),
> you have to augment AF_PACKET to provide this information in
> additional meta-data.
For this, I agree to provide the >64kB packet into additional meta-data.

Thanks.
Eric Dumazet Jan. 19, 2023, 7:17 p.m. UTC | #19
On Thu, Jan 19, 2023 at 7:59 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > > >
> > > > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > > > >
> > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > > > >> my env (RHEL-8), and it breaks too:
> > > > >>
> > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >>
> > > > >
> > > > > Please make sure to use a not too old tcpdump.
> > > > >
> > > > >> it doesn't show what we want from the TCP header either.
> > > > >>
> > > > >> For the latest tcpdump on upstream, it can display headers well for
> > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > > > >> that supports HBH parsing.
> > > > >
> > > > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > > > a correct version.
> > > >
> > > > Both of those just fall under "if you want a new feature, update your
> > > > tools."
> > > >
> > > >
> > > > >
> > > > >>
> > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > > > >> and 'tshark' even supports it by default.
> > > > >
> > > > > Not with privacy _requirements_, where only the headers are captured.
> > > > >
> > > > > I am keeping a NACK, until you make sure you do not break this
> > > > > important feature.
> > > >
> > > > I think the request here is to keep the snaplen in place (e.g., to make
> > > > only headers visible to userspace) while also returning the >64kB packet
> > > > length as meta data.
> > > >
> > > > My last pass on the packet socket code suggests this is possible;
> > > > someone (Xin) needs to work through the details.
> > > >
> > > To be honest, I don't really like such a change in a packet socket,
> > > I tried, and the code doesn't look nice.
> > >
> > > I'm thinking since skb->len is trustable, why don't we use
> > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> > > namely, only change these 2 helpers to:
> > >
> > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> > > struct iphdr *iph)
> > > {
> > >         u16 len = ntohs(iph->tot_len);
> > >
> > >         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
> > >                 skb->len - skb_network_offset(skb);
> > > }
> > >
> > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> > > {
> > >         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> > > }
> > >
> > > What do you think?
> >
> > I think this is a no go for me.
> >
> > I think I stated clearly what was the problem.
> > If you care about TCP diagnostics, you want the truth, not truncated
> > sequence ranges,
> > making it impossible to know if a packet was sent.
> Sorry Eric if I didn't get you well.
>
> With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535),
> all TCP headers will display well, no truncated sequence ranges:
>
> #  ip net exec router tcpdump -i link1
> 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483
> 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483

This is completely truncated, don't you see this ?

According to tcpdump, we sent sequences 1532642515:1532707998 and
1532769005:1532834488

And payload was of  65483 bytes per packet (this is not true)

What happened for 1532707998 -> 1532769005 ???

How network engineers will know "oh wait, data was sent/received after all",
and not dropped somewhere in the network or in netfilter or ... in a kernel bug.

> 13:36:46.675542 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1532895495:1532960978, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483
> 13:36:46.675550 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1533021985:1533087468, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483
Xin Long Jan. 19, 2023, 7:30 p.m. UTC | #20
On Thu, Jan 19, 2023 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jan 19, 2023 at 7:59 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > > > >
> > > > > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > > > > >
> > > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > > > > >> my env (RHEL-8), and it breaks too:
> > > > > >>
> > > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >>
> > > > > >
> > > > > > Please make sure to use a not too old tcpdump.
> > > > > >
> > > > > >> it doesn't show what we want from the TCP header either.
> > > > > >>
> > > > > >> For the latest tcpdump on upstream, it can display headers well for
> > > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > > > > >> that supports HBH parsing.
> > > > > >
> > > > > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > > > > a correct version.
> > > > >
> > > > > Both of those just fall under "if you want a new feature, update your
> > > > > tools."
> > > > >
> > > > >
> > > > > >
> > > > > >>
> > > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > > > > >> and 'tshark' even supports it by default.
> > > > > >
> > > > > > Not with privacy _requirements_, where only the headers are captured.
> > > > > >
> > > > > > I am keeping a NACK, until you make sure you do not break this
> > > > > > important feature.
> > > > >
> > > > > I think the request here is to keep the snaplen in place (e.g., to make
> > > > > only headers visible to userspace) while also returning the >64kB packet
> > > > > length as meta data.
> > > > >
> > > > > My last pass on the packet socket code suggests this is possible;
> > > > > someone (Xin) needs to work through the details.
> > > > >
> > > > To be honest, I don't really like such a change in a packet socket,
> > > > I tried, and the code doesn't look nice.
> > > >
> > > > I'm thinking since skb->len is trustable, why don't we use
> > > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> > > > namely, only change these 2 helpers to:
> > > >
> > > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> > > > struct iphdr *iph)
> > > > {
> > > >         u16 len = ntohs(iph->tot_len);
> > > >
> > > >         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
> > > >                 skb->len - skb_network_offset(skb);
> > > > }
> > > >
> > > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> > > > {
> > > >         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> > > > }
> > > >
> > > > What do you think?
> > >
> > > I think this is a no go for me.
> > >
> > > I think I stated clearly what was the problem.
> > > If you care about TCP diagnostics, you want the truth, not truncated
> > > sequence ranges,
> > > making it impossible to know if a packet was sent.
> > Sorry Eric if I didn't get you well.
> >
> > With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535),
> > all TCP headers will display well, no truncated sequence ranges:
> >
> > #  ip net exec router tcpdump -i link1
> > 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> > seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val
> > 2975547125 ecr 2379476018], length 65483
> > 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> > seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val
> > 2975547125 ecr 2379476018], length 65483
>
> This is completely truncated, don't you see this ?
OK, got you now.
Thanks for the explanation.

>
> According to tcpdump, we sent sequences 1532642515:1532707998 and
> 1532769005:1532834488
>
> And payload was of  65483 bytes per packet (this is not true)
>
> What happened for 1532707998 -> 1532769005 ???
>
> How network engineers will know "oh wait, data was sent/received after all",
> and not dropped somewhere in the network or in netfilter or ... in a kernel bug.
>
I will work on providing the >64kB packet length in meta-data instead.

Thanks.
diff mbox series

Patch

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 37dfdcfcdd54..b8edd6c599eb 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -175,6 +175,15 @@  static inline bool inet6_is_jumbogram(const struct sk_buff *skb)
 	return !!(IP6CB(skb)->flags & IP6SKB_JUMBOGRAM);
 }
 
+static inline unsigned int skb_ipv6_totlen(const struct sk_buff *skb)
+{
+	u32 pl = ntohs(ipv6_hdr(skb)->payload_len);
+
+	return pl ? pl + sizeof(struct ipv6hdr)
+		  : (skb_is_gso_v6(skb) ? skb->len - skb_network_offset(skb)
+					: pl + sizeof(struct ipv6hdr));
+}
+
 /* can not be used in TCP layer after tcp_v6_fill_cb */
 static inline int inet6_sdif(const struct sk_buff *skb)
 {
diff --git a/net/netfilter/xt_length.c b/net/netfilter/xt_length.c
index b3d623a52885..61518ec05c6e 100644
--- a/net/netfilter/xt_length.c
+++ b/net/netfilter/xt_length.c
@@ -30,8 +30,7 @@  static bool
 length_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_length_info *info = par->matchinfo;
-	const u_int16_t pktlen = ntohs(ipv6_hdr(skb)->payload_len) +
-				 sizeof(struct ipv6hdr);
+	u32 pktlen = skb_ipv6_totlen(skb);
 
 	return (pktlen >= info->min && pktlen <= info->max) ^ info->invert;
 }