[RFC,net-next,01/15] psp: add documentation

Message ID	20240510030435.120935-2-kuba@kernel.org (mailing list archive)
State	RFC
Delegated to:	Netdev Maintainers
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B2BE2148844 for <netdev@vger.kernel.org>; Fri, 10 May 2024 03:04:45 +0000 (UTC) From: Jakub Kicinski <kuba@kernel.org> To: netdev@vger.kernel.org Cc: pabeni@redhat.com, willemdebruijn.kernel@gmail.com, borisp@nvidia.com, gal@nvidia.com, cratiu@nvidia.com, rrameshbabu@nvidia.com, steffen.klassert@secunet.com, tariqt@nvidia.com, Jakub Kicinski <kuba@kernel.org> Subject: [RFC net-next 01/15] psp: add documentation Date: Thu, 9 May 2024 20:04:21 -0700 Message-ID: <20240510030435.120935-2-kuba@kernel.org> In-Reply-To: <20240510030435.120935-1-kuba@kernel.org> References: <20240510030435.120935-1-kuba@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	add basic PSP encryption for TCP connections \| expand [RFC,net-next,00/15] add basic PSP encryption for TCP connections [RFC,net-next,01/15] psp: add documentation [RFC,net-next,02/15] psp: base PSP device support [RFC,net-next,03/15] net: modify core data structures for PSP datapath support [RFC,net-next,04/15] tcp: add datapath logic for PSP with inline key exchange [RFC,net-next,05/15] psp: add op for rotation of secret state [RFC,net-next,06/15] net: psp: add socket security association code [RFC,net-next,07/15] net: psp: update the TCP MSS to reflect PSP packet overhead [RFC,net-next,08/15] psp: track generations of secret state [RFC,net-next,09/15] net/mlx5e: Support PSP offload functionality [RFC,net-next,10/15] net/mlx5e: Implement PSP operations .assoc_add and .assoc_del [RFC,net-next,11/15] net/mlx5e: Implement PSP Tx data path [RFC,net-next,12/15] net/mlx5e: Add PSP steering in local NIC RX [RFC,net-next,13/15] net/mlx5e: Configure PSP Rx flow steering rules [RFC,net-next,14/15] net/mlx5e: Add Rx data path offload [RFC,net-next,15/15] net/mlx5e: Implement PSP key_rotate operation

Context	Check	Description
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Clearly marked for net-next, async
netdev/ynl	success	Generated files up to date; no warnings/errors; GEN HAS DIFF 2 files changed, 877 insertions(+);
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 931 this patch: 931
netdev/build_tools	success	No tools touched, skip
netdev/cc_maintainers	warning	3 maintainers not CCed: linux-doc@vger.kernel.org edumazet@google.com corbet@lwn.net
netdev/build_clang	success	Errors and warnings before: 937 this patch: 937
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 943 this patch: 943
netdev/checkpatch	warning	WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Jakub Kicinski May 10, 2024, 3:04 a.m. UTC

Add documentation of things which belong in the docs rather
than commit messages.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/networking/index.rst |   1 +
 Documentation/networking/psp.rst   | 138 +++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)
 create mode 100644 Documentation/networking/psp.rst

Saeed Mahameed May 10, 2024, 10:19 p.m. UTC | #1

On 09 May 20:04, Jakub Kicinski wrote:
>Add documentation of things which belong in the docs rather
>than commit messages.
>
>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>---
> Documentation/networking/index.rst |   1 +
> Documentation/networking/psp.rst   | 138 +++++++++++++++++++++++++++++
> 2 files changed, 139 insertions(+)
> create mode 100644 Documentation/networking/psp.rst
>
>diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
>index 7664c0bfe461..0376029ecbdf 100644
>--- a/Documentation/networking/index.rst
>+++ b/Documentation/networking/index.rst
>@@ -94,6 +94,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
>    ppp_generic
>    proc_net_tcp
>    pse-pd/index
>+   psp
>    radiotap-headers
>    rds
>    regulatory
>diff --git a/Documentation/networking/psp.rst b/Documentation/networking/psp.rst
>new file mode 100644
>index 000000000000..a39b464813ab
>--- /dev/null
>+++ b/Documentation/networking/psp.rst
>@@ -0,0 +1,138 @@
>+.. SPDX-License-Identifier: GPL-2.0-only
>+
>+=====================
>+PSP Security Protocol
>+=====================
>+
>+Protocol
>+========
>+
>+PSP Security Protocol (PSP) was defined at Google and published in:
>+
>+https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf
>+
>+This section briefly covers protocol aspects crucial for understanding
>+the kernel API. Refer to the protocol specification for further details.
>+
>+Note that the kernel implementation and documentation uses the term
>+"secret state" in place of "master key", it is both less confusing
>+to an average developer and is less likely to run afoul any naming
>+guidelines.
>+

[ ... ] 

>+User facing API
>+===============
>+
>+PSP is designed primarily for hardware offloads. There is currently
>+no software fallback for systems which do not have PSP capable NICs.
>+There is also no standard (or otherwise defined) way of establishing
>+a PSP-secured connection or exchanging the symmetric keys.
>+
>+The expectation is that higher layer protocols will take care of
>+protocol and key negotiation. For example one may use TLS key exchange,
>+announce the PSP capability, and switch to PSP if both endpoints
>+are PSP-capable.
>+

The documentation doesn't include anything about userspace, other than
highlevel remarks on how this is expected to work.
What are we planning for userspace? I know we have kperf basic support and
some experimental python library, but nothing official or psp centric. 

I propose to start community driven project with a well established
library, with some concrete sample implementation for key negotiation,
as a plugin maybe, so anyone can implement their own key-exchange
mechanisms on top of the official psp library.

Jakub Kicinski May 11, 2024, 12:11 a.m. UTC | #2

On Fri, 10 May 2024 15:19:23 -0700 Saeed Mahameed wrote:
> >+PSP is designed primarily for hardware offloads. There is currently
> >+no software fallback for systems which do not have PSP capable NICs.
> >+There is also no standard (or otherwise defined) way of establishing
> >+a PSP-secured connection or exchanging the symmetric keys.
> >+
> >+The expectation is that higher layer protocols will take care of
> >+protocol and key negotiation. For example one may use TLS key exchange,
> >+announce the PSP capability, and switch to PSP if both endpoints
> >+are PSP-capable.
> 
> The documentation doesn't include anything about userspace, other than
> highlevel remarks on how this is expected to work.

The cover letter does.

> What are we planning for userspace? I know we have kperf basic support and
> some experimental python library, but nothing official or psp centric. 

Remind me, how long did it take for kernel TLS support to be merged
into OpenSSL? ;)

> I propose to start community driven project with a well established
> library, with some concrete sample implementation for key negotiation,
> as a plugin maybe, so anyone can implement their own key-exchange
> mechanisms on top of the official psp library.

Yes, I should have CCed Meta's folks who work on TLS [1]. Adding them
now. More than happy to facilitate the discussion, maybe Willem can
CC the right Google folks, IDK who else...

We should start moving with the kernel support, IMO, until we do 
the user space implementation is stalled. I don't expect that the
way we install keys in the kernel would be impacted by the handshake.

[1] https://github.com/facebookincubator/fizz

Vadim Fedorenko May 11, 2024, 9:41 a.m. UTC | #3

On 11.05.2024 01:11, Jakub Kicinski wrote:
> On Fri, 10 May 2024 15:19:23 -0700 Saeed Mahameed wrote:
>>> +PSP is designed primarily for hardware offloads. There is currently
>>> +no software fallback for systems which do not have PSP capable NICs.
>>> +There is also no standard (or otherwise defined) way of establishing
>>> +a PSP-secured connection or exchanging the symmetric keys.
>>> +
>>> +The expectation is that higher layer protocols will take care of
>>> +protocol and key negotiation. For example one may use TLS key exchange,
>>> +announce the PSP capability, and switch to PSP if both endpoints
>>> +are PSP-capable.
>>
>> The documentation doesn't include anything about userspace, other than
>> highlevel remarks on how this is expected to work.
> 
> The cover letter does.
> 
>> What are we planning for userspace? I know we have kperf basic support and
>> some experimental python library, but nothing official or psp centric.
> 
> Remind me, how long did it take for kernel TLS support to be merged
> into OpenSSL? ;)

I believe it was bad timing for OpenSSL. The patches with kTLS support were
sitting in the main branch for long time, the problem was postponed release with
with jump to new versioning schema.

But I agree, there is no easy way to start coding user-space lib without initial
support from kernel.

>> I propose to start community driven project with a well established
>> library, with some concrete sample implementation for key negotiation,
>> as a plugin maybe, so anyone can implement their own key-exchange
>> mechanisms on top of the official psp library.
> 
> Yes, I should have CCed Meta's folks who work on TLS [1]. Adding them
> now. More than happy to facilitate the discussion, maybe Willem can
> CC the right Google folks, IDK who else...
> 
> We should start moving with the kernel support, IMO, until we do
> the user space implementation is stalled. I don't expect that the
> way we install keys in the kernel would be impacted by the handshake.
> 
> [1] https://github.com/facebookincubator/fizz

David Ahern May 11, 2024, 4:25 p.m. UTC | #4

On 5/11/24 3:41 AM, Vadim Fedorenko wrote:
> But I agree, there is no easy way to start coding user-space lib without
> initial
> support from kernel.


The 2 sides should be co-developed; it is the only way to merge a sane uapi.

Willem de Bruijn May 13, 2024, 1:24 a.m. UTC | #5

Jakub Kicinski wrote:
> Add documentation of things which belong in the docs rather
> than commit messages.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  Documentation/networking/index.rst |   1 +
>  Documentation/networking/psp.rst   | 138 +++++++++++++++++++++++++++++
>  2 files changed, 139 insertions(+)
>  create mode 100644 Documentation/networking/psp.rst
> 
> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> index 7664c0bfe461..0376029ecbdf 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -94,6 +94,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
>     ppp_generic
>     proc_net_tcp
>     pse-pd/index
> +   psp
>     radiotap-headers
>     rds
>     regulatory
> diff --git a/Documentation/networking/psp.rst b/Documentation/networking/psp.rst
> new file mode 100644
> index 000000000000..a39b464813ab
> --- /dev/null
> +++ b/Documentation/networking/psp.rst
> @@ -0,0 +1,138 @@
> +.. SPDX-License-Identifier: GPL-2.0-only
> +
> +=====================
> +PSP Security Protocol
> +=====================
> +
> +Protocol
> +========
> +
> +PSP Security Protocol (PSP) was defined at Google and published in:
> +
> +https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf
> +
> +This section briefly covers protocol aspects crucial for understanding
> +the kernel API. Refer to the protocol specification for further details.
> +
> +Note that the kernel implementation and documentation uses the term
> +"secret state" in place of "master key", it is both less confusing
> +to an average developer and is less likely to run afoul any naming
> +guidelines.

There is some value in using the same terminology in the code as in
the spec.

And the session keys are derived from a key. That is more precise than
state. Specifically, counter-mode KDF from an AES key.

Perhaps device key, instead of master key? 

> +Derived Rx keys
> +---------------
> +
> +PSP borrows some terms and mechanisms from IPsec. PSP was designed
> +with HW offloads in mind. The key feature of PSP is that Rx keys for every
> +connection do not have to be stored by the receiver but can be derived
> +from secret state and information present in packet headers.

A second less obvious, but neat, feature is that it supports an
encryption offset, such that (say) the L4 ports are integrity
protected, but not encrypted, to allow for in-network telemetry.

> +This makes it possible to implement receivers which require a constant
> +amount of memory regardless of the number of connections (``O(1)`` scaling).
> +
> +Tx keys have to be stored like with any other protocol,

Keys can optionally be passed in descriptor.

> +The expectation is that higher layer protocols will take care of
> +protocol and key negotiation. For example one may use TLS key exchange,
> +announce the PSP capability, and switch to PSP if both endpoints
> +are PSP-capable.

> +Securing a connection
> +---------------------
> +
> +PSP encryption is currently only supported for TCP connections.
> +Rx and Tx keys are allocated separately. First the ``rx-assoc``
> +Netlink command needs to be issued, specifying a target TCP socket.
> +Kernel will allocate a new PSP Rx key from the NIC and associate it
> +with given socket. At this stage socket will accept both PSP-secured
> +and plain text TCP packets.
> +
> +Tx keys are installed using the ``tx-assoc`` Netlink command.
> +Once the Tx keys are installed all data read from the socket will
> +be PSP-secured. In other words act of installing Tx keys has the secondary
> +effect on the Rx direction, requring all received packets to be encrypted.

Consider clarifying the entire state diagram from when one pair
initiates upgrade.

And some edge cases:

- retransmits
- TCP fin handshake, if only one peer succeeds
- TCP control socket response to encrypted pkt

What is the expectation for data already queued for transmission when
the tx assocation is made?

More generally, what happens for data in flight. One possible
simplification is to only allow an upgrade sequence (possibly
including in-band exchange of keys) when no other data is in
flight.

> +Since packet reception is asynchronous, to make it possible for the
> +application to trust that any data read from the socket after the ``tx-assoc``
> +call returns success has been encrypted, the kernel will scan the receive
> +queue of the socket at ``tx-assoc`` time. If any enqueued packet was received
> +in clear text the Tx association will fail, and application should retry
> +installing the Tx key after draining the socket (this should not be necessary
> +if both endpoints are well behaved).
> +
> +Rotation notifications
> +----------------------
> +
> +The rotations of secret state happen asynchornously and are usually

typo: asynchronously

> +performed by management daemons, not under application control.
> +The PSP netlink family will generate a notification whenever keys
> +are rotated. The applications are expected to re-establish connections
> +before keys are rotated again.

Connection key rotation is not supported? I did notice that tx key
insertion fails if a key is already present, so this does appear to be
the behavior.

Jakub Kicinski May 29, 2024, 5:35 p.m. UTC | #6

On Sun, 12 May 2024 21:24:23 -0400 Willem de Bruijn wrote:
> Jakub Kicinski wrote:
> > +PSP Security Protocol (PSP) was defined at Google and published in:
> > +
> > +https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf
> > +
> > +This section briefly covers protocol aspects crucial for understanding
> > +the kernel API. Refer to the protocol specification for further details.
> > +
> > +Note that the kernel implementation and documentation uses the term
> > +"secret state" in place of "master key", it is both less confusing
> > +to an average developer and is less likely to run afoul any naming
> > +guidelines.  
> 
> There is some value in using the same terminology in the code as in
> the spec.
> 
> And the session keys are derived from a key. That is more precise than
> state. Specifically, counter-mode KDF from an AES key.
> 
> Perhaps device key, instead of master key? 

Weak preference towards secret state, but device key works, too.

> > +Derived Rx keys
> > +---------------
> > +
> > +PSP borrows some terms and mechanisms from IPsec. PSP was designed
> > +with HW offloads in mind. The key feature of PSP is that Rx keys for every
> > +connection do not have to be stored by the receiver but can be derived
> > +from secret state and information present in packet headers.  
> 
> A second less obvious, but neat, feature is that it supports an
> encryption offset, such that (say) the L4 ports are integrity
> protected, but not encrypted, to allow for in-network telemetry.

I know, but the opening paragraph has:

   This section briefly covers protocol aspects crucial for
   understanding the kernel API. Refer to the protocol specification for further details.

:) .. and I didn't implement the offset, yet. (It's trivial to add and
ETOOMANYPATCHES.)

> > +This makes it possible to implement receivers which require a constant
> > +amount of memory regardless of the number of connections (``O(1)`` scaling).
> > +
> > +Tx keys have to be stored like with any other protocol,  
> 
> Keys can optionally be passed in descriptor.

Added: Preferably, the Tx keys should be provided with the packet (e.g.
as part of the descriptors).

> > +The expectation is that higher layer protocols will take care of
> > +protocol and key negotiation. For example one may use TLS key exchange,
> > +announce the PSP capability, and switch to PSP if both endpoints
> > +are PSP-capable.  
> 
> > +Securing a connection
> > +---------------------
> > +
> > +PSP encryption is currently only supported for TCP connections.
> > +Rx and Tx keys are allocated separately. First the ``rx-assoc``
> > +Netlink command needs to be issued, specifying a target TCP socket.
> > +Kernel will allocate a new PSP Rx key from the NIC and associate it
> > +with given socket. At this stage socket will accept both PSP-secured
> > +and plain text TCP packets.
> > +
> > +Tx keys are installed using the ``tx-assoc`` Netlink command.
> > +Once the Tx keys are installed all data read from the socket will
> > +be PSP-secured. In other words act of installing Tx keys has the secondary
> > +effect on the Rx direction, requring all received packets to be encrypted.  
> 
> Consider clarifying the entire state diagram from when one pair
> initiates upgrade.

Not sure about state diagram, there are only 3 states. Or do you mean
extend TCP state diagrams? I think a table may be better:

Event         | Normal TCP      | Rx PSP key present | Tx PSP key present |
---------------------------------------------------------------------------
Rx plain text | accept          | accept             | drop               |

Rx PSP (good) | drop            | accept             | accept             |

Rx PSP (bad)  | drop            | drop               | drop               |

Tx            | plain text      | plain text         | encrypted *        |

* data enqueued before Tx key in installed will not be encrypted
  (either initial send nor retranmissions)


What should I add?

> And some edge cases:
> 
> - retransmits
> - TCP fin handshake, if only one peer succeeds

So FIN when one end is "locked down" and the other isn't?

> - TCP control socket response to encrypted pkt

Control sock ignores PSP.

> What is the expectation for data already queued for transmission when
> the tx assocation is made?
> 
> More generally, what happens for data in flight. One possible
> simplification is to only allow an upgrade sequence (possibly
> including in-band exchange of keys) when no other data is in
> flight.

Like TLS offload, the data is annotated "for encryption" when queued.
So data queued earlier or retransmits of such data will never be
encrypted.

> > +performed by management daemons, not under application control.
> > +The PSP netlink family will generate a notification whenever keys
> > +are rotated. The applications are expected to re-establish connections
> > +before keys are rotated again.  
> 
> Connection key rotation is not supported? I did notice that tx key
> insertion fails if a key is already present, so this does appear to be
> the behavior.

Correct, for now connections need to be re-established once a day.
Rx should be easy, Tx we can make easy by only supporting rotation
when there's no data queued.

Willem de Bruijn May 30, 2024, 12:47 a.m. UTC | #7

Jakub Kicinski wrote:
> On Sun, 12 May 2024 21:24:23 -0400 Willem de Bruijn wrote:
> > Jakub Kicinski wrote:
> > > +PSP Security Protocol (PSP) was defined at Google and published in:
> > > +
> > > +https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf
> > > +
> > > +This section briefly covers protocol aspects crucial for understanding
> > > +the kernel API. Refer to the protocol specification for further details.
> > > +
> > > +Note that the kernel implementation and documentation uses the term
> > > +"secret state" in place of "master key", it is both less confusing
> > > +to an average developer and is less likely to run afoul any naming
> > > +guidelines.  
> > 
> > There is some value in using the same terminology in the code as in
> > the spec.
> > 
> > And the session keys are derived from a key. That is more precise than
> > state. Specifically, counter-mode KDF from an AES key.
> > 
> > Perhaps device key, instead of master key? 
> 
> Weak preference towards secret state, but device key works, too.

Totally your choice. I just wanted to make sure this was considered.
 
> > > +Derived Rx keys
> > > +---------------
> > > +
> > > +PSP borrows some terms and mechanisms from IPsec. PSP was designed
> > > +with HW offloads in mind. The key feature of PSP is that Rx keys for every
> > > +connection do not have to be stored by the receiver but can be derived
> > > +from secret state and information present in packet headers.  
> > 
> > A second less obvious, but neat, feature is that it supports an
> > encryption offset, such that (say) the L4 ports are integrity
> > protected, but not encrypted, to allow for in-network telemetry.
> 
> I know, but the opening paragraph has:
> 
>    This section briefly covers protocol aspects crucial for
>    understanding the kernel API. Refer to the protocol specification for further details.
> 
> :) .. and I didn't implement the offset, yet. (It's trivial to add and
> ETOOMANYPATCHES.)

Ack, sounds good.

> 
> > > +This makes it possible to implement receivers which require a constant
> > > +amount of memory regardless of the number of connections (``O(1)`` scaling).
> > > +
> > > +Tx keys have to be stored like with any other protocol,  
> > 
> > Keys can optionally be passed in descriptor.
> 
> Added: Preferably, the Tx keys should be provided with the packet (e.g.
> as part of the descriptors).
> 
> > > +The expectation is that higher layer protocols will take care of
> > > +protocol and key negotiation. For example one may use TLS key exchange,
> > > +announce the PSP capability, and switch to PSP if both endpoints
> > > +are PSP-capable.  
> > 
> > > +Securing a connection
> > > +---------------------
> > > +
> > > +PSP encryption is currently only supported for TCP connections.
> > > +Rx and Tx keys are allocated separately. First the ``rx-assoc``
> > > +Netlink command needs to be issued, specifying a target TCP socket.
> > > +Kernel will allocate a new PSP Rx key from the NIC and associate it
> > > +with given socket. At this stage socket will accept both PSP-secured
> > > +and plain text TCP packets.
> > > +
> > > +Tx keys are installed using the ``tx-assoc`` Netlink command.
> > > +Once the Tx keys are installed all data read from the socket will
> > > +be PSP-secured. In other words act of installing Tx keys has the secondary
> > > +effect on the Rx direction, requring all received packets to be encrypted.  
> > 
> > Consider clarifying the entire state diagram from when one pair
> > initiates upgrade.
> 
> Not sure about state diagram, there are only 3 states. Or do you mean
> extend TCP state diagrams? I think a table may be better:
> 
> Event         | Normal TCP      | Rx PSP key present | Tx PSP key present |
> ---------------------------------------------------------------------------
> Rx plain text | accept          | accept             | drop               |
> 
> Rx PSP (good) | drop            | accept             | accept             |
> 
> Rx PSP (bad)  | drop            | drop               | drop               |
> 
> Tx            | plain text      | plain text         | encrypted *        |
> 
> * data enqueued before Tx key in installed will not be encrypted
>   (either initial send nor retranmissions)
> 
> 
> What should I add?

I've mostly been concerned about the below edge cases.

If both peers are in TCP_ESTABLISHED for the during of the upgrade,
and data is aligned on message boundary, things are straightforward.

The retransmit logic is clear, as this is controlled by skb->decrypted
on the individual skbs on the retransmit queue.

That also solves another edge case: skb geometry changes on retransmit
(due to different MSS or segs, using tcp_fragment, tso_fragment,
tcp_retrans_try_collapse, ..) maintain skb->decrypted. It's not
possible that skb is accidentally created that combines plaintext and
ciphertext content.

Although.. does this require adding that skb->decrypted check to
tcp_skb_can_collapse?
 
> > And some edge cases:
> > 
> > - retransmits
> > - TCP fin handshake, if only one peer succeeds
> 
> So FIN when one end is "locked down" and the other isn't?

If one peer can enter the state where it drops all plaintext, while
the other decides to close the connection before completing the
upgrade, and thus sends a plaintext FIN.

If (big if) that can happen, then the connection cannot be cleanly
closed.
 
> > - TCP control socket response to encrypted pkt
> 
> Control sock ignores PSP.

Another example where a peer stays open and stays retrying if it has
upgraded and drops all plaintext.

Jakub Kicinski May 30, 2024, 7:51 p.m. UTC | #8

On Wed, 29 May 2024 20:47:02 -0400 Willem de Bruijn wrote:
> Jakub Kicinski wrote:
> > On Sun, 12 May 2024 21:24:23 -0400 Willem de Bruijn wrote:  
> > > There is some value in using the same terminology in the code as in
> > > the spec.
> > > 
> > > And the session keys are derived from a key. That is more precise than
> > > state. Specifically, counter-mode KDF from an AES key.
> > > 
> > > Perhaps device key, instead of master key?   
> > 
> > Weak preference towards secret state, but device key works, too.  
> 
> Totally your choice. I just wanted to make sure this was considered.

Already run the sed, device key it is :)

> > > Consider clarifying the entire state diagram from when one pair
> > > initiates upgrade.  
> > 
> > Not sure about state diagram, there are only 3 states. Or do you mean
> > extend TCP state diagrams? I think a table may be better:
> > 
> > Event         | Normal TCP      | Rx PSP key present | Tx PSP key present |
> > ---------------------------------------------------------------------------
> > Rx plain text | accept          | accept             | drop               |
> > 
> > Rx PSP (good) | drop            | accept             | accept             |
> > 
> > Rx PSP (bad)  | drop            | drop               | drop               |
> > 
> > Tx            | plain text      | plain text         | encrypted *        |
> > 
> > * data enqueued before Tx key in installed will not be encrypted
> >   (either initial send nor retranmissions)
> > 
> > 
> > What should I add?  
> 
> I've mostly been concerned about the below edge cases.
> 
> If both peers are in TCP_ESTABLISHED for the during of the upgrade,
> and data is aligned on message boundary, things are straightforward.
> 
> The retransmit logic is clear, as this is controlled by skb->decrypted
> on the individual skbs on the retransmit queue.
> 
> That also solves another edge case: skb geometry changes on retransmit
> (due to different MSS or segs, using tcp_fragment, tso_fragment,
> tcp_retrans_try_collapse, ..) maintain skb->decrypted. It's not
> possible that skb is accidentally created that combines plaintext and
> ciphertext content.
> 
> Although.. does this require adding that skb->decrypted check to
> tcp_skb_can_collapse?

Good catch. The TLS checks predate tcp_skb_can_collapse() (and MPTCP).
We've grown the check in tcp_shift_skb_data() and the logic
in tcp_grow_skb(), both missing the decrypted check.

I'll send some fixes, these are existing bugs :(

> > > And some edge cases:
> > > 
> > > - retransmits
> > > - TCP fin handshake, if only one peer succeeds  
> > 
> > So FIN when one end is "locked down" and the other isn't?  
> 
> If one peer can enter the state where it drops all plaintext, while
> the other decides to close the connection before completing the
> upgrade, and thus sends a plaintext FIN.
> 
> If (big if) that can happen, then the connection cannot be cleanly
> closed.

Hm. And we can avoid this by only enforcing encryption of data-less
segments once we've seen some encrypted data?

> > > - TCP control socket response to encrypted pkt  
> > 
> > Control sock ignores PSP.  
> 
> Another example where a peer stays open and stays retrying if it has
> upgraded and drops all plaintext.

Jakub Kicinski May 30, 2024, 8:15 p.m. UTC | #9

On Thu, 30 May 2024 12:51:20 -0700 Jakub Kicinski wrote:
> > I've mostly been concerned about the below edge cases.
> > 
> > If both peers are in TCP_ESTABLISHED for the during of the upgrade,
> > and data is aligned on message boundary, things are straightforward.
> > 
> > The retransmit logic is clear, as this is controlled by skb->decrypted
> > on the individual skbs on the retransmit queue.
> > 
> > That also solves another edge case: skb geometry changes on retransmit
> > (due to different MSS or segs, using tcp_fragment, tso_fragment,
> > tcp_retrans_try_collapse, ..) maintain skb->decrypted. It's not
> > possible that skb is accidentally created that combines plaintext and
> > ciphertext content.
> > 
> > Although.. does this require adding that skb->decrypted check to
> > tcp_skb_can_collapse?  
> 
> Good catch. The TLS checks predate tcp_skb_can_collapse() (and MPTCP).
> We've grown the check in tcp_shift_skb_data() and the logic
> in tcp_grow_skb(), both missing the decrypted check.
> 
> I'll send some fixes, these are existing bugs :(

I take that back, we can depend on EOR like TLS does.

Willem de Bruijn May 30, 2024, 9:03 p.m. UTC | #10

Jakub Kicinski wrote:
> On Thu, 30 May 2024 12:51:20 -0700 Jakub Kicinski wrote:
> > > I've mostly been concerned about the below edge cases.
> > > 
> > > If both peers are in TCP_ESTABLISHED for the during of the upgrade,
> > > and data is aligned on message boundary, things are straightforward.
> > > 
> > > The retransmit logic is clear, as this is controlled by skb->decrypted
> > > on the individual skbs on the retransmit queue.
> > > 
> > > That also solves another edge case: skb geometry changes on retransmit
> > > (due to different MSS or segs, using tcp_fragment, tso_fragment,
> > > tcp_retrans_try_collapse, ..) maintain skb->decrypted. It's not
> > > possible that skb is accidentally created that combines plaintext and
> > > ciphertext content.
> > > 
> > > Although.. does this require adding that skb->decrypted check to
> > > tcp_skb_can_collapse?  
> > 
> > Good catch. The TLS checks predate tcp_skb_can_collapse() (and MPTCP).
> > We've grown the check in tcp_shift_skb_data() and the logic
> > in tcp_grow_skb(), both missing the decrypted check.
> > 
> > I'll send some fixes, these are existing bugs :(
> 
> I take that back, we can depend on EOR like TLS does.

Oh yes. Neat solution.

This relies on userspace doing the right thing by passing MSG_EOR
right? That is easy to get wrong. Should we still add a check or
a WARN_ONCE. That would be net-next material.

Willem de Bruijn May 31, 2024, 1:56 p.m. UTC | #11

Jakub Kicinski wrote:
> On Wed, 29 May 2024 20:47:02 -0400 Willem de Bruijn wrote:
> > Jakub Kicinski wrote:
> > > On Sun, 12 May 2024 21:24:23 -0400 Willem de Bruijn wrote:  
> > > > There is some value in using the same terminology in the code as in
> > > > the spec.
> > > > 
> > > > And the session keys are derived from a key. That is more precise than
> > > > state. Specifically, counter-mode KDF from an AES key.
> > > > 
> > > > Perhaps device key, instead of master key?   
> > > 
> > > Weak preference towards secret state, but device key works, too.  
> > 
> > Totally your choice. I just wanted to make sure this was considered.
> 
> Already run the sed, device key it is :)
> 
> > > > Consider clarifying the entire state diagram from when one pair
> > > > initiates upgrade.  
> > > 
> > > Not sure about state diagram, there are only 3 states. Or do you mean
> > > extend TCP state diagrams? I think a table may be better:
> > > 
> > > Event         | Normal TCP      | Rx PSP key present | Tx PSP key present |
> > > ---------------------------------------------------------------------------
> > > Rx plain text | accept          | accept             | drop               |
> > > 
> > > Rx PSP (good) | drop            | accept             | accept             |
> > > 
> > > Rx PSP (bad)  | drop            | drop               | drop               |
> > > 
> > > Tx            | plain text      | plain text         | encrypted *        |
> > > 
> > > * data enqueued before Tx key in installed will not be encrypted
> > >   (either initial send nor retranmissions)
> > > 
> > > 
> > > What should I add?  
> > 
> > I've mostly been concerned about the below edge cases.
> > 
> > If both peers are in TCP_ESTABLISHED for the during of the upgrade,
> > and data is aligned on message boundary, things are straightforward.
> > 
> > The retransmit logic is clear, as this is controlled by skb->decrypted
> > on the individual skbs on the retransmit queue.
> > 
> > That also solves another edge case: skb geometry changes on retransmit
> > (due to different MSS or segs, using tcp_fragment, tso_fragment,
> > tcp_retrans_try_collapse, ..) maintain skb->decrypted. It's not
> > possible that skb is accidentally created that combines plaintext and
> > ciphertext content.
> > 
> > Although.. does this require adding that skb->decrypted check to
> > tcp_skb_can_collapse?
> 
> Good catch. The TLS checks predate tcp_skb_can_collapse() (and MPTCP).
> We've grown the check in tcp_shift_skb_data() and the logic
> in tcp_grow_skb(), both missing the decrypted check.
> 
> I'll send some fixes, these are existing bugs :(
> 
> > > > And some edge cases:
> > > > 
> > > > - retransmits
> > > > - TCP fin handshake, if only one peer succeeds  
> > > 
> > > So FIN when one end is "locked down" and the other isn't?  
> > 
> > If one peer can enter the state where it drops all plaintext, while
> > the other decides to close the connection before completing the
> > upgrade, and thus sends a plaintext FIN.
> > 
> > If (big if) that can happen, then the connection cannot be cleanly
> > closed.
> 
> Hm. And we can avoid this by only enforcing encryption of data-less
> segments once we've seen some encrypted data?

That would help. It may also be needed to accept a pure ACK right at
the upgrade seqno. Depends on the upgrade process.

Which may be worth documenting explicitly: the system call and network
packet exchange from when one peer initiates (by generating its local
key) until the connection is fully encrypted. That also allows poking
at the various edge cases that may happen if packets are lost, or when
actions can race.

One unexpected example of the latter that I came across was Tx SADB
key insertion in tail edge cases taking longer than network RTT, for
instance.

The kernel API can be exercised in a variety of ways, not all of them
will uphold the correctness. Documenting how it should be used should
help.

Even better when it reduces the option space. As it already does by
failing a Tx key install until Rx is configured.

> > > > - TCP control socket response to encrypted pkt  
> > > 
> > > Control sock ignores PSP.  
> > 
> > Another example where a peer stays open and stays retrying if it has
> > upgraded and drops all plaintext.

May want to always allow plaintext RSTs. This is a potential DoS
vector. In all these cases, I suppose this has already been figured
out for TLS.

Jakub Kicinski June 5, 2024, 12:08 a.m. UTC | #12

On Fri, 31 May 2024 09:56:42 -0400 Willem de Bruijn wrote:
> > > If one peer can enter the state where it drops all plaintext, while
> > > the other decides to close the connection before completing the
> > > upgrade, and thus sends a plaintext FIN.
> > > 
> > > If (big if) that can happen, then the connection cannot be cleanly
> > > closed.  
> > 
> > Hm. And we can avoid this by only enforcing encryption of data-less
> > segments once we've seen some encrypted data?  
> 
> That would help. It may also be needed to accept a pure ACK right at
> the upgrade seqno. Depends on the upgrade process.
> 
> Which may be worth documenting explicitly: the system call and network
> packet exchange from when one peer initiates (by generating its local
> key) until the connection is fully encrypted. That also allows poking
> at the various edge cases that may happen if packets are lost, or when
> actions can race.

Dunno if the format below is good, but you're very right.
At least to me writing the diagram was an hour well spent :)

> One unexpected example of the latter that I came across was Tx SADB
> key insertion in tail edge cases taking longer than network RTT, for
> instance.
> 
> The kernel API can be exercised in a variety of ways, not all of them
> will uphold the correctness. Documenting how it should be used should
> help.
> 
> Even better when it reduces the option space. As it already does by
> failing a Tx key install until Rx is configured.

Something along these lines?

"Sequence" diagram of the worst case scenario:

01 p       Host A                         Host B
02 l t        ~~~~~~~~~~~[TCP 3 WHS]~~~~~~~~~~
03 a e        ~~~~~~[crypto negotiation]~~~~~~
04 i x                             [Rx key alloc = K-B]
05 n t                          <--- [app] K-B key send 
06 ------[Rx key alloc = K-A]-
07     [app] K-A key send -->|
08        [TCP] K-B input <-----
08 P      [TCP] K-B ACK ---->|
09 S R [app] recv(K-B)       |
10 P x [app] [Tx key set]    |  
11 -------------------------- 
12 P T [app] send(RPC) #####>|   
13 S x                       |<----    [TCP] Seq OoO! queue RPC, SACK
14 P      [TCP] retr K-A --->|   
15                           |  `->    [TCP] K-A input
16                           | <---    [TCP] K-A ACK (or FIN) 
17                           |      [app] recv(K-A)
18                           |      [app] [Tx key set]
19                            -----------------------------------
20

There is a causal dependency between Host B allocating the key (line 4),
sending it (line 5) and Host A receiving it (line 8). Since Host B will
accept PSP packets as soon as it allocated the key, Host A does not
need to wait to start using the key (line 12). Host B will queue the
RPC to the socket (line 13).

[Problem #1]

However, because Host B does not have a Tx key, the ACK / SACK packet
(line 13) will not be encrypted. (Similarly if Host B decided to close
the connection at this point, the resulting FIN packet would not be
encrypted.) Host B needs to accept unencrypted non-data segments 
(pure acks, pure FIN) until it sees an encrypted packet from Host B.

[Problem #2]

The retansmissions of K-A are unencrypted, to avoid sending the same
data in encrypted and unencrypted form. This poses a risk if an ACK
gets lost but both hosts end up in the PSP Tx state. Assume that Host A
did not send the RPC (line 12), and the retransmission (line 14)
happens as an RTO or TLP. Host B may already reach PSP Tx state (line
"20") and expect encrypted data. Plain text retransmissions (with
sequence number before rcv_nxt) must be accepted until Host B sees
encrypted data from Host A.


With that I think the state machine needs to be amended:

Event          | Normal TCP  | Rx PSP      | Tx PSP      | PSP full    |
-----------------------------------------------------------------------
Rx plain (new) | accept      | accept      | drop        | drop        |

Rx plain       | accept      | accept      | accept      | drop        |
(ACK|FIN|rtx)  |             |             |             |             |

Rx PSP (good)  | drop        | accept      | accept      | accept      |

Rx PSP (bad    | drop        | drop        | drop        | drop        |
(crypt, !=SPI) |             |             |             |             |

Tx             | plain text  | plain text  | encrypted   | encrypted   |
               |             |             | (excl. rtx) | (excl. rtx) |

> > > Another example where a peer stays open and stays retrying if it has
> > > upgraded and drops all plaintext.  
> 
> May want to always allow plaintext RSTs. This is a potential DoS
> vector.

Because of key exhaustion? Or we can be tricked into spamming someone
with retranmissions and ignoring their RST?

> In all these cases, I suppose this has already been figured
> out for TLS.

Assuming the answer above is "key exhaustion" - I wouldn't be surprised
if it wasn't :(

Willem de Bruijn June 5, 2024, 8:11 p.m. UTC | #13

Jakub Kicinski wrote:
> On Fri, 31 May 2024 09:56:42 -0400 Willem de Bruijn wrote:
> > > > If one peer can enter the state where it drops all plaintext, while
> > > > the other decides to close the connection before completing the
> > > > upgrade, and thus sends a plaintext FIN.
> > > > 
> > > > If (big if) that can happen, then the connection cannot be cleanly
> > > > closed.  
> > > 
> > > Hm. And we can avoid this by only enforcing encryption of data-less
> > > segments once we've seen some encrypted data?  
> > 
> > That would help. It may also be needed to accept a pure ACK right at
> > the upgrade seqno. Depends on the upgrade process.
> > 
> > Which may be worth documenting explicitly: the system call and network
> > packet exchange from when one peer initiates (by generating its local
> > key) until the connection is fully encrypted. That also allows poking
> > at the various edge cases that may happen if packets are lost, or when
> > actions can race.
> 
> Dunno if the format below is good, but you're very right.
> At least to me writing the diagram was an hour well spent :)

Great :)
 
> > One unexpected example of the latter that I came across was Tx SADB
> > key insertion in tail edge cases taking longer than network RTT, for
> > instance.
> > 
> > The kernel API can be exercised in a variety of ways, not all of them
> > will uphold the correctness. Documenting how it should be used should
> > help.
> > 
> > Even better when it reduces the option space. As it already does by
> > failing a Tx key install until Rx is configured.
> 
> Something along these lines?
> 
> "Sequence" diagram of the worst case scenario:
> 
> 01 p       Host A                         Host B
> 02 l t        ~~~~~~~~~~~[TCP 3 WHS]~~~~~~~~~~
> 03 a e        ~~~~~~[crypto negotiation]~~~~~~
> 04 i x                             [Rx key alloc = K-B]
> 05 n t                          <--- [app] K-B key send 
> 06 ------[Rx key alloc = K-A]-
> 07     [app] K-A key send -->|
> 08        [TCP] K-B input <-----
> 08 P      [TCP] K-B ACK ---->|
> 09 S R [app] recv(K-B)       |
> 10 P x [app] [Tx key set]    |  
> 11 -------------------------- 
> 12 P T [app] send(RPC) #####>|   
> 13 S x                       |<----    [TCP] Seq OoO! queue RPC, SACK
> 14 P      [TCP] retr K-A --->|   
> 15                           |  `->    [TCP] K-A input
> 16                           | <---    [TCP] K-A ACK (or FIN) 
> 17                           |      [app] recv(K-A)
> 18                           |      [app] [Tx key set]
> 19                            -----------------------------------
> 20
> 
> There is a causal dependency between Host B allocating the key (line 4),
> sending it (line 5) and Host A receiving it (line 8). Since Host B will
> accept PSP packets as soon as it allocated the key, Host A does not
> need to wait to start using the key (line 12). Host B will queue the
> RPC to the socket (line 13).
> 
> [Problem #1]
> 
> However, because Host B does not have a Tx key, the ACK / SACK packet
> (line 13) will not be encrypted. (Similarly if Host B decided to close
> the connection at this point, the resulting FIN packet would not be
> encrypted.)

Or if it plays SO_LINGER games the resulting RST.

> Host B needs to accept unencrypted non-data segments 
> (pure acks, pure FIN) until it sees an encrypted packet from Host B.
>
> [Problem #2]
> 
> The retansmissions of K-A are unencrypted, to avoid sending the same
> data in encrypted and unencrypted form. This poses a risk if an ACK
> gets lost but both hosts end up in the PSP Tx state. Assume that Host A
> did not send the RPC (line 12), and the retransmission (line 14)
> happens as an RTO or TLP. Host B may already reach PSP Tx state (line
> "20") and expect encrypted data. Plain text retransmissions (with
> sequence number before rcv_nxt) must be accepted until Host B sees
> encrypted data from Host A.

Is that sufficient if an initial encrypted packet could get reordered
by the network to arrive before a plaintext retransmit of a lower
seqno?

Both scenarios make sense. It is unfortunately harder to be sure that
we have captured all edge cases.

An issue related to the rcv_nxt cut-point, not sure how important: the
plaintext packet contents are protected by user crypto before upgrade.
But the TCP headers are not. PSP relies on TCP PAWS against replay
protection. It is possible for a MITM to offset all seqno from the
start of connection establishment. I don't see an immediate issue. But
at a minimum it could be possible to insert or delete before PSP is
upgraded.

> 
> With that I think the state machine needs to be amended:
> 
> Event          | Normal TCP  | Rx PSP      | Tx PSP      | PSP full    |
> -----------------------------------------------------------------------
> Rx plain (new) | accept      | accept      | drop        | drop        |
> 
> Rx plain       | accept      | accept      | accept      | drop        |
> (ACK|FIN|rtx)  |             |             |             |             |
> 
> Rx PSP (good)  | drop        | accept      | accept      | accept      |
> 
> Rx PSP (bad    | drop        | drop        | drop        | drop        |
> (crypt, !=SPI) |             |             |             |             |
> 
> Tx             | plain text  | plain text  | encrypted   | encrypted   |
>                |             |             | (excl. rtx) | (excl. rtx) |
> 
> > > > Another example where a peer stays open and stays retrying if it has
> > > > upgraded and drops all plaintext.  
> > 
> > May want to always allow plaintext RSTs. This is a potential DoS
> > vector.
> 
> Because of key exhaustion? Or we can be tricked into spamming someone
> with retranmissions and ignoring their RST?

Simpler: this falls back onto unencrypted TCP where someone capable of
spoofing valid data is capable of terminating a connection.

If denying all plaintext after upgrade, PSP protects against this.

It is arguably low on the list of concerns, especially in a closed
world hyperscaler setting. As it is hardly the only DoS vector.

> > In all these cases, I suppose this has already been figured
> > out for TLS.
> 
> Assuming the answer above is "key exhaustion" - I wouldn't be surprised
> if it wasn't :(

Jakub Kicinski June 5, 2024, 10:24 p.m. UTC | #14

On Wed, 05 Jun 2024 16:11:31 -0400 Willem de Bruijn wrote:
> > The retansmissions of K-A are unencrypted, to avoid sending the same
> > data in encrypted and unencrypted form. This poses a risk if an ACK
> > gets lost but both hosts end up in the PSP Tx state. Assume that Host A
> > did not send the RPC (line 12), and the retransmission (line 14)
> > happens as an RTO or TLP. Host B may already reach PSP Tx state (line
> > "20") and expect encrypted data. Plain text retransmissions (with
> > sequence number before rcv_nxt) must be accepted until Host B sees
> > encrypted data from Host A.  
> 
> Is that sufficient if an initial encrypted packet could get reordered
> by the network to arrive before a plaintext retransmit of a lower
> seqno?

Yes, I believe that's fine. 

I will document this clearer but both sides must be pretty precise in
their understanding when the switchover happens. They must read what 
they expect to be clear text, and then install the Tx key thus locking
down the socket.

1. If they under-read and clear text data is already queued - the kernel
   will error out.
2. If they under-read and clear text arrives later - the connection will
   hang.
3. If they over-read they will presumably get PSP-protected data
   which they have no way of validating, since it won't be secured by
   user crypto.

We could protect from over-read (case 3) by refusing to give out
PSP-protected data until keys are installed. But it adds to the fast
path and I don't think it's all that beneficial, since there's no way
to protect a sloppy application from under-read (case 2).

Back to your question about reordering plain text with cipher text:
the application should not lock down the socket until it gets all
its clear text. So clear text retransmissions _after_ lock down must be
spurious. The only worry is that we lose an ACK and never tell
the other side that we got all the clear text. But we're guaranteed
to successfully ACK any PSP-protected data, so if we receive some
there is no way to get stuck.  Let me copy / paste the diagram:

01 p       Host A                         Host B
02 l t        ~~~~~~~~~~~[TCP 3 WHS]~~~~~~~~~~
03 a e        ~~~~~~[crypto negotiation]~~~~~~
04 i x                             [Rx key alloc = K-B]
05 n t                          <--- [app] K-B key send 
06 ------[Rx key alloc = K-A]-
07     [app] K-A key send -->|
08        [TCP] K-B input <-----
08 P      [TCP] K-B ACK ---->|
09 S R [app] recv(K-B)       |
10 P x [app] [Tx key set]    |  
11 -------------------------- 
12 P T [app] send(RPC) #####>|   
13 S x                       |<----    [TCP] Seq OoO! queue RPC, SACK
14 P      [TCP] retr K-A --->|   
15                           |  `->    [TCP] K-A input
16                           | <---    [TCP] K-A ACK (or FIN) 
17                           |      [app] recv(K-A)
18                           |      [app] [Tx key set]
19                            -----------------------------------
20

Looking as Host A, if we receive encrypted data, we must have
allocated and sent key (line 7) so we will start accepting encrypted
data. But at this point we are also accepting plain text (until we
reach line 9). We will send a plain text (S)ACK to encrypted data, 
but that's fine too since Host B hasn't seen any encrypted data from us
and will accept such ACKs.

> Both scenarios make sense. It is unfortunately harder to be sure that
> we have captured all edge cases.

Are you trying to say packetdrill without saying packetdrill? :)

> An issue related to the rcv_nxt cut-point, not sure how important: the
> plaintext packet contents are protected by user crypto before upgrade.
> But the TCP headers are not. PSP relies on TCP PAWS against replay
> protection. It is possible for a MITM to offset all seqno from the
> start of connection establishment. I don't see an immediate issue. But
> at a minimum it could be possible to insert or delete before PSP is
> upgraded.

Yes, the "cut off" point must be quite clearly defined, because both
sides must precisely read out all the clear text. Then they install 
the Tx key and anything they read must have been PSP-protected.

Hope I understood the point.

Willem de Bruijn June 6, 2024, 2:40 a.m. UTC | #15

Jakub Kicinski wrote:
> On Wed, 05 Jun 2024 16:11:31 -0400 Willem de Bruijn wrote:
> > > The retansmissions of K-A are unencrypted, to avoid sending the same
> > > data in encrypted and unencrypted form. This poses a risk if an ACK
> > > gets lost but both hosts end up in the PSP Tx state. Assume that Host A
> > > did not send the RPC (line 12), and the retransmission (line 14)
> > > happens as an RTO or TLP. Host B may already reach PSP Tx state (line
> > > "20") and expect encrypted data. Plain text retransmissions (with
> > > sequence number before rcv_nxt) must be accepted until Host B sees
> > > encrypted data from Host A.  
> > 
> > Is that sufficient if an initial encrypted packet could get reordered
> > by the network to arrive before a plaintext retransmit of a lower
> > seqno?
> 
> Yes, I believe that's fine. 
> 
> I will document this clearer but both sides must be pretty precise in
> their understanding when the switchover happens. They must read what 
> they expect to be clear text, and then install the Tx key thus locking
> down the socket.
> 
> 1. If they under-read and clear text data is already queued - the kernel
>    will error out.
> 2. If they under-read and clear text arrives later - the connection will
>    hang.
> 3. If they over-read they will presumably get PSP-protected data
>    which they have no way of validating, since it won't be secured by
>    user crypto.
> 
> We could protect from over-read (case 3) by refusing to give out
> PSP-protected data until keys are installed. But it adds to the fast
> path and I don't think it's all that beneficial, since there's no way
> to protect a sloppy application from under-read (case 2).
> 
> Back to your question about reordering plain text with cipher text:
> the application should not lock down the socket until it gets all
> its clear text. So clear text retransmissions _after_ lock down must be
> spurious.

Ah yes, good point.

> The only worry is that we lose an ACK and never tell
> the other side that we got all the clear text. But we're guaranteed
> to successfully ACK any PSP-protected data, so if we receive some
> there is no way to get stuck.  Let me copy / paste the diagram:
> 
> 01 p       Host A                         Host B
> 02 l t        ~~~~~~~~~~~[TCP 3 WHS]~~~~~~~~~~
> 03 a e        ~~~~~~[crypto negotiation]~~~~~~
> 04 i x                             [Rx key alloc = K-B]
> 05 n t                          <--- [app] K-B key send 
> 06 ------[Rx key alloc = K-A]-
> 07     [app] K-A key send -->|
> 08        [TCP] K-B input <-----
> 08 P      [TCP] K-B ACK ---->|
> 09 S R [app] recv(K-B)       |
> 10 P x [app] [Tx key set]    |  
> 11 -------------------------- 
> 12 P T [app] send(RPC) #####>|   
> 13 S x                       |<----    [TCP] Seq OoO! queue RPC, SACK
> 14 P      [TCP] retr K-A --->|   
> 15                           |  `->    [TCP] K-A input
> 16                           | <---    [TCP] K-A ACK (or FIN) 
> 17                           |      [app] recv(K-A)
> 18                           |      [app] [Tx key set]
> 19                            -----------------------------------
> 20
> 
> Looking as Host A, if we receive encrypted data, we must have
> allocated and sent key (line 7) so we will start accepting encrypted
> data. But at this point we are also accepting plain text (until we
> reach line 9). We will send a plain text (S)ACK to encrypted data, 
> but that's fine too since Host B hasn't seen any encrypted data from us
> and will accept such ACKs.
> 
> > Both scenarios make sense. It is unfortunately harder to be sure that
> > we have captured all edge cases.
> 
> Are you trying to say packetdrill without saying packetdrill? :)

Ha, no, no such hint implied.

I did expand packetdrill to PSP to exercise the cases that I could
come up with, at a minimum to ensure coverage of all branches.

But does that cover all edge cases possible? Including drops,
reorders, geometry changes from MTU changes, SO_LINGER 0, races with
slow OS operations (like that slow SADB insertion I mentioned)? The
unknown unknowns. Stuck connections are a low risk, bugs that can be
fixed later. As long as it is easy to reason that actual crypto issues
like plaintext leaks are not reachable.

Extending packetdrill to netlink would be quite some work, I suspect.
A quick scan shows that it knows NLA, but only for OPT_STATS decoding.

> > An issue related to the rcv_nxt cut-point, not sure how important: the
> > plaintext packet contents are protected by user crypto before upgrade.
> > But the TCP headers are not. PSP relies on TCP PAWS against replay
> > protection. It is possible for a MITM to offset all seqno from the
> > start of connection establishment. I don't see an immediate issue. But
> > at a minimum it could be possible to insert or delete before PSP is
> > upgraded.
> 
> Yes, the "cut off" point must be quite clearly defined, because both
> sides must precisely read out all the clear text. Then they install 
> the Tx key and anything they read must have been PSP-protected.
> 
> Hope I understood the point.

I think the issue, if any, is that there may be a gap between the two
methods of integrity protection. What we call "cleartext" here is
integrity protected such that no insertion or deletion attacks are
possible. And PSP ensures the same. But is a a deletion of the last
plaintext or first ciphertext possible?

An insertion is not an issue as it will be protected by neither,
while PSP is expected, so it is dropped.

As long as the application (or is it presentation?) layer has a clear
definition of at what point in the stream it must insert the Tx key,
plaintext deletion is not possible, as the key is not inserted until
all plaintext has been received.

Which leaves: is it possible for a MITM to offset the seqno, such that
the first PSP encrypted packet can be removed from the stream and this
goes undetected?

Sasha Levin June 26, 2024, 1:57 p.m. UTC | #16

On Fri, May 10, 2024 at 05:11:32PM -0700, Jakub Kicinski wrote:
>Yes, I should have CCed Meta's folks who work on TLS [1]. Adding them
>now. More than happy to facilitate the discussion, maybe Willem can
>CC the right Google folks, IDK who else...
>
>We should start moving with the kernel support, IMO, until we do
>the user space implementation is stalled. I don't expect that the
>way we install keys in the kernel would be impacted by the handshake.
>
>[1] https://github.com/facebookincubator/fizz

Having worked on this in the past, I was curious about looking at the
kernel/userspace implementation, and was surprised there's nothing at
all on the userspace side of things.

What's the rush in merging this in prior to having a usable, open
userspace? Can't they be developed side by side and merged in together?

Lance Richardson June 27, 2024, 3:14 p.m. UTC | #17

On Wed, May 29, 2024 at 1:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 12 May 2024 21:24:23 -0400 Willem de Bruijn wrote:

... snip ...

> > Connection key rotation is not supported? I did notice that tx key
> > insertion fails if a key is already present, so this does appear to be
> > the behavior.
>
> Correct, for now connections need to be re-established once a day.
> Rx should be easy, Tx we can make easy by only supporting rotation
> when there's no data queued.
>

Could you elaborate on why updating the Tx key should only be allowed when
no data is queued? At the point rekeying is being done, the receiver should
accept both the new and previous key:spi.

The lack of support for rekeying existing connections is a significant gap. At
a minimum the API for notifying the application that a rotation has occurred
should be defined, and the implementation should allow the configuration
of a new Tx key:spi for rekeying. A tiny bit logic would also be needed on the
Rx side to track the current and previous SPI, if the hardware supports keys
indescriptors then nothing more should be needed on the Tx side. If the NIC
maintains an SA database and doesn't allow existing entries to be updated,
a small amount of additional logic would be needed, but perhaps that could
be (waving hands a bit) the responsibility of the driver.

   - Lance

Jakub Kicinski June 27, 2024, 10:33 p.m. UTC | #18

On Thu, 27 Jun 2024 11:14:39 -0400 Lance Richardson wrote:
> > > Connection key rotation is not supported? I did notice that tx key
> > > insertion fails if a key is already present, so this does appear to be
> > > the behavior.  
> >
> > Correct, for now connections need to be re-established once a day.
> > Rx should be easy, Tx we can make easy by only supporting rotation
> > when there's no data queued.
> 
> Could you elaborate on why updating the Tx key should only be allowed when
> no data is queued? At the point rekeying is being done, the receiver should
> accept both the new and previous key:spi.

I didn't say it shouldn't be allowed, just that disallowing it
initially would make the implementation easier ;)

> The lack of support for rekeying existing connections is a significant gap. At
> a minimum the API for notifying the application that a rotation has occurred
> should be defined, 

Notifications are in place, that's one of the reasons I chose netlink.

> and the implementation should allow the configuration of a new Tx
> key:spi for rekeying.

I was under the possibly mistaken impression that Google have used PSP
for years without rekeying... Did I misunderstand?

> A tiny bit logic would also be needed on the Rx
> side to track the current and previous SPI, if the hardware supports
> keys indescriptors then nothing more should be needed on the Tx side.
> If the NIC maintains an SA database and doesn't allow existing
> entries to be updated, a small amount of additional logic would be
> needed, but perhaps that could be (waving hands a bit) the
> responsibility of the driver.

Interesting. Hm. But SADB drivers would then have to implement some
complex logic to make sure all rings have cycled, or take references.
I'd rather have an opt-in for core to delay reaping old keys until
all sockets which used them went empty at least once (in wmem sense).

Lance Richardson June 28, 2024, 7:33 p.m. UTC | #19

On Thu, Jun 27, 2024 at 6:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> I was under the possibly mistaken impression that Google have used PSP
> for years without rekeying... Did I misunderstand?
>
Actually Google does implement connection rekeying when master key
rotation occurs. I believe this was the case even in the first production
deployment (Willem would know the history better).

> > A tiny bit logic would also be needed on the Rx
> > side to track the current and previous SPI, if the hardware supports
> > keys indescriptors then nothing more should be needed on the Tx side.
> > If the NIC maintains an SA database and doesn't allow existing
> > entries to be updated, a small amount of additional logic would be
> > needed, but perhaps that could be (waving hands a bit) the
> > responsibility of the driver.
>
> Interesting. Hm. But SADB drivers would then have to implement some
> complex logic to make sure all rings have cycled, or take references.
> I'd rather have an opt-in for core to delay reaping old keys until
> all sockets which used them went empty at least once (in wmem sense).

Right, ensuring that the old entry is no longer referenced by packets in the
transmit pipeline before removing is definitely a concern. One simple
approach is to simply keep the old entry around for long enough (e.g. a
minute or two) to ensure that any packets referencing it have been transmitted.

Jakub Kicinski June 28, 2024, 11:41 p.m. UTC | #20

On Fri, 28 Jun 2024 15:33:28 -0400 Lance Richardson wrote:
> > Interesting. Hm. But SADB drivers would then have to implement some
> > complex logic to make sure all rings have cycled, or take references.
> > I'd rather have an opt-in for core to delay reaping old keys until
> > all sockets which used them went empty at least once (in wmem sense).  
> 
> Right, ensuring that the old entry is no longer referenced by packets in the
> transmit pipeline before removing is definitely a concern. One simple
> approach is to simply keep the old entry around for long enough (e.g. a
> minute or two) to ensure that any packets referencing it have been transmitted.

Sounds good, we can opportunistically remove immediately if socket is
idle, otherwise use a timeout. I'll try to have the patches on GH
before I post v1, it won't all fit in one reviewable series.

[RFC,net-next,01/15] psp: add documentation

Checks

Commit Message

Comments

Patch