
[v2,2/3] net/handshake: Add support for PF_HANDSHAKE

Message ID 167474894272.5189.9499312703868893688.stgit@91.116.238.104.host.secureserver.net (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series Another crack at a handshake upcall mechanism

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2633 this patch: 2633
netdev/cc_maintainers warning 6 maintainers not CCed: linux-trace-kernel@vger.kernel.org mhiramat@kernel.org davem@davemloft.net pabeni@redhat.com edumazet@google.com rostedt@goodmis.org
netdev/build_clang fail Errors and warnings before: 542 this patch: 546
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2773 this patch: 2773
netdev/checkpatch warning CHECK: Alignment should match open parenthesis CHECK: Lines should not end with a '(' CHECK: Please don't use multiple blank lines CHECK: Please use a blank line after function/struct/union/enum declarations CHECK: extern prototypes should be avoided in .h files WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: networking block comments don't use an empty /* line, use /* Comment...
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Chuck Lever III Jan. 26, 2023, 4:02 p.m. UTC
In-kernel TLS consumers need a way to perform a TLS handshake. In
the absence of a TLS handshake implementation in the kernel itself,
a mechanism to perform the handshake in user space, using an
existing TLS handshake library, is necessary.

I've designed a way to pass a connected kernel socket endpoint to
user space using the traditional listen/accept mechanism. accept(2)
gives us a well-worn building block that can materialize a connected
socket endpoint as a file descriptor in a specific user space
process. Like any open socket descriptor, the accepted FD can then
be passed to a library such as GnuTLS to perform a TLS handshake.
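
For illustration only, the daemon side of that listen/poll/accept flow might
look like the sketch below. PF_HANDSHAKE exists only in this series, so the
sketch substitutes an AF_UNIX listener in the Linux abstract namespace.
fill_addr(), make_listener(), fake_request() and wait_for_request() are
hypothetical helper names, and fake_request() merely plays the part of the
kernel queueing a connected endpoint.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill an abstract-namespace AF_UNIX address; returns its length, or 0. */
socklen_t fill_addr(struct sockaddr_un *addr, const char *name)
{
	size_t n;

	assert(addr && name);
	n = strlen(name);
	if (n == 0 || n > sizeof(addr->sun_path) - 1)
		return 0;
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	/* sun_path[0] stays NUL: that selects the abstract namespace. */
	memcpy(addr->sun_path + 1, name, n);
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Stand-in for socket(PF_HANDSHAKE, ...) followed by bind() and listen(). */
int make_listener(const char *name)
{
	struct sockaddr_un addr;
	socklen_t len = fill_addr(&addr, name);
	int fd;

	if (!len)
		return -1;
	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 8) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Plays the kernel's role: queue one connected endpoint on the listener. */
int fake_request(const char *name)
{
	struct sockaddr_un addr;
	socklen_t len = fill_addr(&addr, name);
	int fd;

	if (!len)
		return -1;
	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	if (connect(fd, (struct sockaddr *)&addr, len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Daemon side: wait for a queued endpoint, then materialize it as an fd. */
int wait_for_request(int listener, int timeout_ms)
{
	struct pollfd pfd = { .fd = listener, .events = POLLIN };

	if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
		return -1;
	return accept(listener, NULL, NULL);
}
```

The descriptor returned by wait_for_request() is what a daemon would then hand
to its TLS library, for example via GnuTLS's gnutls_transport_set_int().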

The socket sharing mechanism is built into the kernel for now since
it is a small utility to be used by several transport layer security
mechanisms.

This prototype is net-namespace aware.

NB: The kernel has no mechanism to attest that the listening user
space agent is trustworthy.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/net/handshake.h          |   31 +
 include/net/sock.h               |    2 
 include/trace/events/handshake.h |  328 +++++++++++++++
 include/uapi/linux/handshake.h   |   49 ++
 net/Makefile                     |    1 
 net/handshake/Makefile           |    7 
 net/handshake/af_handshake.c     |  838 ++++++++++++++++++++++++++++++++++++++
 net/handshake/handshake.h        |   33 +
 net/handshake/netlink.c          |  169 ++++++++
 net/handshake/trace.c            |   20 +
 10 files changed, 1478 insertions(+)
 create mode 100644 include/net/handshake.h
 create mode 100644 include/trace/events/handshake.h
 create mode 100644 include/uapi/linux/handshake.h
 create mode 100644 net/handshake/Makefile
 create mode 100644 net/handshake/af_handshake.c
 create mode 100644 net/handshake/handshake.h
 create mode 100644 net/handshake/netlink.c
 create mode 100644 net/handshake/trace.c

Comments

Jakub Kicinski Jan. 28, 2023, 8:32 a.m. UTC | #1
On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:
> I've designed a way to pass a connected kernel socket endpoint to
> user space using the traditional listen/accept mechanism. accept(2)
> gives us a well-worn building block that can materialize a connected
> socket endpoint as a file descriptor in a specific user space
> process. Like any open socket descriptor, the accepted FD can then
> be passed to a library such as GnuTLS to perform a TLS handshake.

I can't bring myself to like the new socket family layer.
I'd like a second opinion on that, if anyone within netdev
is willing to share..
Chuck Lever III Jan. 28, 2023, 2:06 p.m. UTC | #2
> On Jan 28, 2023, at 3:32 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:
>> I've designed a way to pass a connected kernel socket endpoint to
>> user space using the traditional listen/accept mechanism. accept(2)
>> gives us a well-worn building block that can materialize a connected
>> socket endpoint as a file descriptor in a specific user space
>> process. Like any open socket descriptor, the accepted FD can then
>> be passed to a library such as GnuTLS to perform a TLS handshake.
> 
> I can't bring myself to like the new socket family layer.

poll/listen/accept is the simplest and most natural way of
materializing a socket endpoint in a process that I can think
of. It's a well-understood building block. What specifically
is troubling you about it?


> I'd like a second opinion on that, if anyone within netdev
> is willing to share..

Hopefully that opinion comes with an alternative way of getting
a connected kernel socket endpoint up to user space without
race issues.

We need to make some progress on this. If you don't have a
technical objection, I think we should go with this with the
idea that eventually something more palatable will come along
to replace it.


--
Chuck Lever
Stephen Hemminger Jan. 28, 2023, 5:40 p.m. UTC | #3
On Sat, 28 Jan 2023 00:32:12 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:
> > I've designed a way to pass a connected kernel socket endpoint to
> > user space using the traditional listen/accept mechanism. accept(2)
> > gives us a well-worn building block that can materialize a connected
> > socket endpoint as a file descriptor in a specific user space
> > process. Like any open socket descriptor, the accepted FD can then
> > be passed to a library such as GnuTLS to perform a TLS handshake.  
> 
> I can't bring myself to like the new socket family layer.
> I'd like a second opinion on that, if anyone within netdev
> is willing to share..

Why not just pass fds over a Unix domain socket?
The application will already need changes to handle the new AF.

Also, expanding the address families has security impacts.
Either all the containers and LSMs need to deny your new AF, or they need
to be taught to validate whether this is a valid operation.
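
As a point of reference, descriptor passing between two user space processes
over a Unix domain socket uses an SCM_RIGHTS control message. The sketch below
is a minimal version of that standard technique; send_fd() and recv_fd() are
hypothetical helper names, and nothing here is taken from the patch series.
Note that the receiver gets a new descriptor installed in its fd table as part
of recvmsg(), before the application has looked at it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor 'fd' across the Unix domain socket 'chan'. */
int send_fd(int chan, int fd)
{
	char byte = 0;	/* at least one byte of ordinary data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from 'chan'; returns the new fd, or -1. */
int recv_fd(int chan)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(chan, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}
```

This only covers the user-space-to-user-space case; how a kernel consumer
would originate such a message is not addressed by the sketch.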
Hannes Reinecke Jan. 29, 2023, 4:21 p.m. UTC | #4
On 1/28/23 09:32, Jakub Kicinski wrote:
> On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:
>> I've designed a way to pass a connected kernel socket endpoint to
>> user space using the traditional listen/accept mechanism. accept(2)
>> gives us a well-worn building block that can materialize a connected
>> socket endpoint as a file descriptor in a specific user space
>> process. Like any open socket descriptor, the accepted FD can then
>> be passed to a library such as GnuTLS to perform a TLS handshake.
> 
> I can't bring myself to like the new socket family layer.
> I'd like a second opinion on that, if anyone within netdev
> is willing to share..

I am not particularly fond of that, either, but the alternative of using
netlink doesn't make it any better.
You can't pass the fd/socket directly via netlink messages; you can only
pass the (open!) fd number with the message.
The fd itself _needs_ to be part of the process context of the
application by the time the application processes that message.
Consequently:
- I can't see how an application can _reject_ the message; the fd needs 
to be present in the fd table even before the message is processed, 
rendering any decision by the application pointless (and I would _so_ 
love to be proven wrong on this point)
- It's slightly tricky to handle processes which go away prior to 
handling the message; I _think_ the process cleanup code will close the 
fd, but I guess it also depends on how and when the fd is stored in the 
process context.

If someone can point me to a solution for these points I would vastly 
prefer to move to netlink. But with these issues in place I'm not sure 
if netlink doesn't cause more issues than it solves.

Cheers,

Hannes
Chuck Lever III Jan. 29, 2023, 4:53 p.m. UTC | #5
> On Jan 28, 2023, at 12:40 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> On Sat, 28 Jan 2023 00:32:12 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
>> On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:
>>> I've designed a way to pass a connected kernel socket endpoint to
>>> user space using the traditional listen/accept mechanism. accept(2)
>>> gives us a well-worn building block that can materialize a connected
>>> socket endpoint as a file descriptor in a specific user space
>>> process. Like any open socket descriptor, the accepted FD can then
>>> be passed to a library such as GnuTLS to perform a TLS handshake.  
>> 
>> I can't bring myself to like the new socket family layer.
>> I'd like a second opinion on that, if anyone within netdev
>> is willing to share..
> 
> Why not just pass fd's with Unix Domain socket?

Or a pipe. We do need a queue for handshake requests, and a
pipe provides a reliable mechanism to ensure that handshake
requests are not dropped.

The question I have is: would the application then just start
using that fd number as if it had been opened or accepted? Is
reading the fd from the pipe actually an implicit accept(2)?


> The application is going to need to be changed to handle new AF already.

The only application that has to deal with PF_HANDSHAKE is the
user space daemon that performs the handshakes, an example of
which we provide here:

  https://github.com/oracle/ktls-utils

This is for handling handshakes on behalf of kernel TLS consumers.
The new family would not be used by any existing application in
user space.


> Also, expanding the address families has security impacts as well.
> Either all the container and LSM's need to deny your new AF or they need
> to be taught to validate whether this a valid operation.

It wasn't clear to me yesterday whether Jakub's objection was to
the listen/poll/accept part of this contraption, or whether he
was uncomfortable specifically with the addition of PF_HANDSHAKE.
I can certainly see that a new socket family is unwieldy from a
security perspective.

However, what if listen/poll/accept was used with an existing
address family, maybe AF_INET, marked by a special bind address
or a socket option, or perhaps by a new netlink operation that
informs the kernel that the listener is specifically for transport
layer handshake requests...?

A socket that comes from an accept(2) on a PF_HANDSHAKE listener
is also a PF_HANDSHAKE socket, though it behaves in every other
aspect like an AF_INET/AF_INET6 socket. So, using an AF_INET/6
listener instead might be nicer overall.


--
Chuck Lever
Marcel Holtmann Jan. 30, 2023, 1:44 p.m. UTC | #6
Hi Hannes,

>>> I've designed a way to pass a connected kernel socket endpoint to
>>> user space using the traditional listen/accept mechanism. accept(2)
>>> gives us a well-worn building block that can materialize a connected
>>> socket endpoint as a file descriptor in a specific user space
>>> process. Like any open socket descriptor, the accepted FD can then
>>> be passed to a library such as GnuTLS to perform a TLS handshake.
>> I can't bring myself to like the new socket family layer.
>> I'd like a second opinion on that, if anyone within netdev
>> is willing to share..
> 
> I am not particularly fond of that, either, but the alternative of using netlink doesn't make it any better
> You can't pass the fd/socket directly via netlink messages, you can only pass the (open!) fd number with the message.
> The fd itself _needs_ be be part of the process context of the application by the time the application processes that message.
> Consequently:
> - I can't see how an application can _reject_ the message; the fd needs to be present in the fd table even before the message is processed, rendering any decision by the application pointless (and I would _so_ love to be proven wrong on this point)
> - It's slightly tricky to handle processes which go away prior to handling the message; I _think_ the process cleanup code will close the fd, but I guess it also depends on how and when the fd is stored in the process context.
> 
> If someone can point me to a solution for these points I would vastly prefer to move to netlink. But with these issues in place I'm not sure if netlink doesn't cause more issues than it solves.

I think we first need to figure out the security model behind this.

For kTLS you have the TLS Handshake messages inline with the TCP
socket and thus credentials are given by the owner of that socket.
This is simple and makes a lot of sense since whoever opened that
connection has to decide to give a client certificate or accept
the server certificate (in case of session resumption also provide
the PSK).

I would like to have a generic TLS Handshake interface as well, since
more and more protocols will take TLS 1.3 as reference and use its
handshake protocol. What I would not do is insist on using an fd
just because that is what OpenSSL and others are used to. The TLS
libraries need to move away from the fd as IO model and provide
appropriate APIs into the TLS Handshake (and also TLS Alert
protocol) for a “codec style” operation.

Fundamentally nothing speaks against TLS Handshake in the kernel. All
the core functionality is already present. All KPP, HKDF and even the
certificate handling is present. In a simplified view, you just need
to give the kernel a keyctl keyring that has the CA certs to verify
and provide the keyring with either client or server certificate to
use.

On a TCP socket for example you could do this:

	setsockopt(fd, SOL_TCP, TCP_ULP, "tls+hs", ..);

	tls_client.cert_id = key_id_cert;
	tls_client.ca_id = key_id_ca;

	setsockopt(fd, SOL_TLS, TLS_CLIENT, &tls_client, ..);

Failures or errors would be reported out via socket errors or SCM.
And you need some extra options to select cipher ranges or limit to
TLS 1.3 only etc.
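
For comparison, the kTLS interface as it exists today expects user space to
run the handshake itself and then install the negotiated secrets. The sketch
below shows roughly what the transmit side of that looks like for TLS 1.3 with
AES-128-GCM. The "tls+hs" ULP name and the TLS_CLIENT option above are
proposals, not existing interfaces. fill_tx_info() and enable_ktls_tx() are
hypothetical helper names, and the fallback #defines only cover older libc
headers.

```c
#include <assert.h>
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef SOL_TCP
#define SOL_TCP IPPROTO_TCP
#endif

/* Pack already-negotiated TLS 1.3 AES-128-GCM transmit secrets for kTLS. */
void fill_tx_info(struct tls12_crypto_info_aes_gcm_128 *ci,
		  const unsigned char *key, const unsigned char *salt,
		  const unsigned char *iv, const unsigned char *rec_seq)
{
	assert(ci && key && salt && iv && rec_seq);
	memset(ci, 0, sizeof(*ci));
	ci->info.version = TLS_1_3_VERSION;
	ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
	memcpy(ci->key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
	memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
	memcpy(ci->iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
	memcpy(ci->rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
}

/* Attach the "tls" ULP to a connected TCP socket and install the TX keys. */
int enable_ktls_tx(int fd, const struct tls12_crypto_info_aes_gcm_128 *ci)
{
	if (setsockopt(fd, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
		return -1;
	return setsockopt(fd, SOL_TLS, TLS_TX, ci, sizeof(*ci));
}
```

The receive direction would be configured the same way with TLS_RX. Everything
that produces the key material still happens outside the kernel in this model.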

But overall it would make using TCP+TLS really simple. The complicated
part is providing the key ring. Then again, the CA key ring could be
inherited from systemd or some basic component setting it up and
sealing it.

For other protocols or usages the input would be similar. It should
be rather straightforward to provide key ring identifiers as a mount
option or via an ioctl.

This however needs to overcome the fear of putting the TLS Handshake
into the kernel. I can understand anybody thinking that it is not a
good idea and with TLS 1.2 and before it is a bit convoluted and
error prone. However starting with TLS 1.3 things are a lot simpler
and streamlined. There are a few oddities where TLS 1.3 has to look
like TLS 1.2 on the wire, but that mainly only affects the TLS
record protocol and kTLS does that today already anyway.

For reference ELL (git.kernel.org/pub/scm/libs/ell/ell.git) has a
TLS implementation that utilizes AF_ALG and keyctl for all the
basic crypto needs. Certificates and certificate operations are
purely done via keyctl and that works nicely. If KPP would finally
get a userspace interface, even shared secret derivation would go
via kernel crypto.

The code is currently TLS 1.2 and earlier, but I have code for
TLS 1.3 and also code for utilizing kTLS. It needs a bit more
cleanup, but then I am happy to publish it. The modified code
for TLS 1.3 support has TLS Handshake+Alert separated from TLS
Record protocol and doesn’t even rely on an fd to operate. This
comes from the requirement that TLS for WiFi Enterprise (or in
the future QUIC) doesn’t have a fd either.

Long story short, who is supposed to run the TLS Handshake if
we push it to userspace? There will never be a generic daemon
that handles all handshakes since they are all application
specific. No daemon can run the TLS Handshake on behalf of
the Chrome browser, for example. This leads me to conclude that
AF_HANDSHAKE is not a good idea.

One nice thing we found with using keyctl for WiFi Enterprise
is that we can have certificates that are backed by the TPM.
Doing that via keyctl was a lot simpler than dealing with the
different oddities of SSL engines or different variations of
crypto libraries. The unification by the kernel is really
nice. I have to re-read how far EFI can securely provide
hardware-backed keys, but for everybody working in early
userspace or initramfs it is nice to be able to utilize
this without having to drag in megabytes of TLS library.

Regards

Marcel
Chuck Lever III Jan. 30, 2023, 3 p.m. UTC | #7
Hello Marcel-


> On Jan 30, 2023, at 8:44 AM, Marcel Holtmann <marcel@holtmann.org> wrote:
> 
> Hi Hannes,
> 
>>>> I've designed a way to pass a connected kernel socket endpoint to
>>>> user space using the traditional listen/accept mechanism. accept(2)
>>>> gives us a well-worn building block that can materialize a connected
>>>> socket endpoint as a file descriptor in a specific user space
>>>> process. Like any open socket descriptor, the accepted FD can then
>>>> be passed to a library such as GnuTLS to perform a TLS handshake.
>>> I can't bring myself to like the new socket family layer.
>>> I'd like a second opinion on that, if anyone within netdev
>>> is willing to share..
>> 
>> I am not particularly fond of that, either, but the alternative of using netlink doesn't make it any better
>> You can't pass the fd/socket directly via netlink messages, you can only pass the (open!) fd number with the message.
>> The fd itself _needs_ be be part of the process context of the application by the time the application processes that message.
>> Consequently:
>> - I can't see how an application can _reject_ the message; the fd needs to be present in the fd table even before the message is processed, rendering any decision by the application pointless (and I would _so_ love to be proven wrong on this point)
>> - It's slightly tricky to handle processes which go away prior to handling the message; I _think_ the process cleanup code will close the fd, but I guess it also depends on how and when the fd is stored in the process context.
>> 
>> If someone can point me to a solution for these points I would vastly prefer to move to netlink. But with these issues in place I'm not sure if netlink doesn't cause more issues than it solves.
> 
> I think we first need to figure out the security model behind this.
> 
> For kTLS you have the TLS Handshake messages inline with the TCP
> socket and thus credentials are given by the owner of that socket.
> This is simple and makes a lot of sense since whoever opened that
> connection has to decide to give a client certificate or accept
> the server certificate (in case of session resumption also provide
> the PSK).
> 
> I like to have a generic TLS Handshake interface as well since more
> and more protocols will take TLS 1.3 as reference and use its
> handshake protocol. What I would not do is insist on using an fd,
> because that is what OpenSSL and others are just used to. The TLS
> libraries need to go away from the fd as IO model and provide
> appropriate APIs into the TLS Handshake (and also TLS Alert
> protocol) for a “codec style” operation.

Note that we are not attempting to replace handshakes for user
applications -- the purpose of AF_HANDSHAKE is for dealing with
the needs of /kernel/ consumers of transport layer security.


> Fundamentally nothing speaks against TLS Handshake in the kernel. All
> the core functionality is already present. All KPP, HKDF and even the
> certifiacate handling is present. In a simplified view, you just need
> To give the kernel a keyctl keyring that has the CA certs to verify
> and provide the keyring with either client or server certificate to
> use.

I agree strongly that the handshake logic can and should reside
in the kernel. The kernel security community and many file
system developers have told us repeatedly that an in-kernel
design is a non-starter, thus we are going with an upcall for
now.


> On a TCP socket for example you could do this:
> 
> 	setsockopt(fd, SOL_TCP, TCP_ULP, “tls+hs", ..);
> 
> 	tls_client.cert_id = key_id_cert;
> 	tls_client.ca_id = key_id_ca;
> 
> 	setsockopt(fd, SOL_TLS, TLS_CLIENT, &tls_client, ..);
> 
> Failures or errors would be reported out via socket errors or SCM.
> And you need some extra options to select cipher ranges or limit to
> TLS 1.3 only etc.
> 
> But overall it would make using TCP+TLS really simple. The complicated
> part is providing the key ring. Then again, the CA key ring could be
> inherited from systemd or some basic component setting it up and
> sealing it.
> 
> For other protocols or usages the input would be similar. It should
> be rather straight forward to provide key ring identifiers as mount
> option or via an ioctl.

That is essentially what I'm attempting to do here: provide a
stable API that kernel TLS consumers can use that hides the
mechanism of the handshake. It can be an upcall, it can be done
in the kernel, it can use a struct socket * endpoint or a file
descriptor. But the API that kernel consumers see does not
change, and we can build something very much like what you
described behind that API, or replace it all with chewing gum,
baling wire, and band-aids. The consumers should not care, as
long as it works.


> This however needs to overcome the fear of putting the TLS Handshake
> into the kernel. I can understand anybody thinking that it is not a
> good idea and with TLS 1.2 and before it is a bit convoluted and
> error prone. However starting with TLS 1.3 things are a lot simpler
> and streamlined. There are few oddities where TLS 1.3 has to look
> like TLS 1.2 on the wire, but that mainly only affects the TLS
> record protocol and kTLS does that today already anyway.

We've been looking at the mbed-based in-kernel handshake that
Tempesta did, so it's not a stretch to imagine what such a
thing would look like.

Tempesta, however, is indeed looking at a handshake facility
that can be used by user applications, it should be noted.

The point of having a fixed in-kernel handshake API is that,
if something like Tempesta is ever adopted, in-kernel TLS
consumers can be switched over to that with little fuss.

Also note: the current cohort of kernel consumers do not need
or want anything older than TLSv1.3. Anything that has to
deal with user space applications likely needs compatibility
with TLSv1.2 at least.

The kernel consumers we want to support include:

 - RPC-with-TLS (RFC 9289)
 - NVMe over TCP with TLS (part of NVMe/TCP rev 1.0a)
 - SMB on QUIC (no public specification found)

All of these mandate at least TLSv1.3.


> For reference ELL (git.kernel.org/pub/scm/libs/ell/ell.git) has a
> TLS implementation that utilizes AF_ALG and keyctl for all the
> basic crypto needs. Certificates and certificate operations are
> purely done via keyctl and that works nicely. If KPP would finally
> get an usersapce interface, even shared secret derivation would go
> via kernel crypto.
> 
> The code is currently TLS 1.2 and earlier, but I have code for
> TLS 1.3 and also code for utilizing kTLS. It needs a bit more
> cleanup, but then I am happy to publish it. The modified code
> for TLS 1.3 support has TLS Handshake+Alert separated from TLS
> Record protocol and doesn’t even rely on an fd to operate. This
> comes from the requirement that TLS for WiFi Enterprise (or in
> the future QUIC) doesn’t have a fd either.

We have our eye on QUIC as well, for exactly those reasons.
But as I said, our use case is for kernel consumers of
transport layer security, such as SMB on QUIC support in
the kernel's cifs client.


> Long story short, who is suppose to run the TLS Handshake if
> we push it to userspace. There will be never a generic daemon
> that handles all handshakes since they are all application
> specific. No daemon can run the TLS Handshake on behalf of
> Chrome browser for example. This leads me to AF_HANDSHAKE
> is not a good idea.

In our narrow use case, user space applications will continue
to use existing libraries directly for doing the handshaking
that they need. I'm not advocating for using AF_HANDSHAKE to
replace existing user space TLS library users, because I
agree that user space is going to have a much broader set of
use cases that handshake implementations have to deal with.

For kernel consumers, we have provided an example user space
daemon that handles handshake requests via a direct call to
the GnuTLS library. We can add as much decoration on that
mechanism as kernel consumers require...

   https://github.com/oracle/ktls-utils

We expect there to be one of these daemons running in each
network namespace. The daemon reads authentication material
from local files or gets it from kernel keyrings.


> One nice thing we found with using keyctl for WiFi Enterprise
> is that we can have certificates that are backed by the TPM.

TPM support is one reason we use kernel keyrings in our design,
though for us this is currently untested.


> Doing that via keyctl was a lot simpler than dealing with the
> different oddities of SSL engines or different variations of
> crypto libraries. The unification by the kernel is really
> nice. I have to re-read how much EFI can provide securely
> hardware backed keys, but for everybody working in early
> userspace or initramfs it is nice to be able to utilize
> this without having to drag in megabytes of TLS library.

Performing handshakes during early boot is one very important
use case that doing an upcall does not handle without a lot
of additional complexity. This is one reason we'd like to see
an in-kernel handshake implementation.

--
Chuck Lever
Jakub Kicinski Jan. 31, 2023, 4:35 a.m. UTC | #8
On Sat, 28 Jan 2023 14:06:49 +0000 Chuck Lever III wrote:
> > On Jan 28, 2023, at 3:32 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:  
> >> I've designed a way to pass a connected kernel socket endpoint to
> >> user space using the traditional listen/accept mechanism. accept(2)
> >> gives us a well-worn building block that can materialize a connected
> >> socket endpoint as a file descriptor in a specific user space
> >> process. Like any open socket descriptor, the accepted FD can then
> >> be passed to a library such as GnuTLS to perform a TLS handshake.  
> > 
> > I can't bring myself to like the new socket family layer.  
> 
> poll/listen/accept is the simplest and most natural way of
> materializing a socket endpoint in a process that I can think
> of. It's a well-understood building block. What specifically
> is troubling you about it?

poll/listen/accept yes, but that's not the entire socket interface. 
Our overall experience with the TCP ULPs is rather painful, proxying
all the other callbacks here may add another dimension.

Also I have a fear (perhaps unjustified) of reusing constructs which are
cornerstones of the networking stack and treating them as abstractions.

> > I'd like a second opinion on that, if anyone within netdev
> > is willing to share..  
> 
> Hopefully that opinion comes with an alternative way of getting
> a connected kernel socket endpoint up to user space without
> race issues.

If the user application decides the fd, wouldn't that solve the problem
in netlink?

  kernel                          user space

   notification     ---------->
 (new connection awaits)

                    <----------
                                  request (target fd=100)

                    ---------->
   reply
 (fd 100 is installed;
  extra params)

> We need to make some progress on this. If you don't have a
> technical objection, I think we should go with this with the
> idea that eventually something more palatable will come along
> to replace it.
Hannes Reinecke Jan. 31, 2023, 7:40 a.m. UTC | #9
On 1/30/23 14:44, Marcel Holtmann wrote:
> Hi Hannes,
> 
>>>> I've designed a way to pass a connected kernel socket endpoint to
>>>> user space using the traditional listen/accept mechanism. accept(2)
>>>> gives us a well-worn building block that can materialize a connected
>>>> socket endpoint as a file descriptor in a specific user space
>>>> process. Like any open socket descriptor, the accepted FD can then
>>>> be passed to a library such as GnuTLS to perform a TLS handshake.
>>> I can't bring myself to like the new socket family layer.
>>> I'd like a second opinion on that, if anyone within netdev
>>> is willing to share..
>>
>> I am not particularly fond of that, either, but the alternative of using
>> netlink doesn't make it any better
>> You can't pass the fd/socket directly via netlink messages, you can only
>> pass the (open!) fd number with the message.
>> The fd itself _needs_ be be part of the process context of the application
>> by the time the application processes that message.
>> Consequently:
>> - I can't see how an application can _reject_ the message; the fd needs to
>> be present in the fd table even before the message is processed, rendering
>> any decision by the application pointless (and I would _so_ love to be proven
>> wrong on this point)
>> - It's slightly tricky to handle processes which go away prior to handling
>> the message; I _think_ the process cleanup code will close the fd, but I guess
>> it also depends on how and when the fd is stored in the process context.
>>
>> If someone can point me to a solution for these points I would vastly prefer
>> to move to netlink. But with these issues in place I'm not sure if netlink
>> doesn't cause more issues than it solves.
> 
> I think we first need to figure out the security model behind this.
> 
> For kTLS you have the TLS Handshake messages inline with the TCP
> socket and thus credentials are given by the owner of that socket.
> This is simple and makes a lot of sense since whoever opened that
> connection has to decide to give a client certificate or accept
> the server certificate (in case of session resumption also provide
> the PSK).
> 
> I like to have a generic TLS Handshake interface as well since more
> and more protocols will take TLS 1.3 as reference and use its
> handshake protocol. What I would not do is insist on using an fd,
> because that is what OpenSSL and others are just used to. The TLS
> libraries need to go away from the fd as IO model and provide
> appropriate APIs into the TLS Handshake (and also TLS Alert
> protocol) for a “codec style” operation.
> 
That's something we have discussed, too.
We could forward the TLS handshake frames via netlink, thus saving us 
the headache of passing an entire socket to userspace.
However, that would require major infrastructure work on the
libraries, and my experience with fixing/updating things in gnutls has
not been stellar. So I didn't pursue this route.

> Fundamentally nothing speaks against TLS Handshake in the kernel. All
> the core functionality is already present. All KPP, HKDF and even the
> certifiacate handling is present. In a simplified view, you just need
> To give the kernel a keyctl keyring that has the CA certs to verify
> and provide the keyring with either client or server certificate to
> use.
> 
> On a TCP socket for example you could do this:
> 
> 	setsockopt(fd, SOL_TCP, TCP_ULP, "tls+hs", ..);
> 
> 	tls_client.cert_id = key_id_cert;
> 	tls_client.ca_id = key_id_ca;
> 
> 	setsockopt(fd, SOL_TLS, TLS_CLIENT, &tls_client, ..);
> 
> Failures or errors would be reported out via socket errors or SCM.
> And you need some extra options to select cipher ranges or limit to
> TLS 1.3 only etc.
> 
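For comparison with the proposal quoted above, here is a minimal sketch of how the existing kTLS record layer is attached today. Only the "tls" ULP name and the TCP_ULP option are real; the "tls+hs" ULP and the SOL_TLS/TLS_CLIENT option in the quoted example are part of the proposal and do not exist. The helper names loopback_connect() and attach_tls_ulp() are made up for illustration, and the loopback connection is there only because the ULP is expected to attach to an established socket.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TCP
#define SOL_TCP 6
#endif

/* Hypothetical helper: connect a TCP socket to a throwaway loopback
 * listener. Returns the connected fd (or -1) and hands back the listener. */
static int loopback_connect(int *listener)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int lfd, cfd;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		return -1;
	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 1) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
		goto err;

	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd < 0)
		goto err;
	if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(cfd);
		goto err;
	}
	*listener = lfd;
	return cfd;
err:
	close(lfd);
	return -1;
}

/* Hypothetical helper: attach the existing kTLS ULP.
 * Returns 0 on success, otherwise the errno from setsockopt(). */
static int attach_tls_ulp(int fd)
{
	if (setsockopt(fd, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
		return errno;
	return 0;
}
```

A non-zero return (ENOENT is typical when the tls module is not available) means the kernel could not attach the ULP. Keys still have to be supplied separately after a handshake done elsewhere, which is the gap the proposal is about.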
Fundamentally you are correct. But these are security relevant areas, 
and any implementation we do will have to be vetted by some security 
people. _And_ will have to be maintained by someone well-versed in 
security, too, lest we have a security breach in the kernel.
And that person will certainly not be me, so I haven't attempted that route.

> But overall it would make using TCP+TLS really simple. The complicated
> part is providing the key ring. Then again, the CA key ring could be
> inherited from systemd or some basic component setting it up and
> sealing it.
> 
I don't think that's a major concern. The good thing with the keyring is 
that it can be populated externally, ie one can have a daemon to fetch 
the certificate and stuff it in the keyring. request_key() and all that ...

> For other protocols or usages the input would be similar. It should
> be rather straight forward to provide key ring identifiers as mount
> option or via an ioctl.
> 
> This however needs to overcome the fear of putting the TLS Handshake
> into the kernel. I can understand anybody thinking that it is not a
> good idea and with TLS 1.2 and before it is a bit convoluted and
> error prone. However starting with TLS 1.3 things are a lot simpler
> and streamlined. There are few oddities where TLS 1.3 has to look
> like TLS 1.2 on the wire, but that mainly only affects the TLS
> record protocol and kTLS does that today already anyway.
> 
See above. It's not so much 'fear' as rather the logistics of it. 
Getting hold of a TLS library is reasonably easy (Chuck had another 
example ready), but massaging it for inclusion into the kernel is quite 
some effort.
You might even succeed in convincing the powers that be to include it 
into the kernel.
But then you are stuck with having to find a capable maintainer, who is 
willing _and qualified_ to take the work and answer awkward questions.
And take the heat when that code introduces a security breach in the
Linux kernel.
Which excludes essentially everybody who has been working on this
project; we are capable enough engineers in the network and storage 
space, but deep security issues ... not so much.

> For reference ELL (git.kernel.org/pub/scm/libs/ell/ell.git) has a
> TLS implementation that utilizes AF_ALG and keyctl for all the
> basic crypto needs. Certificates and certificate operations are
> purely done via keyctl and that works nicely. If KPP would finally
> get a userspace interface, even shared secret derivation would go
> via kernel crypto.
> 
> The code is currently TLS 1.2 and earlier, but I have code for
> TLS 1.3 and also code for utilizing kTLS. It needs a bit more
> cleanup, but then I am happy to publish it. The modified code
> for TLS 1.3 support has TLS Handshake+Alert separated from TLS
> Record protocol and doesn’t even rely on an fd to operate. This
> comes from the requirement that TLS for WiFi Enterprise (or in
> the future QUIC) doesn’t have a fd either.
> 
If you have code to update it to 1.3 I would be very willing to look at
it; the main reason why we went with gnutls was that none of us was
eager (nor had the knowledge) to really delve into TLS and do fancy things.

And that was the other thing; we found quite a few TLS implementations,
but nearly all of them said '1.3 support to come' ...

> Long story short, who is suppose to run the TLS Handshake if
> we push it to userspace. There will be never a generic daemon
> that handles all handshakes since they are all application
> specific. No daemon can run the TLS Handshake on behalf of
> Chrome browser for example. This leads me to AF_HANDSHAKE
> is not a good idea.
> 
> One nice thing we found with using keyctl for WiFi Enterprise
> is that we can have certificates that are backed by the TPM.
> Doing that via keyctl was a lot simpler than dealing with the
> different oddities of SSL engines or different variations of
> crypto libraries. The unification by the kernel is really
> nice. I have to re-read how much EFI can provide securely
> hardware backed keys, but for everybody working in early
> userspace or initramfs it is nice to be able to utilize
> this without having to drag in megabytes of TLS library.
> 
We don't deny that having TLS handshake in the kernel would be a good 
thing. It's just the hurdles to _get_ there are quite high, and we 
thought that the userspace daemon would be an easier route.

Cheers,

Hannes
Marcel Holtmann Jan. 31, 2023, 2:17 p.m. UTC | #10
Hi Hannes,

>>>>> I've designed a way to pass a connected kernel socket endpoint to
>>>>> user space using the traditional listen/accept mechanism. accept(2)
>>>>> gives us a well-worn building block that can materialize a connected
>>>>> socket endpoint as a file descriptor in a specific user space
>>>>> process. Like any open socket descriptor, the accepted FD can then
>>>>> be passed to a library such as GnuTLS to perform a TLS handshake.
>>>> I can't bring myself to like the new socket family layer.
>>>> I'd like a second opinion on that, if anyone within netdev
>>>> is willing to share..
>>> 
>>> I am not particularly fond of that, either, but the alternative of using
>>> netlink doesn't make it any better
>>> You can't pass the fd/socket directly via netlink messages, you can only
>>> pass the (open!) fd number with the message.
>>> The fd itself _needs_ be be part of the process context of the application
>>> by the time the application processes that message.
>>> Consequently:
>>> - I can't see how an application can _reject_ the message; the fd needs to
>>> be present in the fd table even before the message is processed, rendering
>>> any decision by the application pointless (and I would _so_ love to be proven
>>> wrong on this point)
>>> - It's slightly tricky to handle processes which go away prior to handling
>>> the message; I _think_ the process cleanup code will close the fd, but I guess
>>> it also depends on how and when the fd is stored in the process context.
>>> 
>>> If someone can point me to a solution for these points I would vastly prefer
>>> to move to netlink. But with these issues in place I'm not sure if netlink
>>> doesn't cause more issues than it solves.
>> I think we first need to figure out the security model behind this.
>> For kTLS you have the TLS Handshake messages inline with the TCP
>> socket and thus credentials are given by the owner of that socket.
>> This is simple and makes a lot of sense since whoever opened that
>> connection has to decide to give a client certificate or accept
>> the server certificate (in case of session resumption also provide
>> the PSK).
>> I like to have a generic TLS Handshake interface as well since more
>> and more protocols will take TLS 1.3 as reference and use its
>> handshake protocol. What I would not do is insist on using an fd,
>> because that is what OpenSSL and others are just used to. The TLS
>> libraries need to go away from the fd as IO model and provide
>> appropriate APIs into the TLS Handshake (and also TLS Alert
>> protocol) for a “codec style” operation.
> That's something we have discussed, too.
> We could forward the TLS handshake frames via netlink, thus saving us the headache of passing an entire socket to userspace.
> However, that would require a major infrastructure work on the libraries, and my experience with fixing/updating things in gnutls have not been stellar. So I didn't pursue this route.

I know, utilizing existing TLS libraries is a pain if you don’t do
exactly what they had in mind. I started looking at QUIC a while
back and quickly realized, I have to start looking at TLS 1.3 first.

My past experiences with GnuTLS and OpenSSL have been bad and that is
why iwd (our WiFi daemon) has its own TLS implementation utilizing
AF_ALG and keyctl.
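For readers who have not used AF_ALG, here is a minimal sketch of the pattern: bind a socket to a kernel algorithm, accept() an operation socket, write the data, read the result. It hashes a buffer with the kernel's SHA-256. The function name af_alg_sha256() is made up for illustration, and this is not taken from ELL or iwd.

```c
#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_ALG
#define AF_ALG 38
#endif

/* Hypothetical helper: hash len bytes with the kernel's SHA-256.
 * Returns 0 on success, -1 if AF_ALG or the algorithm is unavailable. */
static int af_alg_sha256(const void *data, size_t len, unsigned char out[32])
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
		.salg_name = "sha256",
	};
	int tfm, op, ret = -1;

	/* The "transform" socket selects the algorithm. */
	tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (tfm < 0)
		return -1;
	if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		goto out_tfm;

	/* Each accept() yields an operation socket for one computation. */
	op = accept(tfm, NULL, NULL);
	if (op < 0)
		goto out_tfm;

	/* Sending without MSG_MORE finalizes the hash; recv() returns it. */
	if (send(op, data, len, 0) == (ssize_t)len &&
	    recv(op, out, 32, 0) == 32)
		ret = 0;

	close(op);
out_tfm:
	close(tfm);
	return ret;
}
```

Calling af_alg_sha256("abc", 3, digest) fills digest with the SHA-256 of "abc" when the kernel's hash user API is available. Keyed algorithms such as "hmac(sha256)" follow the same shape with an extra ALG_SET_KEY setsockopt() on the transform socket.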

>> Fundamentally nothing speaks against TLS Handshake in the kernel. All
>> the core functionality is already present. All KPP, HKDF and even the
>> certificate handling is present. In a simplified view, you just need
>> to give the kernel a keyctl keyring that has the CA certs to verify
>> and provide the keyring with either client or server certificate to
>> use.
>> On a TCP socket for example you could do this:
>> 	setsockopt(fd, SOL_TCP, TCP_ULP, "tls+hs", ..);
>> 	tls_client.cert_id = key_id_cert;
>> 	tls_client.ca_id = key_id_ca;
>> 	setsockopt(fd, SOL_TLS, TLS_CLIENT, &tls_client, ..);
>> Failures or errors would be reported out via socket errors or SCM.
>> And you need some extra options to select cipher ranges or limit to
>> TLS 1.3 only etc.
> Fundamentally you are correct. But these are security relevant areas, and any implementation we do will have to be vetted by some security people. _And_ will have to be maintained by someone well-versed in security, too, lest we have a security breach in the kernel.
> And that person will certainly not be me, so I haven't attempt that route.

While that might have been true in the past and with TLS 1.2 and earlier,
I am not sure that is all true today.

Let’s assume we start with TLS 1.3 and don’t have backwards compatibility
with TLS 1.2 and earlier. And for now we don’t worry about middlebox
compatibility mode since you don’t have to for all the modern protocols
that utilize just the TLS 1.3 handshake like QUIC.

Now the key derivation is just choosing 1 out of 5 ciphers and using
its associated hash algorithm to derive the keys. This is all present
functionality in the kernel and so well tested that it doesn’t worry
me at all. We also have a separate RFC with just sample data so you
can check your derivation functionality. Especially if you check it
against AEAD encrypted sample data, any mistake is fatal.

The shared key portion is just ECDHE or DHE and you either end up with
x25519 or secp256r1 and both are in the kernel. Bluetooth has been
using secp256r1 inside the kernel for many years now. We all know how
to handle and verify public keys from secp256r1 and the neat part is that
it could also be offloaded to hardware if needed. So the private key
doesn’t even need to stay in kernel memory.

So dealing with generating your key material for your cipher is really
simple and similar things have been done for Bluetooth for a long
time now. And it looks like NVMe is also utilizing KPP as of today.

The tricky part is the authentication portion of TLS utilizing
certificates. That part is complicated, but then again, we already
decided the kernel needs to handle certificates for various places
and you have to assume that it is fairly secure.

Now, you need to secure the handshake protocol like any other protocol
and the only difference is that it will lead to key material and
does authentication with certificates. All of it, the kernel already
does in one form or another.

The TLS 1.3 spec is also really nicely written and explicit in
error behavior in case of attempts to attack the protocol. While
implementing my TLS 1.3 only prototype I have been positively
surprised by how clean it is. I personally think they went
overboard with the key verification, but so be it.

Once I have cleaned up my TLS 1.3 prototype, I am happy to take
a stab at a kernel version.

>> But overall it would make using TCP+TLS really simple. The complicated
>> part is providing the key ring. Then again, the CA key ring could be
>> inherited from systemd or some basic component setting it up and
>> sealing it.
> I don't think that's a major concern. The good thing with the keyring is that it can be populated externally, ie one can have a daemon to fetch the certificate and stuff it in the keyring. request_key() and all that ...

It is just painful for the simple reason that there is no real
standard around CA certificates and where to place them. Every
distro is kinda doing it their way and you expect your TLS
library to do the magic.

I would like to see systemd create a keyring of the CA certs, seal it
and then provide it to every process/service it launches. And
for non systemd distros they need to find a way to actually
provide that one keyring that can be used as master for all the
CA certs.
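As a rough sketch of what "create a keyring of the CA certs" looks like at the syscall level, the following creates a keyring under the calling process's keyring and links one key into it. The names sys_add_key() and make_ca_keyring() are made up for illustration, and the payload is a placeholder "user" key rather than a certificate; a real CA entry would be an "asymmetric" key carrying a DER-encoded X.509 blob. Sealing the keyring and handing it to child services is not shown.

```c
#include <linux/keyctl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: glibc has no add_key(); libkeyutils normally
 * provides one, so the raw syscall is used to stay self-contained. */
static long sys_add_key(const char *type, const char *desc,
			const void *payload, size_t plen, int keyring)
{
	return syscall(SYS_add_key, type, desc, payload, plen, keyring);
}

/* Hypothetical helper: create a named keyring in the process keyring and
 * add one placeholder key. Returns the keyring serial, or -1 on failure. */
static long make_ca_keyring(const char *name)
{
	static const char blob[] = "placeholder, not a real certificate";
	long ring, key;

	ring = sys_add_key("keyring", name, NULL, 0,
			   KEY_SPEC_PROCESS_KEYRING);
	if (ring < 0)
		return -1;

	key = sys_add_key("user", "example-ca", blob, sizeof(blob) - 1,
			  (int)ring);
	if (key < 0)
		return -1;

	return ring;
}
```

The returned serial is the kind of identifier a consumer would pass along, in the way the earlier example passes key IDs in tls_client.ca_id.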

We have not bothered with that yet since for WiFi, you always
have a client cert derived from the CA of the server. So you
give a CA cert and a client cert when you connect to WiFi
Enterprise systems.

>> For other protocols or usages the input would be similar. It should
>> be rather straight forward to provide key ring identifiers as mount
>> option or via an ioctl.
>> This however needs to overcome the fear of putting the TLS Handshake
>> into the kernel. I can understand anybody thinking that it is not a
>> good idea and with TLS 1.2 and before it is a bit convoluted and
>> error prone. However starting with TLS 1.3 things are a lot simpler
>> and streamlined. There are few oddities where TLS 1.3 has to look
>> like TLS 1.2 on the wire, but that mainly only affects the TLS
>> record protocol and kTLS does that today already anyway.
> See above. It's not so much 'fear' as rather the logistics of it. Getting hold of a TLS library is reasonably easy (Chuck had another example ready), but massaging it for inclusion into the kernel is quite some effort.
> You might even succeed in convincing the powers that be to include it into the kernel.
> But then you are stuck with having to find a capable maintainer, who is willing _and qualified_ to take the work and answer awkward questions.
> And take the heat when that code introduced a security breach in the linux kernel.
> Which excluded essentially everybody who had been working on this project; we are capable enough engineers in the network and storage space, but deep security issues ... not so much.

Having looked at various TLS libraries during the past few months,
I would not even recommend taking any of them. This needs to be
written from scratch. Some of them are a problem license-wise,
others carry too much legacy for TLS 1.3 support.

I am happy to give this a stab and see how badly I would fail ;)

But as stated above, I am surprised by how good the TLS 1.3 spec is
when it comes to ensuring good and secure implementations. The
thing can be really easily unit tested to death. I think people
underestimate the huge effort from the guys at IETF to make
this simple and more secure.

Some of it comes from the fact that Middleboxes have been doing
stupid things over the years, but it resulted in an even more
encrypted and secure Internet along the way.

I am also wondering if there is a use case for offloading the
TLS Handshake to some crypto accelerator. Especially when it
comes to TLS 1.3 and server role, you really don’t have much
of a handshake. You get all required information in the
ClientHello message and if you like the public key, you can
produce almost all key material in one go.

The reality is that even the shared secret is not required
anymore once you derived your handshake secret.

And WiFi FullMac cards have been offloading EAP-TLS for a while
now.

>> For reference ELL (git.kernel.org/pub/scm/libs/ell/ell.git) has a
>> TLS implementation that utilizes AF_ALG and keyctl for all the
>> basic crypto needs. Certificates and certificate operations are
>> purely done via keyctl and that works nicely. If KPP would finally
>> get a userspace interface, even shared secret derivation would go
>> via kernel crypto.
>> The code is currently TLS 1.2 and earlier, but I have code for
>> TLS 1.3 and also code for utilizing kTLS. It needs a bit more
>> cleanup, but then I am happy to publish it. The modified code
>> for TLS 1.3 support has TLS Handshake+Alert separated from TLS
>> Record protocol and doesn’t even rely on an fd to operate. This
>> comes from the requirement that TLS for WiFi Enterprise (or in
>> the future QUIC) doesn’t have a fd either.
> If you have code to update it to 1.3 I would be very willing to look at it; the main reason why we went with gnutls was that none of us was eager (nor had the knowledge) to really delve into TLS and do fancy things.
> 
> And that was the other thing; we found quite a few TLS implementations, but nearly all of them said '1.3 support to come' ...

True to that. I think even OpenSSL started an effort to have a
QUIC specific API now.

The problem that I found is that TLS Handshake, TLS Alert and
TLS Record protocol are not cleanly separated. They are mixed
together.

For example if I want to use kTLS, I mostly just have to deal
with the TLS Handshake portion. QUIC is specific in that it uses just
the TLS Handshake, and TLS Alerts are converted to QUIC errors.

>> Long story short, who is suppose to run the TLS Handshake if
>> we push it to userspace. There will be never a generic daemon
>> that handles all handshakes since they are all application
>> specific. No daemon can run the TLS Handshake on behalf of
>> Chrome browser for example. This leads me to AF_HANDSHAKE
>> is not a good idea.
>> One nice thing we found with using keyctl for WiFi Enterprise
>> is that we can have certificates that are backed by the TPM.
>> Doing that via keyctl was a lot simpler than dealing with the
>> different oddities of SSL engines or different variations of
>> crypto libraries. The unification by the kernel is really
>> nice. I have to re-read how much EFI can provide securely
>> hardware backed keys, but for everybody working in early
>> userspace or initramfs it is nice to be able to utilize
>> this without having to drag in megabytes of TLS library.
> We don't deny that having TLS handshake in the kernel would be a good thing. It's just the hurdles to _get_ there are quite high, and we thought that the userspace daemon would be an easier route.

My problem with doing an upcall for TLS Handshake messages
is to ensure that the right process has the correct rights
to receive and send the messages. And nobody else can
interfere with that or intercept messages without the proper
rights to do so.

And I am certain that Wireshark would love to get hold of
the unencrypted TLS Handshake traffic. Debugging TLS
and also QUIC transfers is hugely painful. The method of
SSLKEYLOGFILE works but it is so cumbersome and defeats
any kind of live traffic analysis. So having some DIAG
here would help a lot of developers.

Regards

Marcel
Hannes Reinecke Jan. 31, 2023, 2:47 p.m. UTC | #11
On 1/31/23 15:17, Marcel Holtmann wrote:
> Hi Hannes,
> 
[ .. ]
>>> I like to have a generic TLS Handshake interface as well since more
>>> and more protocols will take TLS 1.3 as reference and use its
>>> handshake protocol. What I would not do is insist on using an fd,
>>> because that is what OpenSSL and others are just used to. The TLS
>>> libraries need to go away from the fd as IO model and provide
>>> appropriate APIs into the TLS Handshake (and also TLS Alert
>>> protocol) for a “codec style” operation.
>> That's something we have discussed, too.
>> We could forward the TLS handshake frames via netlink, thus saving
>> us the headache of passing an entire socket to userspace.
>> However, that would require a major infrastructure work on the
>> libraries, and my experience with fixing/updating things in gnutls
>> have not been stellar. So I didn't pursue this route.
> 
> I know, utilizing existing TLS libraries is a pain if you don’t do
> exactly what they had in mind. I started looking at QUIC a while
> back and quickly realized, I have to start looking at TLS 1.3 first.
> 
> My past experience with GnuTLS and OpenSSL have been bad and that is
> why iwd (our WiFi daemon) has its own TLS implementation utilizing
> AF_ALG and keyctl.
> 
I know the feeling :-)

>>> Fundamentally nothing speaks against TLS Handshake in the kernel. All
>>> the core functionality is already present. All KPP, HKDF and even the
>>> certificate handling is present. In a simplified view, you just need
>>> to give the kernel a keyctl keyring that has the CA certs to verify
>>> and provide the keyring with either client or server certificate to
>>> use.
>>> On a TCP socket for example you could do this:
>>> 	setsockopt(fd, SOL_TCP, TCP_ULP, "tls+hs", ..);
>>> 	tls_client.cert_id = key_id_cert;
>>> 	tls_client.ca_id = key_id_ca;
>>> 	setsockopt(fd, SOL_TLS, TLS_CLIENT, &tls_client, ..);
>>> Failures or errors would be reported out via socket errors or SCM.
>>> And you need some extra options to select cipher ranges or limit to
>>> TLS 1.3 only etc.
>> Fundamentally you are correct. But these are security relevant areas,
>> and any implementation we do will have to be vetted by some security
>> people. _And_ will have to be maintained by someone well-versed in
>> security, too, lest we have a security breach in the kernel.
>> And that person will certainly not be me, so I haven't attempted that route.
> 
> While that might have been true in the past and with TLS 1.2 and earlier,
> I am not sure that is all true today.
> 
> Lets assume we start with TLS 1.3 and don’t have backwards compatibility
> with TLS 1.2 and earlier. And for now we don’t worry about Middleboxes
> compatibility mode since you don’t have to for all the modern protocols
> that utilize just the TLS 1.3 handshake like QUIC.
> 
> Now the key derivation is just choosing 1 out of 5 ciphers and using
> its associated hash algorithm to derive the keys. This is all present
> functionality in the kernel and so well tested that it doesn’t worry
> me at all. We also have a separate RFC with just sample data so you
> can check your derivation functionality. Especially if you check it
> against AEAD encrypted sample data, any mistake is fatal.
> 
> The shared key portion is just ECDHE or DHE and you either end up with
> x25519 or secp256r1 and both are in the kernel. Bluetooth has been
> using secp256r1 inside the kernel for many years now. We all know how
> to handle and verify public keys from secp256r1 and neat part is that
> it would be also offloaded to hardware if needed. So the private key
> doesn’t need to stay even in kernel memory.
> 
ECDHE has now been stabilized, too; I needed that for NVMe 
authentication. So all's good there.

> So dealing with generating your key material for your cipher is really
> simple and similar things have been done for Bluetooth for a long
> time now. And it looks like NVMe is also utilizing KPP as of today.
> 
Yes. Guess who did that.

> The tricky part is the authentication portion of TLS utilizing
> certificates. That part is complicated, but then again, we already
> decided the kernel needs to handle certificates for various places
> and you have to assume that it is fairly secure.
> 
> Now, you need to secure the handshake protocol like any other protocol
> and the only difference is that it will lead to key material and
> does authentication with certificates. All of it, the kernel already
> does in one form or another.
> 
> The TLS 1.3 spec is also really nicely written and explicit in
> error behavior in case of attempts to attack the protocol. While
> implementing my TLS 1.3 only prototype I have been positively
> surprised on how clean it is. I personally think they went over
> board with the key verification, but so be it.
> 
> Once I have cleaned up my TLS 1.3 prototype, I am happy to take
> a stab at a kernel version.
> 
Oh, please. I would so love to get it done properly; the TLS
handshake has been a major worry for us.
And even if you would just add TLS 1.3 support for ELL that'll be
fantastic, as then I could take a stab at the netlink frame handling
interface (which shouldn't be too hard).

>>> But overall it would make using TCP+TLS really simple. The complicated
>>> part is providing the key ring. Then again, the CA key ring could be
>>> inherited from systemd or some basic component setting it up and
>>> sealing it.
>> I don't think that's a major concern. The good thing with the keyring
>> is that it can be populated externally, ie one can have a daemon to
>> fetch the certificate and stuff it in the keyring. request_key() and all that ...
> 
> It is just painful for the simple reason that there is no real
> standard around CA certificates and where to place them. Every
> distro is kinda doing it their way and you expect your TLS
> library to do the magic.
> 
> I like to see systemd create a keyring of the CA certs, seal it
> and then provide it to every process/service it launches. And
> for non systemd distros they need to find a way to actually
> provide that one keyring that can be used as master for all the
> CA certs.
> 
> We have not bothered with that yet since for WiFi, you always
> have a client cert derived from the CA of the server. So you
> give a CA cert and a client cert when you connect to WiFi
> Enterprise systems.
> 
The good thing is that NVMe is currently PSK-only, so the certificate 
bit is easy for me. Others like NFS will have to do proper X.509 cert 
handling, but I'll let them worry about that :-)

>>> For other protocols or usages the input would be similar. It should
>>> be rather straight forward to provide key ring identifiers as mount
>>> option or via an ioctl.
>>> This however needs to overcome the fear of putting the TLS Handshake
>>> into the kernel. I can understand anybody thinking that it is not a
>>> good idea and with TLS 1.2 and before it is a bit convoluted and
>>> error prone. However starting with TLS 1.3 things are a lot simpler
>>> and streamlined. There are few oddities where TLS 1.3 has to look
>>> like TLS 1.2 on the wire, but that mainly only affects the TLS
>>> record protocol and kTLS does that today already anyway.
>> See above. It's not so much 'fear' as rather the logistics of it.
>> Getting hold of a TLS library is reasonably easy (Chuck had another
>> example ready), but massaging it for inclusion into the kernel is
>> quite some effort.
>> You might even succeed in convincing the powers that be to include
>> it into the kernel.
>> But then you are stuck with having to find a capable maintainer, who
>> is willing _and qualified_ to take the work and answer awkward questions.
>> And take the heat when that code introduced a security breach in the linux kernel.
>> Which excluded essentially everybody who had been working on this project;
>> we are capable enough engineers in the network and storage space, but
>> deep security issues ... not so much.
> 
> Having looked at various TLS libraries during the past few months,
> I would not even recommend taking any of them. This needs to be
> written from scratch. Some of them are just license wise a problem
> others are just too much legacy for TLS 1.3 support.
> 
> I am happy to give this stab and see how badly I would fail ;)
> 
Cool. Count me in; I'll gladly give it a spin for NVMe-TLS where
I've all the surrounding infrastructure like keyrings and certificate 
generation ready. It really just needs TLS handshake protocol handling...

> But as stated above, I am surprised on how good TLS 1.3 spec is
> when it comes to ensuring good and secure implementations. The
> thing can be really easily unit tested to death. I think people
> underestimate the huge effort from the guys at IETF to make
> this simple and more secure.
> 

True. TLS 1.3 _is_ simple, and it might be that quite a few issues
around TLS are related to older versions.

[ .. ]

>>> The code is currently TLS 1.2 and earlier, but I have code for
>>> TLS 1.3 and also code for utilizing kTLS. It needs a bit more
>>> cleanup, but then I am happy to publish it. The modified code
>>> for TLS 1.3 support has TLS Handshake+Alert separated from TLS
>>> Record protocol and doesn’t even rely on an fd to operate. This
>>> comes from the requirement that TLS for WiFi Enterprise (or in
>>> the future QUIC) doesn’t have a fd either.
>> If you have code to update it to 1.3 I would be very willing to
>> look at it; the main reason why we went with gnutls was that
>> none of us was eager (nor had the knowledge) to really delve
>> into TLS and do fancy things.
>>
>> And that was the other thing; we found quite some TLS implementations,
>> but nearly all of them said '1.3 support to come' ...
> 
> True to that. I think even OpenSSL started an effort to have a
> QUIC specific API now.
> 
> The problem that I found is that TLS Handshake, TLS Alert and
> TLS Record protocol are not cleanly separated. They are mixed
> together.
> 
Yep.

> For example if I want to use kTLS, I mostly just have to deal
> with TLS Handshake portion. QUIC was specific and just uses
> TLS Handshake and TLS Alert are converted to QUIC errors.
> 
Same for us. Alerts don't make sense to us as we have long-lived
connections, so the prime reason for alerts is gone, and we have to
re-establish the connection whenever the cipher is changed. So we will
be converting alerts into errors, too.

>>> Long story short, who is suppose to run the TLS Handshake if
>>> we push it to userspace. There will be never a generic daemon
>>> that handles all handshakes since they are all application
>>> specific. No daemon can run the TLS Handshake on behalf of
>>> Chrome browser for example. This leads me to AF_HANDSHAKE
>>> is not a good idea.
>>> One nice thing we found with using keyctl for WiFi Enterprise
>>> is that we can have certificates that are backed by the TPM.
>>> Doing that via keyctl was a lot simpler than dealing with the
>>> different oddities of SSL engines or different variations of
>>> crypto libraries. The unification by the kernel is really
>>> nice. I have to re-read how much EFI can provide securely
>>> hardware backed keys, but for everybody working in early
>>> userspace or initramfs it is nice to be able to utilize
>>> this without having to drag in megabytes of TLS library.
>>
>> We don't deny that having TLS handshake in the kernel would
>> be a good thing. It's just the hurdles to _get_ there are
>> quite high, and we thought that the userspace daemon would be an easier route.
> 
> My problem with doing an upcall for TLS Handshake messages
> is to ensure that the right process has the correct rights
> to receive and send the messages. And nobody else can
> interfere with that or intercept messages without proper
> right to do so.
> 
Correct. That is a concern.

> And I am certain that Wireshark would love to get hold of
> the unencrypted TLS Handshake traffic. Debugging TLS
> and also QUIC transfers is hugely painful. The method of
> SSLKEYLOGFILE works but it is so cumbersome and defeats
> any kind of live traffic analysis. So having some DIAG
> here would help a lot of developers.
> 
Oh, yes. That would be a nice side-effect.

So, when can I expect the patch?
:-)

Cheers,

Hannes
Chuck Lever III Jan. 31, 2023, 3:18 p.m. UTC | #12
> On Jan 30, 2023, at 11:35 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Sat, 28 Jan 2023 14:06:49 +0000 Chuck Lever III wrote:
>>> On Jan 28, 2023, at 3:32 AM, Jakub Kicinski <kuba@kernel.org> wrote:
>>> On Thu, 26 Jan 2023 11:02:22 -0500 Chuck Lever wrote:  
>>>> I've designed a way to pass a connected kernel socket endpoint to
>>>> user space using the traditional listen/accept mechanism. accept(2)
>>>> gives us a well-worn building block that can materialize a connected
>>>> socket endpoint as a file descriptor in a specific user space
>>>> process. Like any open socket descriptor, the accepted FD can then
>>>> be passed to a library such as GnuTLS to perform a TLS handshake.  
>>> 
>>> I can't bring myself to like the new socket family layer.  
>> 
>> poll/listen/accept is the simplest and most natural way of
>> materializing a socket endpoint in a process that I can think
>> of. It's a well-understood building block. What specifically
>> is troubling you about it?
> 
> poll/listen/accept yes, but that's not the entire socket interface. 
> Our overall experience with the TCP ULPs is rather painful, proxying
> all the other callbacks here may add another dimension.

> Also I have a fear (perhaps unjustified) of reusing constructs which are
> cornerstones of the networking stack and treating them as abstractions.

OK, then I take this as a NAK for listen/poll/accept in
any form. I need some finality here because we need to
move forward.


>>> I'd like a second opinion on that, if anyone within netdev
>>> is willing to share..  
>> 
>> Hopefully that opinion comes with an alternative way of getting
>> a connected kernel socket endpoint up to user space without
>> race issues.
> 
> If the user application decides the fd, wouldn't that solve the problem
> in netlink?

David or Hannes will have to answer that because they
understand the races better than I do.

However, I will prototype "fd passing" with netlink and
ignore the races for now, just to get something to
continue the conversation.


>  kernel                          user space
> 
>   notification     ---------->
> (new connection awaits)
> 
>                    <----------
>                                  request (target fd=100)
> 
>                    ---------->
>   reply
> (fd 100 is installed;
>  extra params)

What type of notification do you prefer for this? You've
said in the past that RT signals are not appropriate. It
would be easy for user space to simply wait on nlm_recvmsg()
but I worry that netlink is not a reliable message service.

And, do you have a preferred mechanism or code sample for
installing a socket descriptor? 


--
Chuck Lever
Jakub Kicinski Jan. 31, 2023, 7:30 p.m. UTC | #13
On Tue, 31 Jan 2023 15:18:02 +0000 Chuck Lever III wrote:
> > On Jan 30, 2023, at 11:35 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Sat, 28 Jan 2023 14:06:49 +0000 Chuck Lever III wrote:  
> >> poll/listen/accept is the simplest and most natural way of
> >> materializing a socket endpoint in a process that I can think
> >> of. It's a well-understood building block. What specifically
> >> is troubling you about it?  
> > 
> > poll/listen/accept yes, but that's not the entire socket interface. 
> > Our overall experience with the TCP ULPs is rather painful, proxying
> > all the other callbacks here may add another dimension.  
> >
> > Also I have a fear (perhaps unjustified) of reusing constructs which are
> > cornerstones of the networking stack and treating them as abstractions.  
> 
> OK, then I take this as a NAK for listen/poll/accept in
> any form. I need some finality here because we need to
> move forward.

To be clear - if Paolo, Eric or someone else who knows the socket layer
better than I do thinks that your current implementation is good then 
I won't stand in the way. 

> >  kernel                          user space
> > 
> >   notification     ---------->
> > (new connection awaits)
> > 
> >                    <----------
> >                                  request (target fd=100)
> >   
> >                    ---------->  
> >   reply
> > (fd 100 is installed;
> >  extra params)  
> 
> What type of notification do you prefer for this? You've
> said in the past that RT signals are not appropriate. It
> would be easy for user space to simply wait on nlm_recvmsg()
> but I worry that netlink is not a reliable message service.

There are various bits and bobs in netlink which are supposed to help.
A socket which subscribed to notifications should get an error if a
delivery fails (netlink_overrun()). The kernel commonly supports a GET
request which the user space can exercise after missing notifications 
to get back in sync.
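
For illustration, a minimal user space sketch of that pattern. On a
netlink socket, lost notifications show up as ENOBUFS from recv(), and
the listener is expected to re-read state at that point. The names
nl_classify(), nl_wait_notification() and the resync callback are made
up for this sketch; the callback stands in for whatever GET/dump
request the family provides.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

enum nl_event {
	NL_EVENT_MSG,		/* a message was received */
	NL_EVENT_RESYNC,	/* notifications were lost; re-read state */
	NL_EVENT_RETRY,		/* interrupted; call recv() again */
	NL_EVENT_FATAL,		/* any other error */
};

/* Classify the result of one recv() on a notification socket. */
enum nl_event nl_classify(ssize_t n, int err)
{
	if (n >= 0)
		return NL_EVENT_MSG;
	if (err == ENOBUFS)
		return NL_EVENT_RESYNC;
	if (err == EINTR)
		return NL_EVENT_RETRY;
	return NL_EVENT_FATAL;
}

/* Hypothetical hook: issue a GET/dump request to get back in sync. */
typedef int (*nl_resync_fn)(int nlfd);

/*
 * Block until one message arrives. Returns its length, or a negative
 * errno. After an overrun the resync hook runs before waiting again.
 */
ssize_t nl_wait_notification(int nlfd, void *buf, size_t len,
			     nl_resync_fn resync)
{
	for (;;) {
		ssize_t n = recv(nlfd, buf, len, 0);
		int err = n < 0 ? errno : 0;
		int ret;

		switch (nl_classify(n, err)) {
		case NL_EVENT_MSG:
			return n;
		case NL_EVENT_RETRY:
			continue;
		case NL_EVENT_RESYNC:
			if (!resync)
				return -ENOBUFS;
			ret = resync(nlfd);
			if (ret < 0)
				return ret;
			continue;
		default:
			return -err;
		}
	}
}
```

The classification is split out so the overrun path can be exercised
without having to overflow a real socket buffer.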

> And, do you have a preferred mechanism or code sample for
> installing a socket descriptor? 

I must admit - I don't.
Chuck Lever III Jan. 31, 2023, 7:34 p.m. UTC | #14
> On Jan 31, 2023, at 2:30 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
>> On Tue, 31 Jan 2023 15:18:02 +0000 Chuck Lever III wrote:
>> And, do you have a preferred mechanism or code sample for
>> installing a socket descriptor? 
> 
> I must admit - I don't.

As part of responding to the handshake daemon's netlink call,
I'm thinking of doing something like:

get_unused_fd_flags(), then sock_alloc_file(), and then fd_install() 


--
Chuck Lever
Marcel Holtmann Jan. 31, 2023, 8:23 p.m. UTC | #15
Hi Chuck,

>>> And, do you have a preferred mechanism or code sample for
>>> installing a socket descriptor? 
>> 
>> I must admit - I don't.
> 
> As part of responding to the handshake daemon's netlink call,
> I'm thinking of doing something like:
> 
> get_unused_fd_flags(), then sock_alloc_file(), and then fd_install() 

can we be really careful here. fd passing over Unix sockets is already
complicated to get right on the receiver side. We had this with D-Bus
and man, can you screw up things here. The problem is really that your
fd is part of the receiving process as soon as you receive that message
and you are _required_ to take care of it. Simple things like not
setting CLOEXEC is already a path to disaster. And Unix sockets have
SCM_RIGHTS and other fun stuff. I don’t remember having that for
Netlink. And don’t forget the SELinux etc. folks that might want to
have some control here.

Regards

Marcel
Benjamin Coddington Jan. 31, 2023, 8:26 p.m. UTC | #16
On 31 Jan 2023, at 14:34, Chuck Lever III wrote:

>> On Jan 31, 2023, at 2:30 PM, Jakub Kicinski <kuba@kernel.org> wrote:
>>
>>> On Tue, 31 Jan 2023 15:18:02 +0000 Chuck Lever III wrote:
>>> And, do you have a preferred mechanism or code sample for
>>> installing a socket descriptor?
>>
>> I must admit - I don't.
>
> As part of responding to the handshake daemon's netlink call,
> I'm thinking of doing something like:
>
> get_unused_fd_flags(), then sock_alloc_file(), and then fd_install()

It seems odd to me that we're not taking advantage of request_key() to do
this work.  It was designed for exactly this problem: the kernel needs
something/work (a tls handshake) from/done in userspace.

I have a working implementation here:
https://github.com/bcodding/linux/tree/tls_keys

Perhaps there's no interest because no one likes call_usermode_helper(),
which cannot figure out what set of namespace(s) to use, but there's a
solution for that as well: keyagents can represent a running process to
satisfy request_key().
https://lore.kernel.org/linux-nfs/cover.1657624639.git.bcodding@redhat.com/

Keyagents are not required to simply pass socket fds to userspace, however
they do create a flexible way for containers to specify exactly where the
kernel should send various key requests.

I am happy to continue to explain how these approaches work, though I would
also much prefer to see us doing handshakes in-kernel.

Ben
Marcel Holtmann Jan. 31, 2023, 8:32 p.m. UTC | #17
Hi Hannes,

>>>> I like to have a generic TLS Handshake interface as well since more
>>>> and more protocols will take TLS 1.3 as reference and use its
>>>> handshake protocol. What I would not do is insist on using an fd,
>>>> because that is what OpenSSL and others are just used to. The TLS
>>>> libraries need to go away from the fd as IO model and provide
>>>> appropriate APIs into the TLS Handshake (and also TLS Alert
>>>> protocol) for a “codec style” operation.
>>> That's something we have discussed, too.
>>> We could forward the TLS handshake frames via netlink, thus saving
>>> us the headache of passing an entire socket to userspace.
>>> However, that would require a major infrastructure work on the
>>> libraries, and my experience with fixing/updating things in gnutls
>>> have not been stellar. So I didn't pursue this route.
>> I know, utilizing existing TLS libraries is a pain if you don’t do
>> exactly what they had in mind. I started looking at QUIC a while
>> back and quickly realized, I have to start looking at TLS 1.3 first.
>> My past experience with GnuTLS and OpenSSL have been bad and that is
>> why iwd (our WiFi daemon) has its own TLS implementation utilizing
>> AF_ALG and keyctl.
> I know the feeling :-)
> 
>>>> Fundamentally nothing speaks against TLS Handshake in the kernel. All
>>>> the core functionality is already present. All KPP, HKDF and even the
>>>> certificate handling is present. In a simplified view, you just need
>>>> to give the kernel a keyctl keyring that has the CA certs to verify
>>>> and provide the keyring with either client or server certificate to
>>>> use.
>>>> On a TCP socket for example you could do this:
>>>> 	setsockopt(fd, SOL_TCP, TCP_ULP, “tls+hs", ..);
>>>> 	tls_client.cert_id = key_id_cert;
>>>> 	tls_client.ca_id = key_id_ca;
>>>> 	setsockopt(fd, SOL_TLS, TLS_CLIENT, &tls_client, ..);
>>>> Failures or errors would be reported out via socket errors or SCM.
>>>> And you need some extra options to select cipher ranges or limit to
>>>> TLS 1.3 only etc.
>>> Fundamentally you are correct. But these are security relevant areas,
>>> and any implementation we do will have to be vetted by some security
>>> people. _And_ will have to be maintained by someone well-versed in
>>> security, too, lest we have a security breach in the kernel.
>>> And that person will certainly not be me, so I haven't attempted that route.
>> While that might have been true in the past and with TLS 1.2 and earlier,
>> I am not sure that is all true today.
>> Lets assume we start with TLS 1.3 and don’t have backwards compatibility
>> with TLS 1.2 and earlier. And for now we don’t worry about Middleboxes
>> compatibility mode since you don’t have to for all the modern protocols
>> that utilize just the TLS 1.3 handshake like QUIC.
>> Now the key derivation is just choosing 1 out of 5 ciphers and using
>> its associated hash algorithm to derive the keys. This is all present
>> functionality in the kernel and so well tested that it doesn’t worry
>> me at all. We also have a separate RFC with just sample data so you
>> can check your derivation functionality. Especially if you check it
>> against AEAD encrypted sample data, any mistake is fatal.
>> The shared key portion is just ECDHE or DHE and you either end up with
>> x25519 or secp256r1 and both are in the kernel. Bluetooth has been
>> using secp256r1 inside the kernel for many years now. We all know how
>> to handle and verify public keys from secp256r1 and neat part is that
>> it would be also offloaded to hardware if needed. So the private key
>> doesn’t need to stay even in kernel memory.
> ECDHE has now been stabilized, too; I needed that for NVMe authentication. So all's good there.
> 
>> So dealing with generating your key material for your cipher is really
>> simple and similar things have been done for Bluetooth for a long
>> time now. And it looks like NVMe is also utilizing KPP as of today.
> Yes. Guess who did that.
> 
>> The tricky part is the authentication portion of TLS utilizing
>> certificates. That part is complicated, but then again, we already
>> decided the kernel needs to handle certificates for various places
>> and you have to assume that it is fairly secure.
>> Now, you need to secure the handshake protocol like any other protocol
>> and the only difference is that it will lead to key material and
>> does authentication with certificates. All of it, the kernel already
>> does in one form or another.
>> The TLS 1.3 spec is also really nicely written and explicit in
>> error behavior in case of attempts to attack the protocol. While
>> implementing my TLS 1.3 only prototype I have been positively
>> surprised on how clean it is. I personally think they went over
>> board with the key verification, but so be it.
>> Once I have cleaned up my TLS 1.3 prototype, I am happy to take
>> a stab at a kernel version.
> Oh, please. I would so love it to get it done properly; the TLS handshake has been a major worry for us.
> And even if you would just add TLS1.3 support for ELL that'll be fantastic, as then I could give it a stab at the netlink frame handling interface (which shouldn't be too hard).
> 
>>>> But overall it would make using TCP+TLS really simple. The complicated
>>>> part is providing the key ring. Then again, the CA key ring could be
>>>> inherited from systemd or some basic component setting it up and
>>>> sealing it.
>>> I don't think that's a major concern. The good thing with the keyring
>>> is that it can be populated externally, ie one can have a daemon to
>>> fetch the certificate and stuff it in the keyring. request_key() and all that ...
>> It is just painful for the simple reason that there is no real
>> standard around CA certificates and where to place them. Every
>> distro is kinda doing it their way and you expect your TLS
>> library to do the magic.
>> I like to see systemd create a keyring of the CA certs, seal it
>> and then provide it to every process/service it launches. And
>> for non systemd distros they need to find a way to actually
>> provide that one keyring that can be used as master for all the
>> CA certs.
>> We have not bothered with that yet since for WiFi, you always
>> have a client cert derived from the CA of the server. So you
>> give a CA cert and a client cert when you connect to WiFi
>> Enterprise systems.
> The good thing is that NVMe is currently PSK-only, so the certificate bit is easy for me. Others like NFS will have to do proper X.509 cert handling, but I'll let them worry about that :-)

Interesting. I have not looked at the PSK part yet. Do you require PSK-only
or ECDHE+PSK? From what I quickly glanced at in the spec, PSK-only is
really simple. That is the simplest TLS Handshake I have seen and
I should give that a spin.

So NVMe agrees on the PSK out-of-band? I might be able to read up on it,
but I can only keep so many specifications in memory ;)

>>>> For other protocols or usages the input would be similar. It should
>>>> be rather straight forward to provide key ring identifiers as mount
>>>> option or via an ioctl.
>>>> This however needs to overcome the fear of putting the TLS Handshake
>>>> into the kernel. I can understand anybody thinking that it is not a
>>>> good idea and with TLS 1.2 and before it is a bit convoluted and
>>>> error prone. However starting with TLS 1.3 things are a lot simpler
>>>> and streamlined. There are few oddities where TLS 1.3 has to look
>>>> like TLS 1.2 on the wire, but that mainly only affects the TLS
>>>> record protocol and kTLS does that today already anyway.
>>> See above. It's not so much 'fear' as rather the logistics of it.
>>> Getting hold of a TLS library is reasonably easy (Chuck had another
>>> example ready), but massaging it for inclusion into the kernel is
>>> quite some effort.
>>> You might even succeed in convincing the powers that be to include
>>> it into the kernel.
>>> But then you are stuck with having to find a capable maintainer, who
>>> is willing _and qualified_ to take the work and answer awkward questions.
>>> And take the heat when that code introduced a security breach in the linux kernel.
>>> Which excluded essentially everybody who had been working on this project;
>>> we are capable enough engineers in the network and storage space, but
>>> deep security issues ... not so much.
>> Having looked at various TLS libraries during the past few months,
>> I would not even recommend taking any of them. This needs to be
>> written from scratch. Some of them are just license wise a problem
>> others are just too much legacy for TLS 1.3 support.
>> I am happy to give this stab and see how badly I would fail ;)
> Cool. Count me in; I'll gladly give it a spin for NVMe-TLS where
> I've all the surrounding infrastructure like keyrings and certificate generation ready. It really just needs TLS handshake protocol handling...
> 
>> But as stated above, I am surprised on how good TLS 1.3 spec is
>> when it comes to ensuring good and secure implementations. The
>> thing can be really easily unit tested to death. I think people
>> underestimate the huge effort from the guys at IETF to make
>> this simple and more secure.
> 
> True. TLS 1.3 _is_ simple, and it might be that quite a few issues
> around TLS are related to older versions.
> 
> [ .. ]
> 
>>>> The code is currently TLS 1.2 and earlier, but I have code for
>>>> TLS 1.3 and also code for utilizing kTLS. It needs a bit more
>>>> cleanup, but then I am happy to publish it. The modified code
>>>> for TLS 1.3 support has TLS Handshake+Alert separated from TLS
>>>> Record protocol and doesn’t even rely on an fd to operate. This
>>>> comes from the requirement that TLS for WiFi Enterprise (or in
>>>> the future QUIC) doesn’t have a fd either.
>>> If you have code to update it to 1.3 I would be very willing to
>>> look at it; the main reason why we went with gnutls was that
>>> none of us was eager (nor had the knowledge) to really delve
>>> into TLS and do fancy things.
>>> 
>>> And that was the other thing; we found quite a few TLS implementations,
>>> but nearly all of them said '1.3 support to come' ...
>> True to that. I think even OpenSSL started an effort to have a
>> QUIC specific API now.
>> The problem that I found is that TLS Handshake, TLS Alert and
>> TLS Record protocol are not cleanly separated. They are mixed
>> together.
> Yep.
> 
>> For example if I want to use kTLS, I mostly just have to deal
>> with TLS Handshake portion. QUIC was specific and just uses
>> TLS Handshake and TLS Alert are converted to QUIC errors.
> Same for us. Alerts don't make sense to us as we have long-lived connections, so the prime reason for alerts is gone, and we have to re-establish the connection whenever the cipher is changed. So we will be converting alerts into errors, too.

What is the solution for sequence number exhaustion? Do you
re-connect or do you re-key via TLS?

>>>> Long story short, who is supposed to run the TLS Handshake if
>>>> we push it to userspace. There will never be a generic daemon
>>>> that handles all handshakes since they are all application
>>>> specific. No daemon can run the TLS Handshake on behalf of
>>>> Chrome browser for example. This leads me to think AF_HANDSHAKE
>>>> is not a good idea.
>>>> One nice thing we found with using keyctl for WiFi Enterprise
>>>> is that we can have certificates that are backed by the TPM.
>>>> Doing that via keyctl was a lot simpler than dealing with the
>>>> different oddities of SSL engines or different variations of
>>>> crypto libraries. The unification by the kernel is really
>>>> nice. I have to re-read how much EFI can provide securely
>>>> hardware backed keys, but for everybody working in early
>>>> userspace or initramfs it is nice to be able to utilize
>>>> this without having to drag in megabytes of TLS library.
>>>
>>> We don't deny that having TLS handshake in the kernel would
>>> be a good thing. It's just the hurdles to _get_ there are
>>> quite high, and we thought that the userspace daemon would be an easier route.
>> My problem with doing an upcall for TLS Handshake messages
>> is to ensure that the right process has the correct rights
>> to receive and send the messages. And nobody else can
>> interfere with that or intercept messages without proper
>> right to do so.
> Correct. That is a concern.
> 
>> And I am certain that Wireshark would love to get hold of
>> the unencrypted TLS Handshake traffic. Debugging TLS
>> and also QUIC transfers is hugely painful. The method of
>> SSLKEYLOGFILE works but it is so cumbersome and defeats
>> any kind of live traffic analysis. So having some DIAG
>> here would help a lot of developers.
> Oh, yes. That would be nice side-effect.
> 
> So, when can I expect the patch?
> :-)

Lol. I need to get a few things cleaned up in the userspace
prototype I have. Then I'll take a stab at the kernel code. I do
need to build myself a test setup for PSK-only since I have
not yet bothered with that.

Regards

Marcel
Hannes Reinecke Feb. 1, 2023, 7:09 a.m. UTC | #18
On 1/31/23 21:32, Marcel Holtmann wrote:
> Hi Hannes,
> 
[ .. ]
>>> It is just painful for the simple reason that there is no real
>>> standard around CA certificates and where to place them. Every
>>> distro is kinda doing it their way and you expect your TLS
>>> library to do the magic.
>>> I like to see systemd create a keyring of the CA certs, seal it
>>> and then provide it to every process/service it launches. And
>>> for non systemd distros they need to find a way to actually
>>> provide that one keyring that can be used as master for all the
>>> CA certs.
>>> We have not bothered with that yet since for WiFi, you always
>>> have a client cert derived from the CA of the server. So you
>>> give a CA cert and a client cert when you connect to WiFi
>>> Enterprise systems.
>> The good thing is that NVMe is currently PSK-only, so the certificate
>> bit is easy for me. Others like NFS will have to do proper X.509 cert
>> handling, but I'll let them worry about that :-)
> 
> Interesting. I have not looked at the PSK part yet. Do you require PSK-only
> or ECDHE+PSK? From what I quickly glanced at in the spec, PSK-only is
> really simple. That is the simplest TLS Handshake I have seen and
> I should give that a spin.
> 

ECDHE+PSK, sadly; the PSK-only method will be deprecated eventually.
But I've got code for that already, so that's nothing to worry about.

> So NVMe agrees on the PSK out-of-band? I might be able to read up on it,
> but I can only keep so many specifications in memory ;)
> 
Yep, that's the plan. Or, one could say, the 'P' in PSK :-)
That's where keyrings come in; the idea is that an external agent populates
the keyring, and the kernel code just looks up keys from there.

[ .. ]
>>>> And that was the other thing; we found quite a few TLS implementations,
>>>> but nearly all of them said '1.3 support to come' ...
>>> True to that. I think even OpenSSL started an effort to have a
>>> QUIC specific API now.
>>> The problem that I found is that TLS Handshake, TLS Alert and
>>> TLS Record protocol are not cleanly separated. They are mixed
>>> together.
>> Yep.
>>
>>> For example if I want to use kTLS, I mostly just have to deal
>>> with TLS Handshake portion. QUIC was specific and just uses
>>> TLS Handshake and TLS Alert are converted to QUIC errors.
>> Same for us. Alerts don't make sense to us as we have long-lived
>> connections, so the prime reason for alerts is gone, and we have
>> to re-establish the connection whenever the cipher is changed. So
>> we will be converting alerts into errors, too.
> 
> What is the solution for sequence number exhaustion? Do you
> re-connect or do you re-key via TLS?
> 
Reconnect. We don't have a good way of allowing for re-keying, as the
key material directly relates to information retrieved from the protocol
itself (here: the TLS key might be derived from the DH-CHAP authentication
protocol running in NVMe space).
So for any re-key operation we'd have to re-run that protocol. At
which point we might as well kill the connection and start over, as this
is basically the same operation.

[ .. ]
>>>>> Long story short, who is suppose to 
>>> And I am certain that Wireshark would love to get hold of
>>> the unencrypted TLS Handshake traffic. Debugging TLS
>>> and also QUIC transfers is hugely painful. The method of
>>> SSLKEYLOGFILE works but it is so cumbersome and defeats
>>> any kind of live traffic analysis. So having some DIAG
>>> here would help a lot of developers.
>> Oh, yes. That would be nice side-effect.
>>
>> So, when can I expect the patch?
>> :-)
> 
> Lol. I need to get a few things cleaned up in the userspace
> prototype I have. Then I'll take a stab at the kernel code. I do
> need to build myself a test setup for PSK-only since I have
> not yet bothered with that.
> 
I'd be happy to have the userspace code; my plan is to work on a frame 
forwarder (such that the TLS handshake and alert frames are forwarded to 
userspace via netlink).
Then I can repurpose the daemon code from the accept solution to handle 
those netlink frames, and use your library for the tls handshake.

That would give us a nice proof of concept, and if designed properly the 
frame forwarder could later be modified to redirect to the in-kernel code.

So having the userspace code already is a bonus.
And the 'patch' really was just for the userspace library :-)

Cheers,

Hannes
Xin Long Feb. 2, 2023, 5:13 p.m. UTC | #19
On Tue, Jan 31, 2023 at 9:24 AM Marcel Holtmann <marcel@holtmann.org> wrote:
>
> I know, utilizing existing TLS libraries is a pain if you don’t do
> exactly what they had in mind. I started looking at QUIC a while
> back and quickly realized, I have to start looking at TLS 1.3 first.
>
> My past experience with GnuTLS and OpenSSL have been bad and that is
> why iwd (our WiFi daemon) has its own TLS implementation utilizing
> AF_ALG and keyctl.
Hi Marcel,

I'm no expert on TLS, but I'm a supporter of in-kernel TLS 1.3 Handshake
implementation :). When working on implementing in-kernel QUIC protocol,
the code looks a lot simpler with the pure in-kernel TLS 1.3 Handshake APIs
than the upcall method, and I believe the NFS over TLS 1.3 in kernel will
feel the same.

>
> While that might have been true in the past and with TLS 1.2 and earlier,
> I am not sure that is all true today.
>
> Lets assume we start with TLS 1.3 and don’t have backwards compatibility
> with TLS 1.2 and earlier. And for now we don’t worry about Middleboxes
> compatibility mode since you don’t have to for all the modern protocols
> that utilize just the TLS 1.3 handshake like QUIC.
>
> Now the key derivation is just choosing 1 out of 5 ciphers and using
> its associated hash algorithm to derive the keys. This is all present
> functionality in the kernel and so well tested that it doesn’t worry
> me at all. We also have a separate RFC with just sample data so you
> can check your derivation functionality. Especially if you check it
> against AEAD encrypted sample data, any mistake is fatal.
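
As a small illustration of how little there is to the derivation
input: every TLS 1.3 secret is expanded with HKDF-Expand-Label, whose
"info" argument is the HkdfLabel structure from RFC 8446 section 7.1
(a 16-bit output length, a length-prefixed "tls13 " + label, and a
length-prefixed context). The sketch below only builds that buffer;
the HKDF itself would come from the kernel crypto API or a library.
tls13_hkdf_label() is a name made up for this example and is not taken
from any of the prototypes mentioned in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Serialize an HkdfLabel into out[]. Returns the number of bytes
 * written, or 0 if the arguments are out of range or out[] is too
 * small.
 */
size_t tls13_hkdf_label(uint8_t *out, size_t out_size, uint16_t length,
			const char *label, const uint8_t *ctx,
			size_t ctx_len)
{
	static const char prefix[] = "tls13 ";
	const size_t prefix_len = sizeof(prefix) - 1;
	const size_t label_len = strlen(label);
	const size_t full_len = prefix_len + label_len;
	/* 2 (length) + 1 + label + 1 + context */
	const size_t need = 2 + 1 + full_len + 1 + ctx_len;
	uint8_t *p = out;

	if (label_len == 0 || full_len > 255 || ctx_len > 255 ||
	    need > out_size)
		return 0;

	*p++ = (uint8_t)(length >> 8);	/* uint16 length, big endian */
	*p++ = (uint8_t)(length & 0xff);
	*p++ = (uint8_t)full_len;	/* opaque label<7..255> */
	memcpy(p, prefix, prefix_len);
	p += prefix_len;
	memcpy(p, label, label_len);
	p += label_len;
	*p++ = (uint8_t)ctx_len;	/* opaque context<0..255> */
	if (ctx_len)
		memcpy(p, ctx, ctx_len);
	p += ctx_len;

	return (size_t)(p - out);
}
```

For a 16-byte traffic key (label "key", empty context) this produces
13 bytes, which can be compared against published handshake traces
(presumably the sample-data RFC meant above is RFC 8448).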
>
> The shared key portion is just ECDHE or DHE and you either end up with
> x25519 or secp256r1 and both are in the kernel. Bluetooth has been
> using secp256r1 inside the kernel for many years now. We all know how
> to handle and verify public keys from secp256r1 and neat part is that
> it would be also offloaded to hardware if needed. So the private key
> doesn’t need to stay even in kernel memory.
>
> So dealing with generating your key material for your cipher is really
> simple and similar things have been done for Bluetooth for a long
> time now. And it looks like NVMe is also utilizing KPP as of today.
>
> The tricky part is the authentication portion of TLS utilizing
> certificates. That part is complicated, but then again, we already
> decided the kernel needs to handle certificates for various places
> and you have to assume that it is fairly secure.
>
> Now, you need to secure the handshake protocol like any other protocol
> and the only difference is that it will lead to key material and
> does authentication with certificates. All of it, the kernel already
> does in one form or another.
>
> The TLS 1.3 spec is also really nicely written and explicit in
> error behavior in case of attempts to attack the protocol. While
> implementing my TLS 1.3 only prototype I have been positively
> surprised on how clean it is. I personally think they went over
> board with the key verification, but so be it.
>
> Once I have cleaned up my TLS 1.3 prototype, I am happy to take
> a stab at a kernel version.
>
I'm glad to hear that you're planning to add this in kernel space, and I
agree that there won't be a lot of things to do in kernel due to the kernel
crypto APIs. I have also worked on a TLS 1.3 Handshake prototype based on
the torvalds/linux kernel code. To avoid duplicated work on your side,
I'm sharing my code here:

  https://github.com/lxin/tls_hs/blob/master/crypto/tls_hs.c

and the TLS_HS APIs docs for QUIC and NFS use are here:

  https://github.com/lxin/tls_hs

Hopefully, it will help.

Besides, others raised some security concerns, which is why I didn't
continue:

  https://github.com/lxin/tls_hs#the-security-issues

It will be great if we can have your opinions about it.

Thanks.
Hannes Reinecke Feb. 2, 2023, 5:32 p.m. UTC | #20
On 2/2/23 18:13, Xin Long wrote:
> On Tue, Jan 31, 2023 at 9:24 AM Marcel Holtmann <marcel@holtmann.org> wrote:
>>
>> I know, utilizing existing TLS libraries is a pain if you don’t do
>> exactly what they had in mind. I started looking at QUIC a while
>> back and quickly realized, I have to start looking at TLS 1.3 first.
>>
>> My past experience with GnuTLS and OpenSSL have been bad and that is
>> why iwd (our WiFi daemon) has its own TLS implementation utilizing
>> AF_ALG and keyctl.
> Hi Marcel,
> 
> I'm no expert on TLS, but I'm a supporter of in-kernel TLS 1.3 Handshake
> implementation :). When working on implementing in-kernel QUIC protocol,
> the code looks a lot simpler with the pure in-kernel TLS 1.3 Handshake APIs
> than the upcall method, and I believe the NFS over TLS 1.3 in kernel will
> feel the same.
> 
>>
>> While that might have been true in the past and with TLS 1.2 and earlier,
>> I am not sure that is all true today.
>>
>> Lets assume we start with TLS 1.3 and don’t have backwards compatibility
>> with TLS 1.2 and earlier. And for now we don’t worry about Middleboxes
>> compatibility mode since you don’t have to for all the modern protocols
>> that utilize just the TLS 1.3 handshake like QUIC.
>>
>> Now the key derivation is just choosing 1 out of 5 ciphers and using
>> its associated hash algorithm to derive the keys. This is all present
>> functionality in the kernel and so well tested that it doesn’t worry
>> me at all. We also have a separate RFC with just sample data so you
>> can check your derivation functionality. Especially if you check it
>> against AEAD encrypted sample data, any mistake is fatal.
>>
>> The shared key portion is just ECDHE or DHE and you either end up with
>> x25519 or secp256r1 and both are in the kernel. Bluetooth has been
>> using secp256r1 inside the kernel for many years now. We all know how
>> to handle and verify public keys from secp256r1 and neat part is that
>> it would be also offloaded to hardware if needed. So the private key
>> doesn’t need to stay even in kernel memory.
>>
>> So dealing with generating your key material for your cipher is really
>> simple and similar things have been done for Bluetooth for a long
>> time now. And it looks like NVMe is also utilizing KPP as of today.
>>
>> The tricky part is the authentication portion of TLS utilizing
>> certificates. That part is complicated, but then again, we already
>> decided the kernel needs to handle certificates for various places
>> and you have to assume that it is fairly secure.
>>
>> Now, you need to secure the handshake protocol like any other protocol
>> and the only difference is that it will lead to key material and
>> does authentication with certificates. All of it, the kernel already
>> does in one form or another.
>>
>> The TLS 1.3 spec is also really nicely written and explicit in
>> error behavior in case of attempts to attack the protocol. While
>> implementing my TLS 1.3 only prototype I have been positively
>> surprised on how clean it is. I personally think they went over
>> board with the key verification, but so be it.
>>
>> Once I have cleaned up my TLS 1.3 prototype, I am happy to take
>> a stab at a kernel version.
>>
> I'm glad to hear that you're planning to add this in kernel space, and I
> agree that there won't be a lot of things to do in kernel due to the kernel
> crypto APIs. There is also a TLS 1.3 Handshake prototype I worked on and
> based on the torvalds/linux kernel code. In case of any duplicate work
> when you're doing it, I just share my code here:
> 
>    https://github.com/lxin/tls_hs/blob/master/crypto/tls_hs.c
> 
> and the TLS_HS APIs docs for QUIC and NFS use are here:
> 
>    https://github.com/lxin/tls_hs
> 
> Hopefully, it will help.
> 
> Besides, There were some security concerns from others for which I didn't
> continue:
> 
>    https://github.com/lxin/tls_hs#the-security-issues
> 
> It will be great if we can have your opinions about it.
> 
> Thanks.
Wow.
That certainly looks good; I'll give it a go and try to integrate it 
with my NVMe-TLS stuff.
Thanks a lot for this!

As for your security concerns:
- Certificate management _is_ complex, but we have the kernel keyring
   to help us out. The cert can be looked up in the keyring, and one can
   even call request_key() to get userspace to supply us with one.
   With that we can just use the kernel keyring to look up certificates
   and not worry about key management itself.
- TLS 1.3 is nailed down, and won't inherit more encryption algorithms.
   So nothing to worry about
- Corner cases ... yes, the code needs to be validated.
- RSA rewrite ... can't we disallow RSA for in-kernel usage?
- mpi code not constant-time safe; might be. But that goes for all
   in-kernel users, and I'm not sure if _we_ have to worry about that.
- DH and ECDH potentially broken: not anymore; got fixed up to
   be FIPS compliant
- Take advantage of decades of work in userspace: that's _precisely_
   why we're sticking with TLS 1.3. Most of the backwards-compat stuff
   got removed, so we _don't_ have to worry about that.

So it's at least worth a shot, to see where we end up.
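
To make the keyring point above a bit more concrete, here is a toy
userspace model of the "look it up, and fall back to an upcall on a miss"
flow. It is only a sketch: every toy_* name is hypothetical, the
certificate is a plain string, and nothing here is the kernel keyring API
(the real entry point mentioned above is request_key(), which this does
not call).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model; none of these names are kernel APIs. */
struct toy_cert {
	const char *desc;	/* lookup key, e.g. a peer identity */
	const char *blob;	/* stand-in for certificate material */
};

#define TOY_KEYRING_MAX 8

struct toy_keyring {
	struct toy_cert certs[TOY_KEYRING_MAX];
	size_t count;
	/* Called on a miss; models user space supplying a cert. */
	const char *(*upcall)(const char *desc);
	unsigned int upcalls;	/* number of upcalls made so far */
};

/* Add a cert; returns 0 on success, -1 if the keyring is full. */
static int toy_keyring_add(struct toy_keyring *kr, const char *desc,
			   const char *blob)
{
	assert(kr && desc && blob);
	if (kr->count >= TOY_KEYRING_MAX)
		return -1;
	kr->certs[kr->count].desc = desc;
	kr->certs[kr->count].blob = blob;
	kr->count++;
	return 0;
}

/* Look up @desc; on a miss, ask the upcall and cache what it returns. */
static const char *toy_keyring_request(struct toy_keyring *kr,
				       const char *desc)
{
	const char *blob;
	size_t i;

	assert(kr && desc);
	for (i = 0; i < kr->count; i++)
		if (strcmp(kr->certs[i].desc, desc) == 0)
			return kr->certs[i].blob;

	if (!kr->upcall)
		return NULL;
	kr->upcalls++;
	blob = kr->upcall(desc);
	if (blob)
		toy_keyring_add(kr, desc, blob);	/* best-effort cache */
	return blob;
}

/* Example upcall: this pretend "user space" knows exactly one peer. */
static const char *toy_upcall(const char *desc)
{
	return strcmp(desc, "nvme-peer") == 0 ? "CERT-FROM-USERSPACE" : NULL;
}
```

A caller would set .upcall = toy_upcall in a zero-initialized
struct toy_keyring and then only ever call toy_keyring_request(); the
handshake code itself never has to manage keys, which is the property
argued for above.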

Cheers,

Hannes

Patch

diff --git a/include/net/handshake.h b/include/net/handshake.h
new file mode 100644
index 000000000000..b3fa1d006dcc
--- /dev/null
+++ b/include/net/handshake.h
@@ -0,0 +1,31 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PF_HANDSHAKE protocol family socket handler.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are visible only within the
+ * kernel are declared here.
+ */
+
+#ifndef _NET_HANDSHAKE_H
+#define _NET_HANDSHAKE_H
+
+struct handshake_info {
+	void			(*hi_done)(struct handshake_info *hsi);
+	int			(*hi_fd_parms_reply)(struct sk_buff *msg,
+						     struct handshake_info *hsi);
+	void			*hi_data;
+	struct socket_wq	*hi_saved_wq;
+	struct socket		*hi_saved_socket;
+	kuid_t			hi_saved_uid;
+};
+
+extern int handshake_enqueue_sock(struct socket *sock,
+				  struct handshake_info *hsi);
+
+#endif /* _NET_HANDSHAKE_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index e0517ecc6531..5ed2d809a149 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -349,6 +349,7 @@  struct sk_filter;
   *	@sk_txtime_unused: unused txtime flags
   *	@ns_tracker: tracker for netns reference
   *	@sk_bind2_node: bind node in the bhash2 table
+  *	@sk_handshake_data: private data for xprt layer security handshake
   */
 struct sock {
 	/*
@@ -515,6 +516,7 @@  struct sock {
 
 	struct socket		*sk_socket;
 	void			*sk_user_data;
+	void			*sk_handshake_data;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
 #endif
diff --git a/include/trace/events/handshake.h b/include/trace/events/handshake.h
new file mode 100644
index 000000000000..ae3fd3a1ebe9
--- /dev/null
+++ b/include/trace/events/handshake.h
@@ -0,0 +1,328 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022, 2023 Oracle.  All rights reserved.
+ *
+ * Trace point definitions for the "handshake" trace subsystem.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM handshake
+
+#if !defined(_TRACE_HANDSHAKE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HANDSHAKE_H
+
+#include <asm/unaligned.h>
+#include <linux/types.h>
+#include <net/tcp_states.h>
+
+#include <linux/tracepoint.h>
+
+#define show_address_family(family)				\
+	__print_symbolic(family,				\
+		{ AF_INET,		"AF_INET" },		\
+		{ AF_INET6,		"AF_INET6" },		\
+		{ AF_HANDSHAKE,		"AF_HANDSHAKE" })
+
+TRACE_DEFINE_ENUM(TCP_ESTABLISHED);
+TRACE_DEFINE_ENUM(TCP_SYN_SENT);
+TRACE_DEFINE_ENUM(TCP_SYN_RECV);
+TRACE_DEFINE_ENUM(TCP_FIN_WAIT1);
+TRACE_DEFINE_ENUM(TCP_FIN_WAIT2);
+TRACE_DEFINE_ENUM(TCP_TIME_WAIT);
+TRACE_DEFINE_ENUM(TCP_CLOSE);
+TRACE_DEFINE_ENUM(TCP_CLOSE_WAIT);
+TRACE_DEFINE_ENUM(TCP_LAST_ACK);
+TRACE_DEFINE_ENUM(TCP_LISTEN);
+TRACE_DEFINE_ENUM(TCP_CLOSING);
+TRACE_DEFINE_ENUM(TCP_NEW_SYN_RECV);
+
+#define show_tcp_state(state)					\
+	__print_symbolic(state,					\
+		{ TCP_ESTABLISHED,	"ESTABLISHED" },	\
+		{ TCP_SYN_SENT,		"SYN_SENT" },		\
+		{ TCP_SYN_RECV,		"SYN_RECV" },		\
+		{ TCP_FIN_WAIT1,	"FIN_WAIT1" },		\
+		{ TCP_FIN_WAIT2,	"FIN_WAIT2" },		\
+		{ TCP_TIME_WAIT,	"TIME_WAIT" },		\
+		{ TCP_CLOSE,		"CLOSE" },		\
+		{ TCP_CLOSE_WAIT,	"CLOSE_WAIT" },		\
+		{ TCP_LAST_ACK,		"LAST_ACK" },		\
+		{ TCP_LISTEN,		"LISTEN" },		\
+		{ TCP_CLOSING,		"CLOSING" },		\
+		{ TCP_NEW_SYN_RECV,	"NEW_SYN_RECV" })
+
+#define show_poll_event_mask(mask)				\
+	__print_flags(mask, "|",				\
+		{ EPOLLIN,		"IN" },			\
+		{ EPOLLPRI,		"PRI" },		\
+		{ EPOLLOUT,		"OUT" },		\
+		{ EPOLLERR,		"ERR" },		\
+		{ EPOLLHUP,		"HUP" },		\
+		{ EPOLLNVAL,		"NVAL" },		\
+		{ EPOLLRDNORM,		"RDNORM" },		\
+		{ EPOLLRDBAND,		"RDBAND" },		\
+		{ EPOLLWRNORM,		"WRNORM" },		\
+		{ EPOLLWRBAND,		"WRBAND" },		\
+		{ EPOLLMSG,		"MSG" },		\
+		{ EPOLLRDHUP,		"RDHUP" })
+
+DECLARE_EVENT_CLASS(handshake_listener_class,
+	TP_PROTO(const struct socket *sock),
+	TP_ARGS(sock),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, family)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->family = handshake_sk((struct sock *)sk)->hs_bind_family;
+	),
+	TP_printk("listener=%p sk=%p(%d) family=%s",
+		__entry->sock, __entry->sk,
+		__entry->refcount, show_address_family(__entry->family)
+	)
+);
+
+#define DEFINE_HANDSHAKE_LISTENER_EVENT(name)			\
+	DEFINE_EVENT(handshake_listener_class, name,		\
+		TP_PROTO(const struct socket *sock),		\
+		TP_ARGS(sock))
+
+DEFINE_HANDSHAKE_LISTENER_EVENT(handshake_bind);
+DEFINE_HANDSHAKE_LISTENER_EVENT(handshake_accept);
+DEFINE_HANDSHAKE_LISTENER_EVENT(handshake_listen);
+DEFINE_HANDSHAKE_LISTENER_EVENT(handshake_pf_create);
+
+TRACE_EVENT(handshake_newsock,
+	TP_PROTO(
+		const struct socket *newsock,
+		const struct sock *newsk
+	),
+	TP_ARGS(newsock, newsk),
+	TP_STRUCT__entry(
+		__field(const struct socket *, newsock)
+		__field(const struct sock *, newsk)
+		__field(int, refcount)
+		__field(unsigned long, family)
+	),
+	TP_fast_assign(
+		__entry->newsock = newsock;
+		__entry->newsk = newsk;
+		__entry->refcount = refcount_read(&newsk->sk_refcnt);
+		__entry->family = newsk->sk_family;
+	),
+	TP_printk("newsock=%p newsk=%p(%d) family=%s",
+		__entry->newsock, __entry->newsk,
+		__entry->refcount, show_address_family(__entry->family)
+	)
+);
+
+DECLARE_EVENT_CLASS(handshake_proto_op_class,
+	TP_PROTO(const struct socket *sock),
+	TP_ARGS(sock),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, family)
+		__field(unsigned long, state)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->family = sk->sk_family;
+		__entry->state = sk->sk_state;
+	),
+	TP_printk("sock=%p sk=%p(%d) family=%s state=%s",
+		__entry->sock, __entry->sk, __entry->refcount,
+		show_address_family(__entry->family),
+		show_tcp_state(__entry->state)
+	)
+);
+
+#define DEFINE_HANDSHAKE_PROTO_OP_EVENT(name)			\
+	DEFINE_EVENT(handshake_proto_op_class, name,		\
+		TP_PROTO(const struct socket *sock),		\
+		TP_ARGS(sock))
+
+DEFINE_HANDSHAKE_PROTO_OP_EVENT(handshake_release);
+DEFINE_HANDSHAKE_PROTO_OP_EVENT(handshake_getname);
+DEFINE_HANDSHAKE_PROTO_OP_EVENT(handshake_shutdown);
+DEFINE_HANDSHAKE_PROTO_OP_EVENT(handshake_setsockopt);
+DEFINE_HANDSHAKE_PROTO_OP_EVENT(handshake_getsockopt);
+
+TRACE_EVENT(handshake_sendmsg_start,
+	TP_PROTO(
+		const struct socket *sock,
+		size_t size
+	),
+	TP_ARGS(sock, size),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, family)
+		__field(unsigned long, state)
+		__field(const void *, op)
+		__field(size_t, size)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->family = sk->sk_family;
+		__entry->state = sk->sk_state;
+		__entry->op = sk->sk_prot->sendmsg;
+		__entry->size = size;
+	),
+	TP_printk("sock=%p sk=%p(%d) family=%s state=%s size=%zu op=%pS",
+		__entry->sock, __entry->sk, __entry->refcount,
+		show_address_family(__entry->family),
+		show_tcp_state(__entry->state),
+		__entry->size, __entry->op
+	)
+);
+
+TRACE_EVENT(handshake_recvmsg_start,
+	TP_PROTO(
+		const struct socket *sock,
+		size_t size
+	),
+	TP_ARGS(sock, size),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, family)
+		__field(unsigned long, state)
+		__field(const void *, op)
+		__field(size_t, size)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->family = sk->sk_family;
+		__entry->state = sk->sk_state;
+		__entry->op = sk->sk_prot->recvmsg;
+		__entry->size = size;
+	),
+	TP_printk("sock=%p sk=%p(%d) family=%s state=%s size=%zu op=%pS",
+		__entry->sock, __entry->sk, __entry->refcount,
+		show_address_family(__entry->family),
+		show_tcp_state(__entry->state),
+		__entry->size, __entry->op
+	)
+);
+
+DECLARE_EVENT_CLASS(handshake_opmsg_result_class,
+	TP_PROTO(
+		const struct socket *sock,
+		int result
+	),
+	TP_ARGS(sock, result),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, family)
+		__field(unsigned long, state)
+		__field(int, result)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->family = sk->sk_family;
+		__entry->state = sk->sk_state;
+		__entry->result = result;
+	),
+	TP_printk("sock=%p sk=%p(%d) family=%s state=%s result=%d",
+		__entry->sock, __entry->sk, __entry->refcount,
+		show_address_family(__entry->family),
+		show_tcp_state(__entry->state),
+		__entry->result
+	)
+);
+
+#define DEFINE_HANDSHAKE_OPMSG_RESULT_EVENT(name)		\
+	DEFINE_EVENT(handshake_opmsg_result_class, name,		\
+		TP_PROTO(					\
+			const struct socket *sock,		\
+			int result				\
+		),						\
+		TP_ARGS(sock, result))
+
+DEFINE_HANDSHAKE_OPMSG_RESULT_EVENT(handshake_sendmsg_result);
+DEFINE_HANDSHAKE_OPMSG_RESULT_EVENT(handshake_recvmsg_result);
+
+TRACE_EVENT(handshake_poll,
+	TP_PROTO(
+		const struct socket *sock,
+		__poll_t mask
+	),
+	TP_ARGS(sock, mask),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, mask)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->mask = (__force unsigned long)mask;
+	),
+	TP_printk("sock=%p sk=%p(%d) mask=%s",
+		__entry->sock, __entry->sk, __entry->refcount,
+		show_poll_event_mask(__entry->mask)
+	)
+);
+
+TRACE_EVENT(handshake_poll_listener,
+	TP_PROTO(
+		const struct socket *sock,
+		__poll_t mask
+	),
+	TP_ARGS(sock, mask),
+	TP_STRUCT__entry(
+		__field(const struct socket *, sock)
+		__field(const struct sock *, sk)
+		__field(int, refcount)
+		__field(unsigned long, mask)
+	),
+	TP_fast_assign(
+		const struct sock *sk = sock->sk;
+
+		__entry->sock = sock;
+		__entry->sk = sk;
+		__entry->refcount = refcount_read(&sk->sk_refcnt);
+		__entry->mask = (__force unsigned long)mask;
+	),
+	TP_printk("sock=%p sk=%p(%d) mask=%s",
+		__entry->sock, __entry->sk, __entry->refcount,
+		show_poll_event_mask(__entry->mask)
+	)
+);
+
+#endif /* _TRACE_HANDSHAKE_H */
+
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/handshake.h b/include/uapi/linux/handshake.h
new file mode 100644
index 000000000000..72facc352c71
--- /dev/null
+++ b/include/uapi/linux/handshake.h
@@ -0,0 +1,49 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Generic netlink service for handshakes
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are visible to user space are
+ * declared here. This file constitutes an API contract between the
+ * Linux kernel and user space.
+ */
+
+#ifndef _UAPI_LINUX_HANDSHAKE_H
+#define _UAPI_LINUX_HANDSHAKE_H
+
+enum handshake_protocol {
+	HANDSHAKE_PROTO_UNSPEC = 0,
+};
+
+#define HANDSHAKE_GENL_NAME	"HANDSHAKE_GENL"
+#define HANDSHAKE_GENL_VERSION	0x01
+
+enum handshake_genl_attrs {
+	HANDSHAKE_GENL_ATTR_UNSPEC = 0,
+	HANDSHAKE_GENL_ATTR_SOCKFD,
+	HANDSHAKE_GENL_ATTR_STATUS,
+	HANDSHAKE_GENL_ATTR_PROTOCOL,
+	__HANDSHAKE_GENL_ATTR_MAX
+};
+#define HANDSHAKE_GENL_ATTR_MAX	(__HANDSHAKE_GENL_ATTR_MAX - 1)
+
+enum handshake_genl_cmds {
+	HANDSHAKE_GENL_CMD_UNSPEC = 0,
+	HANDSHAKE_GENL_CMD_GET_FD_PARAMETERS,
+	__HANDSHAKE_GENL_CMD_MAX
+};
+#define HANDSHAKE_GENL_CMD_MAX	(__HANDSHAKE_GENL_CMD_MAX - 1)
+
+enum handshake_genl_status {
+	HANDSHAKE_GENL_STATUS_OK = 0,
+	HANDSHAKE_GENL_STATUS_INVAL,
+	HANDSHAKE_GENL_STATUS_SOCKNOTFOUND,
+	HANDSHAKE_GENL_STATUS_SOCKNOTVALID,
+};
+
+#endif /* _UAPI_LINUX_HANDSHAKE_H */
diff --git a/net/Makefile b/net/Makefile
index 6a62e5b27378..c1bb53f00486 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -78,3 +78,4 @@  obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
 obj-$(CONFIG_MPTCP)		+= mptcp/
 obj-$(CONFIG_MCTP)		+= mctp/
+obj-y				+= handshake/
diff --git a/net/handshake/Makefile b/net/handshake/Makefile
new file mode 100644
index 000000000000..847e0ab2b99e
--- /dev/null
+++ b/net/handshake/Makefile
@@ -0,0 +1,7 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for the HANDSHAKE subsystem.
+#
+
+obj-y += handshake.o
+handshake-y := af_handshake.o netlink.o trace.o
diff --git a/net/handshake/af_handshake.c b/net/handshake/af_handshake.c
new file mode 100644
index 000000000000..3ba3daeb82d3
--- /dev/null
+++ b/net/handshake/af_handshake.c
@@ -0,0 +1,838 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PF_HANDSHAKE protocol family socket handler.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2021-2023 Oracle and/or its affiliates.
+ *
+ * When the kernel needs to invoke a user space service on an open
+ * socket descriptor, it can use this mechanism to make the socket
+ * endpoint available to a user space program.
+ *
+ * The user space program listens on an AF_HANDSHAKE socket. When
+ * the listener is made ready, an accept(2) call materializes
+ * the desired socket endpoint in the listening process's file
+ * descriptor table.
+ *
+ * The listener closes that endpoint when it is finished with it
+ * (or when it exits). The kernel knows that at that point it is
+ * safe to use the socket again.
+ */
+
+/*
+ * Socket reference counting
+ *  A: listener socket initial reference
+ *  B: listener socket on the global listener list
+ *  C: listener socket while a ready AF_INET(6) socket is enqueued
+ *  D: listener socket while its accept queue is drained
+ *
+ *  I: ready AF_INET(6) socket waiting on a listener's accept queue
+ *  J: ready AF_INET(6) socket with a consumer waiting for a completion callback
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/in.h>
+#include <linux/kernel.h>
+#include <linux/poll.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/inet.h>
+
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/tcp.h>
+#include <net/protocol.h>
+#include <net/sock.h>
+#include <net/genetlink.h>
+#include <net/inet_common.h>
+#include <net/net_namespace.h>
+#include <net/handshake.h>
+
+#include "handshake.h"
+
+#include <trace/events/handshake.h>
+
+static DEFINE_RWLOCK(handshake_listener_lock);
+static HLIST_HEAD(handshake_listeners);
+
+static void handshake_register_listener(struct sock *sk)
+{
+	write_lock_bh(&handshake_listener_lock);
+	sk_add_node(sk, &handshake_listeners);	/* Ref: B */
+	write_unlock_bh(&handshake_listener_lock);
+}
+
+static void handshake_unregister_listener(struct sock *sk)
+{
+	write_lock_bh(&handshake_listener_lock);
+	sk_del_node_init(sk);			/* Ref: B */
+	write_unlock_bh(&handshake_listener_lock);
+}
+
+/**
+ * handshake_find_listener - find listener that matches an incoming connection
+ * @net: net namespace to match
+ * @family: address family to match
+ *
+ * Return values:
+ *   On success, address of a listening AF_HANDSHAKE socket
+ *   %NULL: No matching listener found
+ */
+static struct sock *handshake_find_listener(struct net *net, unsigned short family)
+{
+	struct sock *listener;
+
+	read_lock(&handshake_listener_lock);
+
+	sk_for_each(listener, &handshake_listeners) {
+		if (sock_net(listener) != net)
+			continue;
+		if (handshake_sk(listener)->hs_bind_family != AF_UNSPEC &&
+		    handshake_sk(listener)->hs_bind_family != family)
+			continue;
+
+		sock_hold(listener);	/* Ref: C */
+		goto out;
+	}
+	listener = NULL;
+
+out:
+	read_unlock(&handshake_listener_lock);
+	return listener;
+}
+
+/**
+ * handshake_accept_enqueue - add a socket to a listener's accept_q
+ * @listener: listening socket
+ * @sk: socket to enqueue on @listener
+ *
+ * Return values:
+ *   On success, returns 0
+ *   %-ENOMEM: Memory for skbs has been exhausted
+ */
+static int handshake_accept_enqueue(struct sock *listener, struct sock *sk)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	sock_hold(sk);	/* Ref: I */
+	skb->sk = sk;
+	skb_queue_tail(&listener->sk_receive_queue, skb);
+	sk_acceptq_added(listener);
+	listener->sk_data_ready(listener);
+	return 0;
+}
+
+/**
+ * handshake_accept_dequeue - remove a socket from a listener's accept_q
+ * @listener: listener socket to check
+ *
+ * Caller must guarantee that @listener won't disappear.
+ *
+ * Return values:
+ *   On success, return a TCP socket waiting for TLS service
+ *   %NULL: No sockets on the accept queue
+ */
+static struct sock *handshake_accept_dequeue(struct sock *listener)
+{
+	struct sk_buff *skb;
+	struct sock *sk;
+
+	skb = skb_dequeue(&listener->sk_receive_queue);
+	if (!skb)
+		return NULL;
+	sk_acceptq_removed(listener);
+	sock_put(listener);	/* Ref: C */
+
+	sk = skb->sk;
+	skb->sk = NULL;
+	kfree_skb(skb);
+	sock_put(sk);	/* Ref: I */
+	return sk;
+}
+
+static void handshake_sock_save(struct sock *sk, struct handshake_info *hsi)
+{
+	sock_hold(sk);	/* Ref: J */
+
+	write_lock_bh(&sk->sk_callback_lock);
+	hsi->hi_saved_wq = sk->sk_wq_raw;
+	hsi->hi_saved_socket = sk->sk_socket;
+	hsi->hi_saved_uid = sk->sk_uid;
+	sk->sk_handshake_data = hsi;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void handshake_sock_clear(struct sock *sk)
+{
+	write_lock_bh(&sk->sk_callback_lock);
+	sk->sk_handshake_data = NULL;
+	write_unlock_bh(&sk->sk_callback_lock);
+	sock_put(sk);	/* Ref: J (err) */
+}
+
+static void handshake_sock_restore_locked(struct sock *sk)
+{
+	struct handshake_info *hsi = sk->sk_handshake_data;
+
+	sk->sk_wq_raw = hsi->hi_saved_wq;
+	sk->sk_socket = hsi->hi_saved_socket;
+	sk->sk_uid = hsi->hi_saved_uid;
+	sk->sk_handshake_data = NULL;
+}
+
+static const struct proto_ops *handshake_saved_ops(struct sock *sk)
+{
+	const struct proto_ops *ops = NULL;
+	struct handshake_info *hsi;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	hsi = sk->sk_handshake_data;
+	if (hsi)
+		ops = hsi->hi_saved_socket->ops;
+	read_unlock_bh(&sk->sk_callback_lock);
+	return ops;
+}
+
+/**
+ * handshake_done - call the registered "done" callback for @sk.
+ * @sk: socket that was requesting a handshake
+ *
+ * Return values:
+ *   %true:  Handshake callback was called
+ *   %false: No handshake callback was set, no-op
+ */
+static bool handshake_done(struct sock *sk)
+{
+	struct handshake_info *hsi;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	hsi = sk->sk_handshake_data;
+	if (hsi) {
+		handshake_sock_restore_locked(sk);
+		hsi->hi_done(hsi);
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	if (hsi) {
+		sock_put(sk);	/* Ref: J */
+		return true;
+	}
+	return false;
+}
+
+/**
+ * handshake_accept_drain - clean up children queued for accept
+ * @listener: listener socket to drain
+ *
+ */
+static void handshake_accept_drain(struct sock *listener)
+{
+	struct sock *sk;
+
+	while ((sk = handshake_accept_dequeue(listener)))
+		handshake_done(sk);
+}
+
+/**
+ * handshake_release - free an AF_HANDSHAKE socket
+ * @sock: socket to release
+ *
+ * Return values:
+ *   %0: success
+ */
+static int handshake_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct handshake_sock *ssk = handshake_sk(sk);
+	int ret = 0;
+
+	if (!sk)
+		return ret;
+
+	trace_handshake_release(sock);
+
+	switch (sk->sk_family) {
+	case AF_HANDSHAKE:
+		sock_hold(sk);	/* Ref: D */
+		sock_orphan(sk);
+		lock_sock(sk);
+
+		handshake_unregister_listener(sk);
+		handshake_accept_drain(sk);
+
+		sk->sk_state = TCP_CLOSE;
+		sk->sk_shutdown |= SEND_SHUTDOWN;
+		sk->sk_state_change(sk);
+
+		ssk->hs_bind_family = AF_UNSPEC;
+		sock->sk = NULL;
+		release_sock(sk);
+		sock_put(sk);	/* Ref: D */
+
+		sock_put(sk);	/* Ref: A */
+		break;
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		if (!handshake_done(sk)) {
+			const struct proto_ops *ops;
+
+			ops = handshake_saved_ops(sk);
+			if (ops)
+				ret = ops->release(sock);
+		}
+		break;
+	}
+
+	return ret;
+}
+
+/**
+ * handshake_bind - bind a name to an AF_HANDSHAKE socket
+ * @sock: socket to be bound
+ * @uaddr: address to bind to
+ * @addrlen: length in bytes of @uaddr
+ *
+ * Binding an AF_HANDSHAKE socket defines the family of addresses that
+ * are able to be accept(2)'d. So, AF_INET for ipv4, AF_INET6 for
+ * ipv6.
+ *
+ * Return values:
+ *   %0: binding was successful.
+ *   %-EPERM: Caller not privileged
+ *   %-EINVAL: Family of @sock or @uaddr not supported
+ */
+static int handshake_bind(struct socket *sock, struct sockaddr *uaddr, int addrlen)
+{
+	struct sock *listener, *sk = sock->sk;
+	struct handshake_sock *ssk = handshake_sk(sk);
+
+	if (!capable(CAP_NET_BIND_SERVICE))
+		return -EPERM;
+
+	switch (uaddr->sa_family) {
+	case AF_INET:
+		if (addrlen != sizeof(struct sockaddr_in))
+			return -EINVAL;
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (addrlen != sizeof(struct sockaddr_in6))
+			return -EINVAL;
+		break;
+#endif
+	default:
+		return -EAFNOSUPPORT;
+	}
+
+	listener = handshake_find_listener(sock_net(sk), uaddr->sa_family);
+	if (listener) {
+		sock_put(listener);	/* Ref: C */
+		return -EADDRINUSE;
+	}
+
+	ssk->hs_bind_family = uaddr->sa_family;
+	trace_handshake_bind(sock);
+	return 0;
+}
+
+/**
+ * handshake_accept - return a connection waiting for a TLS handshake
+ * @listener: listener socket which connection requests arrive on
+ * @newsock: socket to move incoming connection to
+ * @flags: SOCK_NONBLOCK and/or SOCK_CLOEXEC
+ * @kern: "boolean": 1 for kernel-internal sockets
+ *
+ * Return values:
+ *   %0: @newsock has been initialized.
+ *   %-EPERM: caller is not privileged
+ */
+static int handshake_accept(struct socket *listener, struct socket *newsock, int flags,
+			  bool kern)
+{
+	struct sock *sk = listener->sk, *newsk;
+	DECLARE_WAITQUEUE(wait, current);
+	long timeo;
+	int rc;
+
+	trace_handshake_accept(listener);
+
+	rc = -EPERM;
+	if (!capable(CAP_NET_BIND_SERVICE))
+		goto out;
+
+	lock_sock(sk);
+
+	if (sk->sk_state != TCP_LISTEN) {
+		rc = -EBADF;
+		goto out_release;
+	}
+
+	timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
+
+	rc = 0;
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
+	while (!(newsk = handshake_accept_dequeue(sk))) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!timeo) {
+			rc = -EAGAIN;
+			break;
+		}
+		release_sock(sk);
+
+		timeo = schedule_timeout(timeo);
+
+		lock_sock(sk);
+		if (sk->sk_state != TCP_LISTEN) {
+			rc = -EBADF;
+			break;
+		}
+		if (signal_pending(current)) {
+			rc = sock_intr_errno(timeo);
+			break;
+		}
+	}
+	set_current_state(TASK_RUNNING);
+	remove_wait_queue(sk_sleep(sk), &wait);
+	if (rc) {
+		handshake_done(sk);
+		goto out_release;
+	}
+
+	sock_graft(newsk, newsock);
+	trace_handshake_newsock(newsock, newsk);
+
+out_release:
+	release_sock(sk);
+out:
+	return rc;
+}
+
+/**
+ * handshake_getname - retrieve src/dst address information from an AF_HANDSHAKE socket
+ * @sock: socket to query
+ * @uaddr: buffer to fill in
+ * @peer: value indicates which address to retrieve
+ *
+ * Return values:
+ *   On success, a positive length of the address in @uaddr
+ *   On error, a negative errno
+ */
+static int handshake_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
+{
+	struct sock *sk = sock->sk;
+	const struct proto_ops *ops;
+
+	trace_handshake_getname(sock);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	ops = handshake_saved_ops(sk);
+	if (!ops)
+		return -EBADFD;
+	return ops->getname(sock, uaddr, peer);
+}
+
+/**
+ * handshake_poll - check for data ready on an AF_HANDSHAKE socket
+ * @file: file to check for work
+ * @sock: socket associated with @file
+ * @wait: poll table
+ *
+ * Return values:
+ *    A mask of flags indicating what type of I/O is ready
+ */
+static __poll_t handshake_poll(struct file *file, struct socket *sock,
+			     poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	__poll_t mask;
+
+	sock_poll_wait(file, sock, wait);
+
+	mask = 0;
+
+	if (sk->sk_state == TCP_LISTEN) {
+		if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
+			mask |= EPOLLIN | EPOLLRDNORM;
+		if (sk_is_readable(sk))
+			mask |= EPOLLIN | EPOLLRDNORM;
+		trace_handshake_poll_listener(sock, mask);
+		return mask;
+	}
+
+	if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
+		mask |= EPOLLHUP;
+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
+
+	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (sk_is_readable(sk))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	/* This barrier is coupled with smp_wmb() in tcp_reset() */
+	smp_rmb();
+	if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
+		mask |= EPOLLERR;
+
+	trace_handshake_poll(sock, mask);
+	return mask;
+}
+
+/**
+ * handshake_listen - move an AF_HANDSHAKE socket into a listening state
+ * @sock: socket to transition to listening state
+ * @backlog: size of backlog queue
+ *
+ * Return values:
+ *   %0: @sock is now in a listening state
+ *   %-EPERM: caller is not privileged
+ *   %-EOPNOTSUPP: @sock is not of a type that supports the listen() operation
+ */
+static int handshake_listen(struct socket *sock, int backlog)
+{
+	struct sock *sk = sock->sk;
+	unsigned char old_state;
+	int rc;
+
+	if (!capable(CAP_NET_BIND_SERVICE))
+		return -EPERM;
+
+	lock_sock(sk);
+
+	rc = -EOPNOTSUPP;
+	if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
+		goto out;
+	old_state = sk->sk_state;
+	if (!((1 << old_state) & (TCPF_CLOSE | TCPF_LISTEN)))
+		goto out;
+
+	sk->sk_max_ack_backlog = backlog;
+	sk->sk_state = TCP_LISTEN;
+	handshake_register_listener(sk);
+
+	trace_handshake_listen(sock);
+	rc = 0;
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+/**
+ * handshake_shutdown - Shutdown an AF_HANDSHAKE socket
+ * @sock: socket to shut down
+ * @how: mask
+ *
+ * Return values:
+ *   %0: Success
+ *   %-EINVAL: @sock is not of a type that supports a shutdown
+ */
+static int handshake_shutdown(struct socket *sock, int how)
+{
+	struct sock *sk = sock->sk;
+
+	trace_handshake_shutdown(sock);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return inet_shutdown(sock, how);
+}
+
+/**
+ * handshake_setsockopt - Set a socket option on an AF_HANDSHAKE socket
+ * @sock: socket to act upon
+ * @level: which network layer to act upon
+ * @optname: which option to set
+ * @optval: new value to set
+ * @optlen: the size of the new value, in bytes
+ *
+ * Return values:
+ *   %0: Success
+ *   %-ENOPROTOOPT: The option is unknown at the level indicated.
+ */
+static int handshake_setsockopt(struct socket *sock, int level, int optname,
+			      sockptr_t optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+
+	trace_handshake_setsockopt(sock);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		break;
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	return sock_common_setsockopt(sock, level, optname, optval, optlen);
+}
+
+/**
+ * handshake_getsockopt - Retrieve a socket option from an AF_HANDSHAKE socket
+ * @sock: socket to act upon
+ * @level: which network layer to act upon
+ * @optname: which option to retrieve
+ * @optval: a buffer into which to receive the option's value
+ * @optlen: the size of the receive buffer, in bytes
+ *
+ * Return values:
+ *   %0: Success
+ *   %-ENOPROTOOPT: The option is unknown at the level indicated.
+ *   %-EINVAL: Invalid argument
+ *   %-EFAULT: Output memory not write-able
+ *   %-EBUSY: Option value not available
+ */
+static int handshake_getsockopt(struct socket *sock, int level, int optname,
+			      char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+
+	trace_handshake_getsockopt(sock);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		break;
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	return sock_common_getsockopt(sock, level, optname, optval, optlen);
+}
+
+/**
+ * handshake_sendmsg - Send a message on an AF_HANDSHAKE socket
+ * @sock: socket to send on
+ * @msg: message to send
+ * @size: size of message, in bytes
+ *
+ * Return values:
+ *   %0: Success
+ *   %-EOPNOTSUPP: Address family does not support this operation
+ */
+static int handshake_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
+{
+	struct sock *sk = sock->sk;
+	int ret;
+
+	trace_handshake_sendmsg_start(sock, size);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (unlikely(inet_send_prepare(sk))) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	ret = sk->sk_prot->sendmsg(sk, msg, size);
+
+out:
+	trace_handshake_sendmsg_result(sock, ret);
+	return ret;
+}
+
+/**
+ * handshake_recvmsg - Receive a message from an AF_HANDSHAKE socket
+ * @sock: socket to receive from
+ * @msg: buffer into which to receive
+ * @size: size of buffer, in bytes
+ * @flags: control settings
+ *
+ * Return values:
+ *   %0: Success
+ *   %-EOPNOTSUPP: Address family does not support this operation
+ */
+static int handshake_recvmsg(struct socket *sock, struct msghdr *msg,
+			   size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int ret;
+
+	trace_handshake_recvmsg_start(sock, size);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+#endif
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (likely(!(flags & MSG_ERRQUEUE)))
+		sock_rps_record_flow(sk);
+	ret = sock_common_recvmsg(sock, msg, size, flags);
+
+out:
+	trace_handshake_recvmsg_result(sock, ret);
+	return ret;
+}
+
+static const struct proto_ops handshake_proto_ops = {
+	.family		= PF_HANDSHAKE,
+	.owner		= THIS_MODULE,
+
+	.release	= handshake_release,
+	.bind		= handshake_bind,
+	.connect	= sock_no_connect,
+	.socketpair	= sock_no_socketpair,
+	.accept		= handshake_accept,
+	.getname	= handshake_getname,
+	.poll		= handshake_poll,
+	.ioctl		= sock_no_ioctl,
+	.gettstamp	= sock_gettstamp,
+	.listen		= handshake_listen,
+	.shutdown	= handshake_shutdown,
+	.setsockopt	= handshake_setsockopt,
+	.getsockopt	= handshake_getsockopt,
+	.sendmsg	= handshake_sendmsg,
+	.recvmsg	= handshake_recvmsg,
+	.mmap		= sock_no_mmap,
+	.sendpage	= sock_no_sendpage,
+};
+
+static struct proto handshake_prot = {
+	.name			= "HANDSHAKE",
+	.owner			= THIS_MODULE,
+	.obj_size		= sizeof(struct handshake_sock),
+};
+
+/**
+ * handshake_pf_create - Create an AF_HANDSHAKE socket
+ * @net: network namespace to own the new socket
+ * @sock: socket to initialize
+ * @protocol: IP protocol number (ignored)
+ * @kern: "boolean": 1 for kernel-internal sockets
+ *
+ * Return values:
+ *   %0: @sock was initialized, and module ref count incremented.
+ *   Negative errno values indicate initialization failed.
+ */
+static int handshake_pf_create(struct net *net, struct socket *sock, int protocol,
+			     int kern)
+{
+	struct sock *sk;
+	int rc;
+
+	sock->state = SS_UNCONNECTED;
+	sock->ops = &handshake_proto_ops;
+
+	/* Ref: A */
+	sk = sk_alloc(net, PF_HANDSHAKE, GFP_KERNEL, &handshake_prot, kern);
+	if (!sk)
+		return -ENOMEM;
+
+	sock_init_data(sock, sk);
+	if (sk->sk_prot->init) {
+		rc = sk->sk_prot->init(sk);
+		if (rc)
+			goto err_sk_put;
+	}
+
+	handshake_sk(sk)->hs_bind_family = AF_UNSPEC;
+	trace_handshake_pf_create(sock);
+	return 0;
+
+err_sk_put:
+	sock_orphan(sk);
+	sk_free(sk);	/* Ref: A (err) */
+	return rc;
+}
+
+/**
+ * handshake_enqueue_sock - Queue a socket to be shared with user space
+ * @sock: a connected socket to share with user space
+ * @hsi: info packet tracking this request
+ *
+ * Return values:
+ *   %0: Successfully queued
+ *   %-ENOENT: No listener is available to handle this request
+ *   %-ENOMEM: Memory allocation failed
+ */
+int handshake_enqueue_sock(struct socket *sock, struct handshake_info *hsi)
+{
+	struct sock *listener, *sk = sock->sk;
+	int rc;
+
+	listener = handshake_find_listener(sock_net(sk), sk->sk_family);
+	if (!listener)
+		return -ENOENT;
+
+	handshake_sock_save(sk, hsi);
+	rc = handshake_accept_enqueue(listener, sk);
+	if (rc) {
+		handshake_sock_clear(sk);
+		sock_put(listener);	/* Ref: C (err) */
+	}
+	return rc;
+}
+EXPORT_SYMBOL(handshake_enqueue_sock);
+
+static const struct net_proto_family handshake_pf_ops = {
+	.family = PF_HANDSHAKE,
+	.create = handshake_pf_create,
+	.owner	= THIS_MODULE,
+};
+
+static int __init handshake_register(void)
+{
+	int rc;
+
+	rc = handshake_genetlink_init();
+	if (rc)
+		return rc;
+	rc = sock_register(&handshake_pf_ops);
+	if (rc)
+		handshake_genetlink_exit();
+	return rc;
+}
+
+static void __exit handshake_unregister(void)
+{
+	sock_unregister(PF_HANDSHAKE);
+	handshake_genetlink_exit();
+}
+
+module_init(handshake_register);
+module_exit(handshake_unregister);
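A module init routine that brings up two facilities (here a generic netlink family and a socket family) normally tears down the first one if the second fails to register. The following is a minimal userspace sketch of that unwind pattern only. Every `fake_*` function and both state flags are hypothetical stand-ins invented for illustration, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state flags standing in for the two registrations. */
bool genl_registered;
bool family_registered;

/* Stand-in for the first registration step; always succeeds here. */
static int fake_genetlink_init(void)
{
	genl_registered = true;
	return 0;
}

/* Stand-in for undoing the first step. */
static void fake_genetlink_exit(void)
{
	assert(genl_registered);	/* only undo what was done */
	genl_registered = false;
}

/* Stand-in for the second step; @fail simulates a registration error. */
static int fake_sock_register(bool fail)
{
	if (fail)
		return -EEXIST;
	family_registered = true;
	return 0;
}

/* Bring up both pieces, undoing the first if the second fails. */
int demo_register(bool fail_second)
{
	int rc;

	rc = fake_genetlink_init();
	if (rc)
		return rc;

	rc = fake_sock_register(fail_second);
	if (rc)
		fake_genetlink_exit();	/* unwind step one */
	return rc;
}
```

Calling `demo_register(true)` returns the simulated error and leaves nothing registered, while `demo_register(false)` leaves both flags set.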
diff --git a/net/handshake/handshake.h b/net/handshake/handshake.h
new file mode 100644
index 000000000000..62a6c85c5a17
--- /dev/null
+++ b/net/handshake/handshake.h
@@ -0,0 +1,33 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PF_HANDSHAKE protocol family socket handler.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are internal to handshake/
+ * are declared here.
+ */
+
+#ifndef _HANDSHAKE_H
+#define _HANDSHAKE_H
+
+struct handshake_sock {
+	/* struct sock must remain the first field */
+	struct sock	hs_sk;
+
+	int		hs_bind_family;
+};
+
+static inline struct handshake_sock *handshake_sk(struct sock *sk)
+{
+	return container_of(sk, struct handshake_sock, hs_sk);
+}
+
+extern int __init handshake_genetlink_init(void);
+extern void handshake_genetlink_exit(void);
+
+#endif /* _HANDSHAKE_H */
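The comment in struct handshake_sock requires the embedded struct sock to stay first, and handshake_sk() recovers the wrapper with container_of(). A minimal userspace sketch of that embedding pattern follows. The `container_of` macro here is a simplified stand-in without the kernel's type checking, and `base_sock`, `wrapped_sock`, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature of struct sock. */
struct base_sock {
	int family;
};

/* Same shape as struct handshake_sock: base object first. */
struct wrapped_sock {
	struct base_sock	base;
	int			bind_family;
};

/* Recover the wrapper from a pointer to its embedded base object. */
struct wrapped_sock *wrapped_sk(struct base_sock *sk)
{
	assert(sk != NULL);
	return container_of(sk, struct wrapped_sock, base);
}

/* Read a wrapper-private field given only the base pointer. */
int wrapped_bind_family(struct base_sock *sk)
{
	return wrapped_sk(sk)->bind_family;
}
```

Because the base object sits at offset zero, the wrapper and its base share an address, so an allocator that sizes the object from .obj_size and hands back a pointer to the base type can be used for the wrapper as well.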
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
new file mode 100644
index 000000000000..1d209473f106
--- /dev/null
+++ b/net/handshake/netlink.c
@@ -0,0 +1,169 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * HANDSHAKE generic netlink service
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/inet.h>
+
+#include <net/sock.h>
+#include <net/genetlink.h>
+#include <net/handshake.h>
+
+#include <uapi/linux/handshake.h>
+#include "handshake.h"
+
+static struct genl_family __ro_after_init handshake_genl_family;
+
+static int handshake_genl_op_unsupp(struct sk_buff *skb, struct genl_info *gi)
+{
+	pr_err("Unknown netlink command (%d) ignored\n", gi->genlhdr->cmd);
+	return -EINVAL;
+}
+
+static int handshake_genl_error_reply(struct genl_info *gi,
+				      enum handshake_genl_status status)
+{
+	struct genlmsghdr *hdr;
+	struct sk_buff *msg;
+	int ret;
+
+	ret = -ENOMEM;
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		goto out;
+	hdr = genlmsg_put_reply(msg, gi, &handshake_genl_family, 0,
+				gi->genlhdr->cmd);
+	if (!hdr)
+		goto out_free;
+
+	ret = nla_put_u32(msg, HANDSHAKE_GENL_ATTR_STATUS, status);
+	if (ret < 0)
+		goto out_cancel;
+
+	genlmsg_end(msg, hdr);
+	return genlmsg_reply(msg, gi);
+
+out_cancel:
+	genlmsg_cancel(msg, hdr);
+out_free:
+	nlmsg_free(msg);
+out:
+	return ret;
+}
+
+static int handshake_genl_reply(struct genl_info *gi, struct handshake_info *hsi)
+{
+	struct genlmsghdr *hdr;
+	struct sk_buff *msg;
+	int ret = -ENOMEM;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		goto out;
+	hdr = genlmsg_put_reply(msg, gi, &handshake_genl_family, 0,
+				gi->genlhdr->cmd);
+	if (!hdr)
+		goto out_free;
+
+	ret = hsi->hi_fd_parms_reply(msg, hsi);
+	if (ret < 0)
+		goto out_cancel;
+
+	genlmsg_end(msg, hdr);
+	return genlmsg_reply(msg, gi);
+
+out_cancel:
+	genlmsg_cancel(msg, hdr);
+out_free:
+	nlmsg_free(msg);
+out:
+	return ret;
+}
+
+static int handshake_genl_op_get_fd_parms(struct sk_buff *skb, struct genl_info *gi)
+{
+	struct handshake_info *hsi;
+	struct socket *sock;
+	struct sock *sk;
+	int ret;
+
+	if (!gi->attrs[HANDSHAKE_GENL_ATTR_SOCKFD])
+		return handshake_genl_error_reply(gi, HANDSHAKE_GENL_STATUS_INVAL);
+
+	ret = 0;
+	sock = sockfd_lookup(nla_get_u32(gi->attrs[HANDSHAKE_GENL_ATTR_SOCKFD]),
+			     &ret);
+	if (ret)
+		return handshake_genl_error_reply(gi, HANDSHAKE_GENL_STATUS_SOCKNOTFOUND);
+
+	sk = sock->sk;
+	write_lock_bh(&sk->sk_callback_lock);
+	hsi = sk->sk_handshake_data;
+	if (!hsi) {
+		write_unlock_bh(&sk->sk_callback_lock);
+		sockfd_put(sock);
+		return handshake_genl_error_reply(gi, HANDSHAKE_GENL_STATUS_SOCKNOTVALID);
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	ret = handshake_genl_reply(gi, hsi);
+
+	sockfd_put(sock);
+	return ret;
+}
+
+static const struct nla_policy
+handshake_genl_policy[HANDSHAKE_GENL_ATTR_MAX + 1] = {
+	[HANDSHAKE_GENL_ATTR_SOCKFD] = {
+		.type = NLA_U32
+	},
+	[HANDSHAKE_GENL_ATTR_STATUS] = {
+		.type = NLA_U32
+	},
+	[HANDSHAKE_GENL_ATTR_PROTOCOL] = {
+		.type = NLA_U32
+	},
+};
+
+static const struct genl_ops handshake_genl_ops[] = {
+	{
+		.cmd	= HANDSHAKE_GENL_CMD_UNSPEC,
+		.doit	= handshake_genl_op_unsupp,
+	},
+	{
+		.cmd	= HANDSHAKE_GENL_CMD_GET_FD_PARAMETERS,
+		.doit	= handshake_genl_op_get_fd_parms,
+	},
+};
+
+static struct genl_family __ro_after_init handshake_genl_family = {
+	.hdrsize	= 0,
+	.name		= HANDSHAKE_GENL_NAME,
+	.version	= HANDSHAKE_GENL_VERSION,
+	.maxattr	= HANDSHAKE_GENL_ATTR_MAX,
+	.netnsok	= true,
+	.n_ops		= ARRAY_SIZE(handshake_genl_ops),
+	.resv_start_op	= HANDSHAKE_GENL_CMD_MAX,
+	.policy		= handshake_genl_policy,
+	.ops		= handshake_genl_ops,
+	.module		= THIS_MODULE,
+};
+
+int __init handshake_genetlink_init(void)
+{
+	return genl_register_family(&handshake_genl_family);
+}
+
+void handshake_genetlink_exit(void)
+{
+	genl_unregister_family(&handshake_genl_family);
+}
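Functions that unwind through goto labels, like the reply builders above, need the return code assigned before every jump to a label that returns it. Here is a minimal userspace sketch of that shape, assuming a hypothetical `struct reply` in place of an sk_buff and plain malloc()/free() in place of the netlink allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reply buffer; stands in for an sk_buff. */
struct reply {
	char	*payload;
	size_t	len;
};

/*
 * Build a reply holding @text. @max caps the payload size so that a
 * caller can provoke the "does not fit" error path.
 */
int build_reply(struct reply **out, const char *text, size_t max)
{
	struct reply *msg;
	size_t len;
	int ret;

	assert(out && text);
	len = strlen(text) + 1;

	ret = -ENOMEM;		/* covers both allocation failures below */
	msg = calloc(1, sizeof(*msg));
	if (!msg)
		goto out;
	msg->payload = malloc(len);
	if (!msg->payload)
		goto out_free;

	ret = -EMSGSIZE;	/* set before the next possible jump */
	if (len > max)
		goto out_free_payload;

	memcpy(msg->payload, text, len);
	msg->len = len;
	*out = msg;
	return 0;

out_free_payload:
	free(msg->payload);
out_free:
	free(msg);
out:
	return ret;
}
```

Each label releases exactly what was acquired before the jump, and `ret` is never read without having been assigned first.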
diff --git a/net/handshake/trace.c b/net/handshake/trace.c
new file mode 100644
index 000000000000..5968848da0c1
--- /dev/null
+++ b/net/handshake/trace.c
@@ -0,0 +1,20 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PF_HANDSHAKE protocol family trace points
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023 Oracle and/or its affiliates.
+ */
+
+#include <linux/net.h>
+#include <net/sock.h>
+
+#include "handshake.h"
+
+#ifndef __CHECKER__
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/handshake.h>
+
+#endif