[v3,1/2] net/handshake: Create a NETLINK service for handling handshake requests

Message ID 167580607317.5328.2575913180270613320.stgit@91.116.238.104.host.secureserver.net (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series Another crack at a handshake upcall mechanism

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5299 this patch: 5299
netdev/cc_maintainers warning 4 maintainers not CCed: dsahern@kernel.org razor@blackwall.org keescook@chromium.org davem@davemloft.net
netdev/build_clang fail Errors and warnings before: 1084 this patch: 1090
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 5514 this patch: 5518
netdev/checkpatch warning CHECK: Comparison to NULL could be written "!net->hs_sock" CHECK: Please use a blank line after function/struct/union/enum declarations CHECK: extern prototypes should be avoided in .h files WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: networking block comments don't use an empty /* line, use /* Comment...
netdev/kdoc fail Errors and warnings before: 0 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Chuck Lever III Feb. 7, 2023, 9:41 p.m. UTC
When a kernel consumer needs a transport layer security session, it
first needs a handshake to negotiate and establish a session. This
negotiation can be done in user space via one of the several
existing library implementations, or it can be done in the kernel.

No in-kernel handshake implementations yet exist. In their absence,
we add a netlink service akin to NETLINK_ROUTE that can:

a. Notify a user space daemon that a handshake is needed.

b. Once notified, the daemon calls the kernel back via this
   netlink service to get the handshake parameters, including an
   open socket on which to establish the session.

The notification service uses a multicast group. Each handshake
protocol (eg, TLSv1.3, PSP, etc) adopts its own group number so that
the user space daemons for performing the handshakes are completely
independent of one another. The kernel can then tell via
netlink_has_listeners() whether a user space daemon is active and
can handle a handshake request for the desired security layer
protocol.

A new netlink operation, ACCEPT, acts like accept(2) in that it
instantiates a file descriptor in the user space daemon's fd table.
If this operation is successful, the reply carries the fd number,
which can be treated as an open and ready file descriptor.

While user space is performing the handshake, the kernel keeps its
muddy paws off the open socket. The act of closing the user space
file descriptor alerts the kernel that the open socket is safe to
use again. When the user daemon completes a handshake, the kernel is
responsible for checking that a valid transport layer security
session has been established.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/net/handshake.h            |   37 ++++
 include/net/net_namespace.h        |    1 
 include/net/sock.h                 |    1 
 include/uapi/linux/handshake.h     |   65 +++++++
 include/uapi/linux/netlink.h       |    1 
 net/Makefile                       |    1 
 net/handshake/Makefile             |   11 +
 net/handshake/netlink.c            |  320 ++++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/netlink.h |    1 
 9 files changed, 438 insertions(+)
 create mode 100644 include/net/handshake.h
 create mode 100644 include/uapi/linux/handshake.h
 create mode 100644 net/handshake/Makefile
 create mode 100644 net/handshake/netlink.c
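
For readers following along, the subscription step described in the commit message could look roughly like this from user space. It is a sketch under stated assumptions, not code from the series: NETLINK_HANDSHAKE (23) is the value this patch proposes and is not in mainline headers, the group number is hypothetical, and handshake_listen() is a name made up for illustration. On a kernel without this patch the socket() call is expected to fail.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

/* Value proposed by this patch; not present in mainline headers. */
#ifndef NETLINK_HANDSHAKE
#define NETLINK_HANDSHAKE 23
#endif

/* Hypothetical multicast group for one handshake protocol. */
#define EXAMPLE_HANDSHAKE_GROUP 1

/*
 * Open a netlink socket and join one multicast group so the kernel
 * can see (via netlink_has_listeners()) that a daemon is present.
 * Returns the socket fd, or -1 with errno set.
 */
int handshake_listen(unsigned int group)
{
	struct sockaddr_nl addr;
	int fd, saved;

	/* Netlink group numbers start at 1. */
	if (group == 0) {
		errno = EINVAL;
		return -1;
	}

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_HANDSHAKE);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto fail;

	if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
		       &group, sizeof(group)) < 0)
		goto fail;

	return fd;

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

A daemon would call handshake_listen(EXAMPLE_HANDSHAKE_GROUP) once at startup and then wait for notifications on the returned descriptor.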

Comments

Hannes Reinecke Feb. 8, 2023, 4:20 p.m. UTC | #1
On 2/7/23 22:41, Chuck Lever wrote:
[ .. ]
Looks good at first glance; I'll give it a go on my testbed.

Cheers,

Hannes
Jakub Kicinski Feb. 9, 2023, 6 a.m. UTC | #2
On Tue, 07 Feb 2023 16:41:13 -0500 Chuck Lever wrote:
> diff --git a/tools/include/uapi/linux/netlink.h b/tools/include/uapi/linux/netlink.h
> index 0a4d73317759..a269d356f358 100644
> --- a/tools/include/uapi/linux/netlink.h
> +++ b/tools/include/uapi/linux/netlink.h
> @@ -29,6 +29,7 @@
>  #define NETLINK_RDMA		20
>  #define NETLINK_CRYPTO		21	/* Crypto layer */
>  #define NETLINK_SMC		22	/* SMC monitoring */
> +#define NETLINK_HANDSHAKE	23	/* transport layer sec handshake requests */

The extra indirection of genetlink introduces some complications?
Chuck Lever III Feb. 9, 2023, 3:43 p.m. UTC | #3
> On Feb 9, 2023, at 1:00 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Tue, 07 Feb 2023 16:41:13 -0500 Chuck Lever wrote:
>> diff --git a/tools/include/uapi/linux/netlink.h b/tools/include/uapi/linux/netlink.h
>> index 0a4d73317759..a269d356f358 100644
>> --- a/tools/include/uapi/linux/netlink.h
>> +++ b/tools/include/uapi/linux/netlink.h
>> @@ -29,6 +29,7 @@
>> #define NETLINK_RDMA		20
>> #define NETLINK_CRYPTO		21	/* Crypto layer */
>> #define NETLINK_SMC		22	/* SMC monitoring */
>> +#define NETLINK_HANDSHAKE	23	/* transport layer sec handshake requests */
> 
> The extra indirection of genetlink introduces some complications?

I don't think it does, necessarily. But neither does it seem
to add any value (for this use case). <shrug>


--
Chuck Lever
Paolo Abeni Feb. 9, 2023, 4:02 p.m. UTC | #4
On Thu, 2023-02-09 at 15:43 +0000, Chuck Lever III wrote:
> > On Feb 9, 2023, at 1:00 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> > 
> > On Tue, 07 Feb 2023 16:41:13 -0500 Chuck Lever wrote:
> > > diff --git a/tools/include/uapi/linux/netlink.h
> > > b/tools/include/uapi/linux/netlink.h
> > > index 0a4d73317759..a269d356f358 100644
> > > --- a/tools/include/uapi/linux/netlink.h
> > > +++ b/tools/include/uapi/linux/netlink.h
> > > @@ -29,6 +29,7 @@
> > > #define NETLINK_RDMA		20
> > > #define NETLINK_CRYPTO		21	/* Crypto layer */
> > > #define NETLINK_SMC		22	/* SMC monitoring */
> > > +#define NETLINK_HANDSHAKE	23	/* transport layer sec
> > > handshake requests */
> > 
> > The extra indirection of genetlink introduces some complications?
> 
> I don't think it does, necessarily. But neither does it seem
> to add any value (for this use case). <shrug>

To me it introduces a good separation between the handshake mechanism
itself and the current subject (sock).

IIRC the previous version allowed user space to create a socket of
the HANDSHAKE family which in turn accept()ed tcp sockets. That kind of
construct - assuming I interpreted it correctly - did not sound right
to me.

Back to these patches, they look sane to me, even if the whole
architecture is a bit hard to follow, given the non-trivial cross
references between the patches - I may well have missed some relevant
point.

I'm wondering if this approach scales well enough with the number of
concurrent handshakes: the single list looks like a potential
bottleneck.

Cheers,

Paolo
Chuck Lever III Feb. 9, 2023, 4:34 p.m. UTC | #5
> On Feb 9, 2023, at 11:02 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> On Thu, 2023-02-09 at 15:43 +0000, Chuck Lever III wrote:
>>> On Feb 9, 2023, at 1:00 AM, Jakub Kicinski <kuba@kernel.org> wrote:
>>> 
>>> On Tue, 07 Feb 2023 16:41:13 -0500 Chuck Lever wrote:
>>>> diff --git a/tools/include/uapi/linux/netlink.h
>>>> b/tools/include/uapi/linux/netlink.h
>>>> index 0a4d73317759..a269d356f358 100644
>>>> --- a/tools/include/uapi/linux/netlink.h
>>>> +++ b/tools/include/uapi/linux/netlink.h
>>>> @@ -29,6 +29,7 @@
>>>> #define NETLINK_RDMA		20
>>>> #define NETLINK_CRYPTO		21	/* Crypto layer */
>>>> #define NETLINK_SMC		22	/* SMC monitoring */
>>>> +#define NETLINK_HANDSHAKE	23	/* transport layer sec
>>>> handshake requests */
>>> 
>>> The extra indirection of genetlink introduces some complications?
>> 
>> I don't think it does, necessarily. But neither does it seem
>> to add any value (for this use case). <shrug>
> 
> To me it introduces a good separation between the handshake mechanism
> itself and the current subject (sock).
> 
> IIRC the previous version allowed the user-space to create a socket of
> the HANDSHAKE family which in turn accept()ed tcp sockets. That kind of
> construct - assuming I interpreted it correctly - did not sound right
> to me.
> 
> Back to these patches, they looks sane to me, even if the whole
> architecture is a bit hard to follow, given the non trivial cross
> references between the patches - I can likely have missed some relevant
> point.

One of the original goals was to support other security protocols
besides TLS v1.3, which is why the code is split between two
patches. I know that is cumbersome for some review workflows.

Now is a good time to simplify, if we see a sensible opportunity
to do so.


> I'm wondering if this approach scales well enough with the number of
> concurrent handshakes: the single list looks like a potential bottle-
> neck.

It's not clear how much scaling is needed. I don't have a strong
sense of how frequently a busy storage server will need a handshake,
for instance, but it seems like it would be relatively less frequent
than, say, I/O. Network storage connections are typically long-lived,
unlike http.

In terms of scalability, I am a little more concerned about the
handshake_mutex. Maybe that isn't needed since the pending list is
spinlock protected?

All that said, the single pending list can be replaced easily. It
would be straightforward to move it into struct net, for example.


--
Chuck Lever
Jakub Kicinski Feb. 10, 2023, 2:07 a.m. UTC | #6
On Thu, 9 Feb 2023 15:43:06 +0000 Chuck Lever III wrote:
> >> @@ -29,6 +29,7 @@
> >> #define NETLINK_RDMA		20
> >> #define NETLINK_CRYPTO		21	/* Crypto layer */
> >> #define NETLINK_SMC		22	/* SMC monitoring */
> >> +#define NETLINK_HANDSHAKE	23	/* transport layer sec handshake requests */  
> > 
> > The extra indirection of genetlink introduces some complications?  
> 
> I don't think it does, necessarily. But neither does it seem
> to add any value (for this use case). <shrug>

Our default is to go for generic netlink, it's where we invest most time
in terms of infrastructure. It can take care of attribute parsing, and
allows user space to dump information about the family (eg. the parsing
policy). Most recently we added support for writing the policies in
(simple) YAML:

https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html

It takes care of a lot of the netlink tedium. If you're willing to make
use of it I'm happy to help converting etc. We merged this stuff last
month, so there are likely sharp edges.
Paolo Abeni Feb. 10, 2023, 11:41 a.m. UTC | #7
On Thu, 2023-02-09 at 16:34 +0000, Chuck Lever III wrote:
> 
> > On Feb 9, 2023, at 11:02 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > On Thu, 2023-02-09 at 15:43 +0000, Chuck Lever III wrote:
> > > > On Feb 9, 2023, at 1:00 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> > > > 
> > > > On Tue, 07 Feb 2023 16:41:13 -0500 Chuck Lever wrote:
> > > > > diff --git a/tools/include/uapi/linux/netlink.h
> > > > > b/tools/include/uapi/linux/netlink.h
> > > > > index 0a4d73317759..a269d356f358 100644
> > > > > --- a/tools/include/uapi/linux/netlink.h
> > > > > +++ b/tools/include/uapi/linux/netlink.h
> > > > > @@ -29,6 +29,7 @@
> > > > > #define NETLINK_RDMA		20
> > > > > #define NETLINK_CRYPTO		21	/* Crypto layer */
> > > > > #define NETLINK_SMC		22	/* SMC monitoring */
> > > > > +#define NETLINK_HANDSHAKE	23	/* transport layer sec
> > > > > handshake requests */
> > > > 
> > > > The extra indirection of genetlink introduces some complications?
> > > 
> > > I don't think it does, necessarily. But neither does it seem
> > > to add any value (for this use case). <shrug>
> > 
> > To me it introduces a good separation between the handshake mechanism
> > itself and the current subject (sock).
> > 
> > IIRC the previous version allowed the user-space to create a socket of
> > the HANDSHAKE family which in turn accept()ed tcp sockets. That kind of
> > construct - assuming I interpreted it correctly - did not sound right
> > to me.
> > 
> > Back to these patches, they looks sane to me, even if the whole
> > architecture is a bit hard to follow, given the non trivial cross
> > references between the patches - I can likely have missed some relevant
> > point.
> 
> One of the original goals was to support other security protocols
> besides TLS v1.3, which is why the code is split between two
> patches. I know that is cumbersome for some review workflows.
> 
> Now is a good time to simplify, if we see a sensible opportunity
> to do so.

I think that adding a 'hi_free'/'hi_release' op inside the
handshake_info struct - and moving the handshake info deallocation
inside the 'core' - could possibly simplify the architecture a bit.

Since it looks like there is reasonable agreement on this path
(@Dave, @Eric, @Jakub: please educate me otherwise!), and no
clear/immediate show stoppers, I suggest starting to hammer out some
documentation with a high-level overview, which will also help with
understanding/reviewing the code.

> > I'm wondering if this approach scales well enough with the number of
> > concurrent handshakes: the single list looks like a potential bottle-
> > neck.
> 
> It's not clear how much scaling is needed. I don't have a strong
> sense of how frequently a busy storage server will need a handshake,
> for instance, but it seems like it would be relatively less frequent
> than, say, I/O. Network storage connections are typically long-lived,
> unlike http.
> 
> In terms of scalability, I am a little more concerned about the
> handshake_mutex. Maybe that isn't needed since the pending list is
> spinlock protected?

Good point. Indeed it looks like that is not needed.

> All that said, the single pending list can be replaced easily. It
> would be straightforward to move it into struct net, for example.

In the end I don't see any operations needing a full list traversal.
handshake_nl_msg_accept walks the list, but it stops at the first
netns/proto match, which should be ~always/~very soon in the typical
use-case. And as you said it should be easy to avoid even that.

I think it could be useful to limit the number of pending handshakes to
some maximum, to avoid problems in pathological/malicious scenarios.
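
To make the early-exit point concrete, here is a user-space model of such a walk. It is only a sketch: the struct and function names are hypothetical stand-ins, a plain singly linked list replaces the kernel list, and locking is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pending handshake request. */
struct hs_req_model {
	struct hs_req_model *next;
	const void *netns;	/* stands in for struct net * */
	int proto;		/* handshake protocol number */
};

/*
 * Unlink and return the first request matching netns and proto.
 * The walk stops at the first match, so when most pending entries
 * belong to the caller's netns/proto it ends after very few steps.
 * Returns NULL if nothing matches.
 */
struct hs_req_model *hs_take_first_match(struct hs_req_model **head,
					 const void *netns, int proto)
{
	struct hs_req_model **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct hs_req_model *req = *pp;

		if (req->netns == netns && req->proto == proto) {
			*pp = req->next;	/* unlink from the list */
			req->next = NULL;
			return req;
		}
	}
	return NULL;
}
```

Only a lookup with no matching entry walks the whole list, which is the case a per-netns list would avoid entirely.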

Cheers,

Paolo
Chuck Lever III Feb. 10, 2023, 2:17 p.m. UTC | #8
> On Feb 9, 2023, at 9:07 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Thu, 9 Feb 2023 15:43:06 +0000 Chuck Lever III wrote:
>>>> @@ -29,6 +29,7 @@
>>>> #define NETLINK_RDMA		20
>>>> #define NETLINK_CRYPTO		21	/* Crypto layer */
>>>> #define NETLINK_SMC		22	/* SMC monitoring */
>>>> +#define NETLINK_HANDSHAKE	23	/* transport layer sec handshake requests */  
>>> 
>>> The extra indirection of genetlink introduces some complications?  
>> 
>> I don't think it does, necessarily. But neither does it seem
>> to add any value (for this use case). <shrug>
> 
> Our default is to go for generic netlink, it's where we invest most time
> in terms of infrastructure.

v2 of the series used generic netlink for the downcall piece.
I can convert back to using generic netlink for v4 of the
series.


> It can take care of attribute parsing, and
> allows user space to dump information about the family (eg. the parsing
> policy). Most recently we added support for writing the policies in
> (simple) YAML:
> 
> https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html
> 
> It takes care of a lot of the netlink tedium. If you're willing to make
> use of it I'm happy to help converting etc. We merged this stuff last
> month, so there are likely sharp edges.

--
Chuck Lever
Chuck Lever III Feb. 10, 2023, 2:31 p.m. UTC | #9
> On Feb 10, 2023, at 6:41 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> On Thu, 2023-02-09 at 16:34 +0000, Chuck Lever III wrote:
>> 
>>> On Feb 9, 2023, at 11:02 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>>> 
>>> On Thu, 2023-02-09 at 15:43 +0000, Chuck Lever III wrote:
>>>>> On Feb 9, 2023, at 1:00 AM, Jakub Kicinski <kuba@kernel.org> wrote:
>>>>> 
>>>>> On Tue, 07 Feb 2023 16:41:13 -0500 Chuck Lever wrote:
>>>>>> diff --git a/tools/include/uapi/linux/netlink.h
>>>>>> b/tools/include/uapi/linux/netlink.h
>>>>>> index 0a4d73317759..a269d356f358 100644
>>>>>> --- a/tools/include/uapi/linux/netlink.h
>>>>>> +++ b/tools/include/uapi/linux/netlink.h
>>>>>> @@ -29,6 +29,7 @@
>>>>>> #define NETLINK_RDMA		20
>>>>>> #define NETLINK_CRYPTO		21	/* Crypto layer */
>>>>>> #define NETLINK_SMC		22	/* SMC monitoring */
>>>>>> +#define NETLINK_HANDSHAKE	23	/* transport layer sec
>>>>>> handshake requests */
>>>>> 
>>>>> The extra indirection of genetlink introduces some complications?
>>>> 
>>>> I don't think it does, necessarily. But neither does it seem
>>>> to add any value (for this use case). <shrug>
>>> 
>>> To me it introduces a good separation between the handshake mechanism
>>> itself and the current subject (sock).
>>> 
>>> IIRC the previous version allowed the user-space to create a socket of
>>> the HANDSHAKE family which in turn accept()ed tcp sockets. That kind of
>>> construct - assuming I interpreted it correctly - did not sound right
>>> to me.
>>> 
>>> Back to these patches, they looks sane to me, even if the whole
>>> architecture is a bit hard to follow, given the non trivial cross
>>> references between the patches - I can likely have missed some relevant
>>> point.
>> 
>> One of the original goals was to support other security protocols
>> besides TLS v1.3, which is why the code is split between two
>> patches. I know that is cumbersome for some review workflows.
>> 
>> Now is a good time to simplify, if we see a sensible opportunity
>> to do so.
> 
> I think that adding a 'hi_free'/'hi_release' op inside the
> handshake_info struct - and moving the handshake info deallocation
> inside the 'core' could possibly simplify a bit the architecture.

I'm concerned about lifetime issues for handshake_info. I was
thinking maybe these objects need to be reference-counted as
well. I'll experiment with adding a destructor method too.
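
A rough user-space model of that combination (reference count plus a release op) might look like the following. It is a sketch under assumptions: the struct layout and the names hs_info_get()/hs_info_put() are hypothetical, and a plain counter stands in for what would be refcount_t or a kref in the kernel.

```c
#include <stddef.h>

struct hs_info_model;

/* Protocol-specific operations; hi_release is the destructor. */
struct hs_info_ops {
	void (*hi_release)(struct hs_info_model *hi);
};

/* Hypothetical stand-in for struct handshake_info. */
struct hs_info_model {
	unsigned int refs;	/* not atomic: model only */
	const struct hs_info_ops *ops;
	void *priv;
};

/* Take an additional reference. */
struct hs_info_model *hs_info_get(struct hs_info_model *hi)
{
	hi->refs++;
	return hi;
}

/*
 * Drop a reference. When the last one goes away, the core calls the
 * protocol's release op. Returns 1 if the object was released.
 */
int hs_info_put(struct hs_info_model *hi)
{
	if (--hi->refs)
		return 0;
	if (hi->ops && hi->ops->hi_release)
		hi->ops->hi_release(hi);
	return 1;
}

/* Example destructor: records the release in the int that priv points to. */
void example_release(struct hs_info_model *hi)
{
	*(int *)hi->priv = 1;
}
```

With this shape the deallocation decision lives in one place in the core, and each handshake protocol only supplies its own release op.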


> Since it looks like there is a reasonable agreement on this path
> (@Dave, @Eric, @Jakub: please educate me otherwise!), and no
> clear/immediate show stoppers, I suggested start hammering some
> documentation with an high level overview that will help also
> understanding/reviewing the code.

In previous generations of this series, there was an addition
to Documentation/ that explained how kernel TLS consumers use
the tls_ handshake API. I can add that back now that things
are settling down.

But maybe you are thinking of some other topics. I'm happy to
write down whatever is needed, but I'd like suggestions about
what particular areas would be most helpful.


>>> I'm wondering if this approach scales well enough with the number of
>>> concurrent handshakes: the single list looks like a potential bottle-
>>> neck.
>> 
>> It's not clear how much scaling is needed. I don't have a strong
>> sense of how frequently a busy storage server will need a handshake,
>> for instance, but it seems like it would be relatively less frequent
>> than, say, I/O. Network storage connections are typically long-lived,
>> unlike http.
>> 
>> In terms of scalability, I am a little more concerned about the
>> handshake_mutex. Maybe that isn't needed since the pending list is
>> spinlock protected?
> 
> Good point. Indeed it looks like that is not needed.

I will remove the handshake_mutex in v4.


>> All that said, the single pending list can be replaced easily. It
>> would be straightforward to move it into struct net, for example.
> 
> In the end I don't see a operations needing a full list traversal.
> handshake_nl_msg_accept walk that, but it stops at netns/proto matching
> which should be ~always /~very soon in the typical use-case. And as you
> said it should be easy to avoid even that.
> 
> I think it could be useful limiting the number of pending handshake to
> some maximum, to avoid problems in pathological/malicious scenarios.

Defending against DoS is sensible. Maybe having a per-net
maximum of 5 or 10 pending handshakes? handshake_request() can
return an error code if a handshake is requested while we're
over that maximum.
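
As a sketch of what that check could look like (user-space model only; the limit of 10, the -EBUSY return value, and all names here are assumptions picked for illustration):

```c
#include <errno.h>

/* Hypothetical per-net cap on pending handshakes. */
#define HANDSHAKE_MAX_PENDING 10

/* Stands in for the per-net state that would live in struct net. */
struct net_model {
	unsigned int hs_pending;
};

/*
 * Model of the admission check in handshake_request(): refuse new
 * requests once the per-net count reaches the cap.
 * Returns 0 on success or a negative errno.
 */
int handshake_request_model(struct net_model *net)
{
	if (net->hs_pending >= HANDSHAKE_MAX_PENDING)
		return -EBUSY;	/* assumed error code */
	net->hs_pending++;
	return 0;
}

/* Called when a handshake completes or is cancelled. */
void handshake_complete_model(struct net_model *net)
{
	if (net->hs_pending > 0)
		net->hs_pending--;
}
```

The real thing would need the counter updated under the same lock as the pending list, or made atomic.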


--
Chuck Lever
Hannes Reinecke Feb. 10, 2023, 3:06 p.m. UTC | #10
On 2/10/23 15:31, Chuck Lever III wrote:
> 
> 
>> On Feb 10, 2023, at 6:41 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On Thu, 2023-02-09 at 16:34 +0000, Chuck Lever III wrote:
>>>
[ .. ]
>>> All that said, the single pending list can be replaced easily. It
>>> would be straightforward to move it into struct net, for example.
>>
>> In the end I don't see a operations needing a full list traversal.
>> handshake_nl_msg_accept walk that, but it stops at netns/proto matching
>> which should be ~always /~very soon in the typical use-case. And as you
>> said it should be easy to avoid even that.
>>
>> I think it could be useful limiting the number of pending handshake to
>> some maximum, to avoid problems in pathological/malicious scenarios.
> 
> Defending against DoS is sensible. Maybe having a per-net
> maximum of 5 or 10 pending handshakes? handshake_request() can
> return an error code if a handshake is requested while we're
> over that maximum.
> 
Can we check the source settings? Having more than one handshake in the 
queue coming from the same SRC IP/SRC Port seems a bit pointless, 
doesn't it?
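
Something along these lines is what I have in mind, as a user-space model. The names are made up, it only covers IPv4, and a flat array stands in for the pending list:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of where a pending handshake came from. */
struct hs_src_model {
	uint32_t addr;	/* source IPv4 address */
	uint16_t port;	/* source port */
};

/*
 * Return true if a handshake from the same source address and port
 * is already queued, in which case a second request can be refused.
 */
bool hs_src_already_pending(const struct hs_src_model *pending, size_t n,
			    struct hs_src_model src)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (pending[i].addr == src.addr &&
		    pending[i].port == src.port)
			return true;
	}
	return false;
}
```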

Cheers,

Hannes
Paolo Abeni Feb. 10, 2023, 3:21 p.m. UTC | #11
On Fri, 2023-02-10 at 14:31 +0000, Chuck Lever III wrote:
> In previous generations of this series, there was an addition
> to Documentation/ that explained how kernel TLS consumers use
> the tls_ handshake API. I can add that back now that things
> are settling down.

That would be useful, thanks!

> But maybe you are thinking of some other topics. I'm happy to
> write down whatever is needed, but I'd like suggestions about
> what particular areas would be most helpful.

A reference user-space implementation would be very interesting, too. 

Even a completely "dummy" one for self-test purposes only could be
useful.

Speaking of that, at some point we will need some self-tests ;)

> > > > I'm wondering if this approach scales well enough with the number of
> > > > concurrent handshakes: the single list looks like a potential bottle-
> > > > neck.
> > > 
> > > It's not clear how much scaling is needed. I don't have a strong
> > > sense of how frequently a busy storage server will need a handshake,
> > > for instance, but it seems like it would be relatively less frequent
> > > than, say, I/O. Network storage connections are typically long-lived,
> > > unlike http.
> > > 
> > > In terms of scalability, I am a little more concerned about the
> > > handshake_mutex. Maybe that isn't needed since the pending list is
> > > spinlock protected?
> > 
> > Good point. Indeed it looks like that is not needed.
> 
> I will remove the handshake_mutex in v4.
> 
> 
> > > All that said, the single pending list can be replaced easily. It
> > > would be straightforward to move it into struct net, for example.
> > 
> > In the end I don't see a operations needing a full list traversal.
> > handshake_nl_msg_accept walk that, but it stops at netns/proto matching
> > which should be ~always /~very soon in the typical use-case. And as you
> > said it should be easy to avoid even that.
> > 
> > I think it could be useful limiting the number of pending handshake to
> > some maximum, to avoid problems in pathological/malicious scenarios.
> 
> Defending against DoS is sensible. Maybe having a per-net
> maximum of 5 or 10 pending handshakes? handshake_request() can
> return an error code if a handshake is requested while we're
> over that maximum.

I'm wondering if we could use {r,w}mem-based limits, so that
user space could eventually tune them as/if needed without any
additional knob.

Cheers,

Paolo
Chuck Lever III Feb. 10, 2023, 3:38 p.m. UTC | #12
> On Feb 10, 2023, at 10:21 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> On Fri, 2023-02-10 at 14:31 +0000, Chuck Lever III wrote:
>> In previous generations of this series, there was an addition
>> to Documentation/ that explained how kernel TLS consumers use
>> the tls_ handshake API. I can add that back now that things
>> are settling down.
> 
> That would be useful, thank!
> 
>> But maybe you are thinking of some other topics. I'm happy to
>> write down whatever is needed, but I'd like suggestions about
>> what particular areas would be most helpful.
> 
> A reference user-space implementation would be very interesting, too.

We've got one of those, specifically for TLSv1.3:

   https://github.com/oracle/ktls-utils

netlink support is added on the "netlink" branch. The user space
handshake agent for TLS is under src/tlshd. The netlink stuff is
pretty fresh, so there's clean-up to be done.


> Even a completely "dummy" one for self-tests purpose only could be
> useful. 
> 
> Speaking of that, at some point we will need some self-tests ;)

Jakub mentioned that during the first round of review last year.
I've got some KUnit chops, so I can construct tests. But I'm
coming up empty on exactly what would need to be tested. Right,
maybe KUnit is the wrong tool for this job...


>>>>> I'm wondering if this approach scales well enough with the number of
>>>>> concurrent handshakes: the single list looks like a potential bottle-
>>>>> neck.
>>>> 
>>>> It's not clear how much scaling is needed. I don't have a strong
>>>> sense of how frequently a busy storage server will need a handshake,
>>>> for instance, but it seems like it would be relatively less frequent
>>>> than, say, I/O. Network storage connections are typically long-lived,
>>>> unlike http.
>>>> 
>>>> In terms of scalability, I am a little more concerned about the
>>>> handshake_mutex. Maybe that isn't needed since the pending list is
>>>> spinlock protected?
>>> 
>>> Good point. Indeed it looks like that is not needed.
>> 
>> I will remove the handshake_mutex in v4.
>> 
>> 
>>>> All that said, the single pending list can be replaced easily. It
>>>> would be straightforward to move it into struct net, for example.
>>> 
>>> In the end I don't see a operations needing a full list traversal.
>>> handshake_nl_msg_accept walk that, but it stops at netns/proto matching
>>> which should be ~always /~very soon in the typical use-case. And as you
>>> said it should be easy to avoid even that.
>>> 
>>> I think it could be useful limiting the number of pending handshake to
>>> some maximum, to avoid problems in pathological/malicious scenarios.
>> 
>> Defending against DoS is sensible. Maybe having a per-net
>> maximum of 5 or 10 pending handshakes? handshake_request() can
>> return an error code if a handshake is requested while we're
>> over that maximum.
> 
> I'm wondering if we could use an {r,w}mem based limits, so that the
> user-space could eventually tune it as/if needed without any additional
> knob.


--
Chuck Lever
Jakub Kicinski Feb. 10, 2023, 6:09 p.m. UTC | #13
On Fri, 10 Feb 2023 14:17:28 +0000 Chuck Lever III wrote:
> >> I don't think it does, necessarily. But neither does it seem
> >> to add any value (for this use case). <shrug>  
> > 
> > Our default is to go for generic netlink, it's where we invest most time
> > in terms of infrastructure.  
> 
> v2 of the series used generic netlink for the downcall piece.
> I can convert back to using generic netlink for v4 of the
> series.

Would you be able to write the spec for it? I'm happy to help with that
as I mentioned. Perhaps you have the user space already hand-written
here but in case the mechanism/family gets reused it'd be sad if people
had to hand write bindings for other programming languages.
Chuck Lever III Feb. 10, 2023, 7:04 p.m. UTC | #14
> On Feb 10, 2023, at 1:09 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Fri, 10 Feb 2023 14:17:28 +0000 Chuck Lever III wrote:
>>>> I don't think it does, necessarily. But neither does it seem
>>>> to add any value (for this use case). <shrug>  
>>> 
>>> Our default is to go for generic netlink, it's where we invest most time
>>> in terms of infrastructure.  
>> 
>> v2 of the series used generic netlink for the downcall piece.
>> I can convert back to using generic netlink for v4 of the
>> series.
> 
> Would you be able to write the spec for it? I'm happy to help with that
> as I mentioned.

I'm coming from an RPC background, where we usually start from an
XDR protocol specification. So, I'm used to that, and it might
give us some new ideas about protocol correctness or
simplification.

Point me to a sample spec or maybe a language reference and we
can discuss it further.


> Perhaps you have the user space already hand-written
> here but in case the mechanism/family gets reused it'd be sad if people
> had to hand write bindings for other programming languages.

Yes, the user space implementation is currently hand-written C,
but it can easily be converted to machine-generated if you have
a favorite tool to do that.


--
Chuck Lever
Jakub Kicinski Feb. 10, 2023, 9:44 p.m. UTC | #15
On Fri, 10 Feb 2023 19:04:34 +0000 Chuck Lever III wrote:
> >> v2 of the series used generic netlink for the downcall piece.
> >> I can convert back to using generic netlink for v4 of the
> >> series.  
> > 
> > Would you be able to write the spec for it? I'm happy to help with that
> > as I mentioned.  
> 
> I'm coming from an RPC background, we usually do start from an
> XDR protocol specification. So, I'm used to that, and it might
> give us some new ideas about protocol correctness or
> simplification.

Nice, our thing is completely homegrown and unprofessional.
Hopefully it won't make you run away.

> Point me to a sample spec or maybe a language reference and we
> can discuss it further.

There are only two specs so far in net-next:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/Documentation/netlink/specs

Neither of these is great (fou is a bit legacy, and ethtool is not
fully expressed), a better example may be this one which is pending 
in the bpf-next tree:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/Documentation/netlink/specs/netdev.yaml

There is a JSON schema spec (which may be useful for checking available
fields quickly):

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/Documentation/netlink/genetlink.yaml

And (uncharacteristically?), docs:

https://docs.kernel.org/next/userspace-api/netlink/index.html

> > Perhaps you have the user space already hand-written
> > here but in case the mechanism/family gets reused it'd be sad if people
> > had to hand write bindings for other programming languages.  
> 
> Yes, the user space implementation is currently hand-written C,
> but it can easily be converted to machine-generated if you have
> a favorite tool to do that.

I started hacking on a code generator for C in net-next in
tools/net/ynl/ynl-gen-c.py but it's likely bitrotted already.
I don't actually have a strong user in C to justify the time
investment. All the cool kids these days want to use Rust or Go
(and the less cool C++). For development I use Python
(tools/net/ynl/cli.py tools/net/ynl/lib/).

It should work fairly well for generating the kernel bits 
(uAPI header, policy and op tables).
Hannes Reinecke Feb. 11, 2023, 12:11 p.m. UTC | #16
On 2/10/23 19:09, Jakub Kicinski wrote:
> On Fri, 10 Feb 2023 14:17:28 +0000 Chuck Lever III wrote:
>>>> I don't think it does, necessarily. But neither does it seem
>>>> to add any value (for this use case). <shrug>
>>>
>>> Our default is to go for generic netlink, it's where we invest most time
>>> in terms of infrastructure.
>>
>> v2 of the series used generic netlink for the downcall piece.
>> I can convert back to using generic netlink for v4 of the
>> series.
> 
> Would you be able to write the spec for it? I'm happy to help with that
> as I mentioned. Perhaps you have the user space already hand-written
> here but in case the mechanism/family gets reused it'd be sad if people
> had to hand write bindings for other programming languages.

Can you send me a pointer to the YAML specification (and parser)?
I couldn't find anything in the linux sources; but maybe I'm looking in 
the wrong tree or somesuch.

Cheers,

Hannes
Chuck Lever III Feb. 11, 2023, 8:55 p.m. UTC | #17
> On Feb 10, 2023, at 4:44 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Fri, 10 Feb 2023 19:04:34 +0000 Chuck Lever III wrote:
>>>> v2 of the series used generic netlink for the downcall piece.
>>>> I can convert back to using generic netlink for v4 of the
>>>> series.  
>>> 
>>> Would you be able to write the spec for it? I'm happy to help with that
>>> as I mentioned.  
>> 
>> I'm coming from an RPC background, we usually do start from an
>> XDR protocol specification. So, I'm used to that, and it might
>> give us some new ideas about protocol correctness or
>> simplification.
> 
> Nice, our thing is completely homegrown and unprofessional.
> Hopefully it won't make you run away.
> 
>> Point me to a sample spec or maybe a language reference and we
>> can discuss it further.
> 
> There are only two specs so far in net-next:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/Documentation/netlink/specs
> 
> Neither of these is great (fou is a bit legacy, and ethtool is not
> fully expressed), a better example may be this one which is pending 
> in the bpf-next tree:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/Documentation/netlink/specs/netdev.yaml
> 
> There is a JSON schema spec (which may be useful for checking available
> fields quickly):
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/Documentation/netlink/genetlink.yaml
> 
> And (uncharacteristically?), docs:
> 
> https://docs.kernel.org/next/userspace-api/netlink/index.html

Based on this reply I was unsure whether you wanted an English
spec (similar to an Internet Draft) or a machine-readable one.

But now that I look at these, I think I get it: you'd like a
YAML file that can be used with tools to either generate a
parser or maybe do some correctness analysis.

I think others will benefit as more security protocols come
to this party, so it's a good thing to do for extensibility.

I will look into this for v5 definitely and maybe v4. v4
already has significant churn...


>>> Perhaps you have the user space already hand-written
>>> here but in case the mechanism/family gets reused it'd be sad if people
>>> had to hand write bindings for other programming languages.  
>> 
>> Yes, the user space implementation is currently hand-written C,
>> but it can easily be converted to machine-generated if you have
>> a favorite tool to do that.
> 
> I started hacking on a code generator for C in net-next in
> tools/net/ynl/ynl-gen-c.py but it's likely bitrotted already.
> I don't actually have a strong user in C to justify the time
> investment. All the cool kids these days want to use Rust or Go
> (and the less cool C++). For development I use Python
> (tools/net/ynl/cli.py tools/net/ynl/lib/).
> 
> It should work fairly well for generating the kernel bits 
> (uAPI header, policy and op tables).

Makes sense.


--
Chuck Lever
Jamal Hadi Salim Feb. 12, 2023, 3:40 p.m. UTC | #18
On Thu, Feb 9, 2023 at 11:36 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Feb 9, 2023, at 11:02 AM, Paolo Abeni <pabeni@redhat.com> wrote:

[..]

> > IIRC the previous version allowed the user-space to create a socket of
> > the HANDSHAKE family which in turn accept()ed tcp sockets. That kind of
> > construct - assuming I interpreted it correctly - did not sound right
> > to me.
> >
> > Back to these patches, they looks sane to me, even if the whole
> > architecture is a bit hard to follow, given the non trivial cross
> > references between the patches - I can likely have missed some relevant
> > point.
>
> One of the original goals was to support other security protocols
> besides TLS v1.3, which is why the code is split between two
> patches. I know that is cumbersome for some review workflows.
>
> Now is a good time to simplify, if we see a sensible opportunity
> to do so.
>
>
> > I'm wondering if this approach scales well enough with the number of
> > concurrent handshakes: the single list looks like a potential bottle-
> > neck.
>
> It's not clear how much scaling is needed. I don't have a strong
> sense of how frequently a busy storage server will need a handshake,
> for instance, but it seems like it would be relatively less frequent
> than, say, I/O. Network storage connections are typically long-lived,
> unlike http.
>

So this is for storage type traffic only? Assuming TCP/NVME probably.
IOW, is it worth doing the handshake via the kernel for a short flow?
We have done some analysis and TLS handshake certainly affects overall
performance, so moving it into the kernel is a good idea. Question is
how would this interact with a) KTLS b) KTLS offload c) user space
type of crypto needs like quic, etc?

cheers,
jamal
Chuck Lever III Feb. 12, 2023, 5:24 p.m. UTC | #19
Hello Jamal-

> On Feb 12, 2023, at 10:40 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> 
> On Thu, Feb 9, 2023 at 11:36 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Feb 9, 2023, at 11:02 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> [..]
> 
>>> IIRC the previous version allowed the user-space to create a socket of
>>> the HANDSHAKE family which in turn accept()ed tcp sockets. That kind of
>>> construct - assuming I interpreted it correctly - did not sound right
>>> to me.
>>> 
>>> Back to these patches, they looks sane to me, even if the whole
>>> architecture is a bit hard to follow, given the non trivial cross
>>> references between the patches - I can likely have missed some relevant
>>> point.
>> 
>> One of the original goals was to support other security protocols
>> besides TLS v1.3, which is why the code is split between two
>> patches. I know that is cumbersome for some review workflows.
>> 
>> Now is a good time to simplify, if we see a sensible opportunity
>> to do so.
>> 
>> 
>>> I'm wondering if this approach scales well enough with the number of
>>> concurrent handshakes: the single list looks like a potential bottle-
>>> neck.
>> 
>> It's not clear how much scaling is needed. I don't have a strong
>> sense of how frequently a busy storage server will need a handshake,
>> for instance, but it seems like it would be relatively less frequent
>> than, say, I/O. Network storage connections are typically long-lived,
>> unlike http.
>> 
> 
> So this is for storage type traffic only? Assuming TCP/NVME probably.
> IOW, is it worth doing the handshake via the kernel for a short flow?
> We have done some analysis and TLS handshake certainly affects overall
> performance, so moving it into the kernel is a good idea. Question is
> how would this interact with a) KTLS b) KTLS offload c) user space
> type of crypto needs like quic, etc?

I believe a summary of this thread is due.

---

This is for any in-kernel transport security layer consumer.
It is expressly not for the purpose of adding a handshake
facility that can be used directly by user space consumers.

Today we have several in-kernel consumers ready to use it,
including SunRPC-with-TLS, NVMe/TCP with TLS, and SMB with
QUIC (that one is less far along because the Linux kernel
doesn't have a QUIC implementation yet).

They all happen to be storage-related, but I believe that
is coincidental. It is possible that once transport layer
security is easily available to kernel consumers, other
usage scenarios will appear.

But networked storage connections are, on average, long-
lived. We believe that session set-up overhead for such
consumers will not be as significant as the increased
latency and compute resources used by the security layer's
record protocols.

---

We would prefer an in-kernel handshake implementation, and
for TLSv1.3 in particular, we believe everything is already
available in-kernel (ciphers, certificate management, and so
on) except for the TLS handshake protocol engine. So maybe
not a heavy lift technically, especially because the TLSv1.3
handshake subprotocol is simpler than the ones in past
versions.

We know of at least one out-of-tree in-kernel handshake
implementation, so it's definitely feasible. The purpose of
that implementation is scalable handshake performance for
user space consumers (like web front ends) so what you say
about potential better performance is certainly plausible.
But it's not the purpose of what I'm doing here.

---

Most pertinently, however, the kernel security folks have
stated that an in-kernel handshake is out of the question
because it would be yet another handshake implementation to
maintain and keep secure. They greatly prefer that these
kinds of things be done in user space using a well-vetted
library implementation.

As an example, our upcall TLSv1.3 handshake prototype has
chosen to use GnuTLS. Once the prototype handshake mechanism
has established a TLS session, the user space library
configures kTLS on the kernel's socket. Thus, our prototype
enables in-kernel TLS consumers to take advantage of both
software ciphers and NIC offload today via kTLS (thanks to
Boris and his team).
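
To make the kTLS step above concrete, here is a minimal sketch of how a
user space agent might hand negotiated keys to the kernel once the
library has finished the handshake. It is not taken from the prototype.
The names hs_tls_keys, hs_fill_crypto_info and hs_enable_ktls are
hypothetical, and TLSv1.3 with AES-128-GCM is an assumed cipher choice.
Only the setsockopt() options (TCP_ULP, TLS_TX, TLS_RX) and the
structures from <linux/tls.h> are existing kernel interfaces.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef TCP_ULP
#define TCP_ULP 31	/* value from linux/tcp.h, for older libc headers */
#endif
#ifndef SOL_TLS
#define SOL_TLS 282	/* value from linux/socket.h, for older libc headers */
#endif

/* Hypothetical container for one direction's secrets, as exported by
 * whatever TLS library ran the handshake. */
struct hs_tls_keys {
	unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
	/* 12-byte write IV: 4 bytes of salt followed by 8 bytes of IV */
	unsigned char iv[TLS_CIPHER_AES_GCM_128_SALT_SIZE +
			 TLS_CIPHER_AES_GCM_128_IV_SIZE];
	unsigned char seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
};

/* Translate the library's view of the keys into the kTLS descriptor. */
static void hs_fill_crypto_info(struct tls12_crypto_info_aes_gcm_128 *ci,
				const struct hs_tls_keys *k)
{
	assert(ci != NULL && k != NULL);

	memset(ci, 0, sizeof(*ci));
	ci->info.version = TLS_1_3_VERSION;
	ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
	memcpy(ci->key, k->key, sizeof(ci->key));
	memcpy(ci->salt, k->iv, sizeof(ci->salt));
	memcpy(ci->iv, k->iv + sizeof(ci->salt), sizeof(ci->iv));
	memcpy(ci->rec_seq, k->seq, sizeof(ci->rec_seq));
}

/* Attach the "tls" ULP to a connected TCP socket and install the keys
 * for both directions. Returns 0 on success, -1 with errno set on error. */
static int hs_enable_ktls(int fd, const struct hs_tls_keys *tx,
			  const struct hs_tls_keys *rx)
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
		return -1;

	hs_fill_crypto_info(&ci, tx);
	if (setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci)) < 0)
		return -1;

	hs_fill_crypto_info(&ci, rx);
	if (setsockopt(fd, SOL_TLS, TLS_RX, &ci, sizeof(ci)) < 0)
		return -1;

	return 0;
}
```

In this sketch the agent would call hs_enable_ktls() on the file
descriptor it received from the kernel and then report completion, at
which point the in-kernel consumer sends and receives plaintext while
the record protocol runs in software or on the NIC.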

---

Lastly, there are potentially other transport security layer
protocols aside from TLSv1.3 that will need similar treatment.
Jakub is working on one now, I believe.

I would expect an uphill battle for getting each one of those
added to the kernel.

---

So the solution we have derived is an upcall mechanism that
will be at least a stop-gap, but may become a longer-term
solution depending on the politics of getting handshake
protocol implementations into kernel space. I believe that
activity is also ongoing for TLSv1.3 at least, but it seems
to me like a more distant solution.

We have immediate cloud provider demand for both the RPC and
NVMe usage scenarios with TLSv1.3, and the SMB service in
Azure implements QUIC on Windows (which re-uses TLSv1.3's
handshake protocol). Thus I'd like to see a handshake
facility made available as soon as we can muster a sensible
one.


--
Chuck Lever
Jakub Kicinski Feb. 13, 2023, 9:40 p.m. UTC | #20
On Sat, 11 Feb 2023 20:55:58 +0000 Chuck Lever III wrote:
> Based on this reply I was unsure whether you wanted an English
> spec (similar to an Internet Draft) or a machine-readable one.

I meant machine-readable.

> But now that I look at these, I think I get it: you'd like a
> YAML file that can be used with tools to either generate a
> parser or maybe do some correctness analysis.
> 
> I think others will benefit as more security protocols come
> to this party, so it's a good thing to do for extensibility.

Yup, it's great for parsers and we also plan to add syzbot metadata 
to it with semantics of the fields.

> I will look into this for v5 definitely and maybe v4. v4
> already has significant churn...
Jakub Kicinski Feb. 13, 2023, 9:55 p.m. UTC | #21
On Sat, 11 Feb 2023 13:11:02 +0100 Hannes Reinecke wrote:
> > Would you be able to write the spec for it? I'm happy to help with that
> > as I mentioned. Perhaps you have the user space already hand-written
> > here but in case the mechanism/family gets reused it'd be sad if people
> > had to hand write bindings for other programming languages.  
> 
> Can you send me a pointer to the YAML specification (and parser)?
> I couldn't find anything in the linux sources; but maybe I'm looking in 
> the wrong tree or somesuch.

The ready-for-consumption specs are only in net-next, but the user
space code gen did not end up there (yet?) 

I pushed an old branch where I had started typing up user space C
code gen:

https://github.com/kuba-moo/ynl/tree/yaml-ynl-c-wip

It's an old branch taken out of trash so there's a lot of unrelated
garbage :(  The only stuff of note would be what's under tools/net/ynl/

Patch

diff --git a/include/net/handshake.h b/include/net/handshake.h
new file mode 100644
index 000000000000..a439d823e828
--- /dev/null
+++ b/include/net/handshake.h
@@ -0,0 +1,37 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * NETLINK_HANDSHAKE service.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are visible only within the
+ * kernel are declared here.
+ */
+
+#ifndef _NET_HANDSHAKE_H
+#define _NET_HANDSHAKE_H
+
+struct handshake_info {
+	struct list_head	hi_list;
+
+	struct socket		*hi_sock;
+	int			hi_fd;
+	int			hi_mcgrp;
+	int			hi_protocol;
+
+	struct sk_buff		*(*hi_accept)(struct handshake_info *hsi,
+					      struct sk_buff *skb,
+					      struct nlmsghdr *nlh);
+	void			(*hi_done)(struct handshake_info *hsi,
+					   struct sk_buff *skb,
+					   struct nlmsghdr *nlh,
+					   struct nlattr *args);
+};
+
+int handshake_request(struct handshake_info *hsi, gfp_t flags);
+
+#endif /* _NET_HANDSHAKE_H */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 8c3587d5c308..88fd0442249c 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -105,6 +105,7 @@  struct net {
 	struct sock		*genl_sock;
 
 	struct uevent_sock	*uevent_sock;		/* uevent socket */
+	struct sock		*hs_sock;		/* handshake requests */
 
 	struct hlist_head 	*dev_name_head;
 	struct hlist_head	*dev_index_head;
diff --git a/include/net/sock.h b/include/net/sock.h
index e0517ecc6531..de0510306e28 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -515,6 +515,7 @@  struct sock {
 
 	struct socket		*sk_socket;
 	void			*sk_user_data;
+	void			*sk_handshake_info;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
 #endif
diff --git a/include/uapi/linux/handshake.h b/include/uapi/linux/handshake.h
new file mode 100644
index 000000000000..39cab687eece
--- /dev/null
+++ b/include/uapi/linux/handshake.h
@@ -0,0 +1,65 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * NETLINK_HANDSHAKE service.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are visible to user space are
+ * declared here. This file constitutes an API contract between the
+ * Linux kernel and user space.
+ */
+
+#ifndef _UAPI_LINUX_HANDSHAKE_H
+#define _UAPI_LINUX_HANDSHAKE_H
+
+/* Multicast Netlink socket groups */
+enum handshake_nlgrps {
+	HANDSHAKE_NLGRP_NONE = 0,
+	__HANDSHAKE_NLGRP_MAX
+};
+#define HSNLGRP_MAX	(__HANDSHAKE_NLGRP_MAX - 1)
+
+enum handshake_nl_msgs {
+	HANDSHAKE_NL_MSG_BASE = NLMSG_MIN_TYPE,
+	HANDSHAKE_NL_MSG_READY,
+	HANDSHAKE_NL_MSG_ACCEPT,
+	HANDSHAKE_NL_MSG_DONE,
+	__HANDSHAKE_NL_MSG_MAX
+};
+#define HANDSHAKE_NL_MSG_MAX	(__HANDSHAKE_NL_MSG_MAX - 1)
+
+enum handshake_nl_attrs {
+	HANDSHAKE_NL_ATTR_UNSPEC = 0,
+	HANDSHAKE_NL_ATTR_MSG_STATUS,
+	HANDSHAKE_NL_ATTR_SOCKFD,
+	HANDSHAKE_NL_ATTR_PROTOCOL,
+	HANDSHAKE_NL_ATTR_ACCEPT_RESP,
+	HANDSHAKE_NL_ATTR_DONE_ARGS,
+
+	__HANDSHAKE_NL_ATTR_MAX
+};
+#define HANDSHAKE_NL_ATTR_MAX	(__HANDSHAKE_NL_ATTR_MAX - 1)
+
+enum handshake_nl_status {
+	HANDSHAKE_NL_STATUS_OK = 0,
+	HANDSHAKE_NL_STATUS_INVAL,
+	HANDSHAKE_NL_STATUS_BADF,
+	HANDSHAKE_NL_STATUS_NOTREADY,
+	HANDSHAKE_NL_STATUS_SYSTEMFAULT,
+};
+
+enum handshake_nl_protocol {
+	HANDSHAKE_NL_PROTO_UNSPEC = 0,
+};
+
+enum handshake_nl_tls_session_status {
+	HANDSHAKE_NL_TLS_SESS_STATUS_OK = 0,	/* session established */
+	HANDSHAKE_NL_TLS_SESS_STATUS_FAULT,	/* failure to launch */
+	HANDSHAKE_NL_TLS_SESS_STATUS_REJECTED,	/* remote hates us */
+};
+
+#endif /* _UAPI_LINUX_HANDSHAKE_H */
diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index e2ae82e3f9f7..a29b2db5fa8a 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -29,6 +29,7 @@ 
 #define NETLINK_RDMA		20
 #define NETLINK_CRYPTO		21	/* Crypto layer */
 #define NETLINK_SMC		22	/* SMC monitoring */
+#define NETLINK_HANDSHAKE	23	/* transport layer sec handshake requests */
 
 #define NETLINK_INET_DIAG	NETLINK_SOCK_DIAG
 
diff --git a/net/Makefile b/net/Makefile
index 6a62e5b27378..c1bb53f00486 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -78,3 +78,4 @@  obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
 obj-$(CONFIG_MPTCP)		+= mptcp/
 obj-$(CONFIG_MCTP)		+= mctp/
+obj-y				+= handshake/
diff --git a/net/handshake/Makefile b/net/handshake/Makefile
new file mode 100644
index 000000000000..b27400c01427
--- /dev/null
+++ b/net/handshake/Makefile
@@ -0,0 +1,11 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for the NETLINK_HANDSHAKE service
+#
+# Author: Chuck Lever <chuck.lever@oracle.com>
+#
+# Copyright (c) 2023, Oracle and/or its affiliates.
+#
+
+obj-y += handshake.o
+handshake-y := netlink.o
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
new file mode 100644
index 000000000000..49e05fa34df3
--- /dev/null
+++ b/net/handshake/netlink.c
@@ -0,0 +1,320 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * NETLINK_HANDSHAKE service
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/inet.h>
+
+#include <net/sock.h>
+#include <net/netlink.h>
+#include <net/handshake.h>
+
+#include <uapi/linux/handshake.h>
+
+static DEFINE_SPINLOCK(handshake_lock);
+static LIST_HEAD(handshake_pending);
+
+/*
+ * Send a "ready" notification to the multicast group for this
+ * security handshake type, in the same net namespace as @hi_sock.
+ */
+static int handshake_notify(struct handshake_info *hsi, gfp_t flags)
+{
+	struct net *net = sock_net(hsi->hi_sock->sk);
+	struct sock *nls = net->hs_sock;
+	struct sk_buff *msg;
+	struct nlmsghdr *nlh;
+	int err;
+
+	if (!netlink_has_listeners(nls, hsi->hi_mcgrp))
+		return -ESRCH;
+
+	err = -ENOMEM;
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, flags);
+	if (!msg)
+		goto out_err;
+	nlh = nlmsg_put(msg, 0, 0, HANDSHAKE_NL_MSG_READY, 0, 0);
+	if (!nlh)
+		goto out_err;
+	nlmsg_end(msg, nlh);
+
+	return nlmsg_notify(nls, msg, 0, hsi->hi_mcgrp, nlmsg_report(nlh),
+			    flags);
+
+out_err:
+	nlmsg_free(msg);
+	return err;
+}
+
+/**
+ * handshake_request - consumer API to request a handshake
+ * @hsi: socket and callback information
+ * @flags: memory allocation flags
+ *
+ * Return values:
+ *   %0: Request queued
+ *   %-ESRCH: No user space HANDSHAKE listeners
+ *   %-ENOMEM: Memory allocation failed
+ */
+int handshake_request(struct handshake_info *hsi, gfp_t flags)
+{
+	int ret;
+
+	spin_lock(&handshake_lock);
+	list_add(&hsi->hi_list, &handshake_pending);
+	hsi->hi_sock->sk->sk_handshake_info = NULL;
+	spin_unlock(&handshake_lock);
+
+	/* XXX: racy */
+	ret = handshake_notify(hsi, flags);
+	if (ret) {
+		spin_lock(&handshake_lock);
+		if (!list_empty(&hsi->hi_list))
+			list_del_init(&hsi->hi_list);
+		spin_unlock(&handshake_lock);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(handshake_request);
+
+static int handshake_accept(struct socket *sock)
+{
+	int flags = O_CLOEXEC;
+	struct file *file;
+	int fd;
+
+	fd = get_unused_fd_flags(flags);
+	if (fd < 0)
+		return fd;
+	file = sock_alloc_file(sock, flags, sock->sk->sk_prot_creator->name);
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		return PTR_ERR(file);
+	}
+
+	fd_install(fd, file);
+	return fd;
+}
+
+static const struct nla_policy
+handshake_nl_attr_policy[HANDSHAKE_NL_ATTR_MAX + 1] = {
+	[HANDSHAKE_NL_ATTR_MSG_STATUS] = {
+		.type = NLA_U32
+	},
+	[HANDSHAKE_NL_ATTR_SOCKFD] = {
+		.type = NLA_U32
+	},
+	[HANDSHAKE_NL_ATTR_PROTOCOL] = {
+		.type = NLA_U32
+	},
+	[HANDSHAKE_NL_ATTR_ACCEPT_RESP] = {
+		.type = NLA_NESTED,
+	},
+	[HANDSHAKE_NL_ATTR_DONE_ARGS] = {
+		.type = NLA_NESTED,
+	},
+};
+
+static int handshake_nl_msg_unsupp(struct sk_buff *skb, struct nlmsghdr *nlh,
+				   struct nlattr **tb)
+{
+	pr_err("Handshake: Unknown command (%d) was ignored\n", nlh->nlmsg_type);
+	return -EINVAL;
+}
+
+static int handshake_nl_status_reply(struct sk_buff *skb, struct nlmsghdr *nlh,
+				     enum handshake_nl_status status)
+{
+	struct net *net = sock_net(skb->sk);
+	struct nlmsghdr *hdr;
+	struct sk_buff *msg;
+	int ret;
+
+	ret = -ENOMEM;
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		goto out;
+	hdr = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
+			nlh->nlmsg_type, 0, 0);
+	if (!hdr)
+		goto out_free;
+
+	ret = -EMSGSIZE;
+	ret = nla_put_u32(msg, HANDSHAKE_NL_ATTR_MSG_STATUS, status);
+	if (ret < 0)
+		goto out_free;
+
+	nlmsg_end(msg, hdr);
+	return nlmsg_unicast(net->hs_sock, msg, NETLINK_CB(skb).portid);
+
+out_free:
+	nlmsg_free(msg);
+out:
+	return ret;
+}
+
+static int handshake_nl_msg_accept(struct sk_buff *skb, struct nlmsghdr *nlh,
+				   struct nlattr **tb)
+{
+	struct net *net = sock_net(skb->sk);
+	struct handshake_info *pos, *hsi;
+	struct sk_buff *msg;
+	int protocol;
+
+	if (!tb[HANDSHAKE_NL_ATTR_PROTOCOL])
+		return handshake_nl_status_reply(skb, nlh,
+						 HANDSHAKE_NL_STATUS_INVAL);
+	protocol = nla_get_u32(tb[HANDSHAKE_NL_ATTR_PROTOCOL]);
+
+	hsi = NULL;
+	spin_lock(&handshake_lock);
+	list_for_each_entry(pos, &handshake_pending, hi_list) {
+		if (sock_net(pos->hi_sock->sk) != net)
+			continue;
+		if (pos->hi_protocol != protocol)
+			continue;
+
+		list_del_init(&pos->hi_list);
+		hsi = pos;
+		break;
+	}
+	spin_unlock(&handshake_lock);
+	if (!hsi)
+		return handshake_nl_status_reply(skb, nlh,
+						 HANDSHAKE_NL_STATUS_NOTREADY);
+
+	hsi->hi_fd = handshake_accept(hsi->hi_sock);
+	if (hsi->hi_fd < 0)
+		return handshake_nl_status_reply(skb, nlh,
+						 HANDSHAKE_NL_STATUS_SYSTEMFAULT);
+
+	msg = hsi->hi_accept(hsi, skb, nlh);
+	if (IS_ERR(msg))
+		return PTR_ERR(msg);
+
+	hsi->hi_sock->sk->sk_handshake_info = hsi;
+	return nlmsg_unicast(net->hs_sock, msg, NETLINK_CB(skb).portid);
+}
+
+/*
+ * This function is careful to not close the socket. It merely removes
+ * it from the file descriptor table so that it is no longer visible
+ * to the calling process.
+ */
+static int handshake_nl_msg_done(struct sk_buff *skb, struct nlmsghdr *nlh,
+				 struct nlattr **tb)
+{
+	struct handshake_info *hsi;
+	struct socket *sock;
+	int fd, err;
+
+	if (!tb[HANDSHAKE_NL_ATTR_SOCKFD])
+		return handshake_nl_status_reply(skb, nlh,
+						 HANDSHAKE_NL_STATUS_INVAL);
+
+	err = 0;
+	fd = nla_get_u32(tb[HANDSHAKE_NL_ATTR_SOCKFD]);
+	sock = sockfd_lookup(fd, &err);
+	if (err)
+		return handshake_nl_status_reply(skb, nlh,
+						 HANDSHAKE_NL_STATUS_BADF);
+
+	put_unused_fd(fd);
+
+	hsi = sock->sk->sk_handshake_info;
+	if (hsi) {
+		hsi->hi_done(hsi, skb, nlh, tb[HANDSHAKE_NL_ATTR_DONE_ARGS]);
+		sock->sk->sk_handshake_info = NULL;
+	}
+	return 0;
+}
+
+static const struct handshake_link {
+	int (*doit)(struct sk_buff *skb, struct nlmsghdr *nlh,
+		    struct nlattr **tb);
+} handshake_dispatch[] = {
+	[HANDSHAKE_NL_MSG_ACCEPT - HANDSHAKE_NL_MSG_BASE] = {
+		.doit	= handshake_nl_msg_accept,
+	},
+	[HANDSHAKE_NL_MSG_DONE - HANDSHAKE_NL_MSG_BASE] = {
+		.doit	= handshake_nl_msg_done,
+	},
+};
+
+static int handshake_nl_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
+				struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[HANDSHAKE_NL_ATTR_MAX + 1];
+	const struct handshake_link *link;
+	int err;
+
+	if (!netlink_net_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (nlh->nlmsg_type > HANDSHAKE_NL_MSG_MAX)
+		return handshake_nl_msg_unsupp(skb, nlh, tb);
+	link = &handshake_dispatch[nlh->nlmsg_type - HANDSHAKE_NL_MSG_BASE];
+	if (!link->doit)
+		return handshake_nl_msg_unsupp(skb, nlh, tb);
+
+	err = nlmsg_parse(nlh, 0, tb, HANDSHAKE_NL_ATTR_MAX,
+			  handshake_nl_attr_policy, extack);
+	if (err < 0)
+		return err;
+
+	return link->doit(skb, nlh, tb);
+}
+
+static void handshake_nl_rcv(struct sk_buff *skb)
+{
+	static DEFINE_MUTEX(handshake_mutex);
+
+	mutex_lock(&handshake_mutex);
+	netlink_rcv_skb(skb, &handshake_nl_rcv_skb);
+	mutex_unlock(&handshake_mutex);
+}
+
+static int __net_init handshake_nl_net_init(struct net *net)
+{
+	struct netlink_kernel_cfg cfg = {
+		.groups		= HSNLGRP_MAX,
+		.input		= handshake_nl_rcv,
+	};
+
+	net->hs_sock = netlink_kernel_create(net, NETLINK_HANDSHAKE, &cfg);
+	return net->hs_sock ? 0 : -ENOMEM;
+}
+
+static void __net_exit handshake_nl_net_exit(struct net *net)
+{
+	netlink_kernel_release(net->hs_sock);
+	net->hs_sock = NULL;
+}
+
+static struct pernet_operations handshake_nl_net_ops = {
+	.init		= handshake_nl_net_init,
+	.exit		= handshake_nl_net_exit,
+};
+
+static int __init handshake_nl_init(void)
+{
+	return register_pernet_subsys(&handshake_nl_net_ops);
+}
+
+static void __exit handshake_nl_exit(void)
+{
+	unregister_pernet_subsys(&handshake_nl_net_ops);
+}
+
+module_init(handshake_nl_init);
+module_exit(handshake_nl_exit);
diff --git a/tools/include/uapi/linux/netlink.h b/tools/include/uapi/linux/netlink.h
index 0a4d73317759..a269d356f358 100644
--- a/tools/include/uapi/linux/netlink.h
+++ b/tools/include/uapi/linux/netlink.h
@@ -29,6 +29,7 @@ 
 #define NETLINK_RDMA		20
 #define NETLINK_CRYPTO		21	/* Crypto layer */
 #define NETLINK_SMC		22	/* SMC monitoring */
+#define NETLINK_HANDSHAKE	23	/* transport layer sec handshake requests */
 
 #define NETLINK_INET_DIAG	NETLINK_SOCK_DIAG
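
For readers who want to see the wire format this patch implies, the
following is a rough user space sketch of the messages exchanged with
the kernel. It assumes the layout visible in the diff: handshake_nl_rcv_skb()
calls nlmsg_parse() with a header length of 0, so attributes follow the
netlink header directly, and handshake_nl_status_reply() answers with a
single u32 HANDSHAKE_NL_ATTR_MSG_STATUS attribute. The helper names
hs_build_u32_msg and hs_parse_status are hypothetical, and the numeric
values are copied by hand from the uAPI header in this patch because
that header is not installed anywhere. Since the series is marked
Superseded, the interface is likely to differ in later versions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Values copied from the include/uapi/linux/handshake.h hunk in this patch. */
#define HANDSHAKE_NL_MSG_ACCEPT		(NLMSG_MIN_TYPE + 2)
#define HANDSHAKE_NL_ATTR_MSG_STATUS	1
#define HANDSHAKE_NL_ATTR_PROTOCOL	3

/*
 * Hypothetical helper: write a netlink header followed by one u32
 * attribute into buf. Returns the message length, or 0 if buf is too small.
 */
static size_t hs_build_u32_msg(void *buf, size_t buflen, uint16_t msg_type,
			       uint16_t attr_type, uint32_t value, uint32_t seq)
{
	size_t len = NLMSG_HDRLEN + NLA_HDRLEN + NLA_ALIGN(sizeof(value));
	struct nlmsghdr nlh;
	struct nlattr nla;

	assert(buf != NULL);
	if (buflen < len)
		return 0;

	memset(&nlh, 0, sizeof(nlh));
	nlh.nlmsg_len = len;
	nlh.nlmsg_type = msg_type;
	nlh.nlmsg_flags = NLM_F_REQUEST;
	nlh.nlmsg_seq = seq;

	nla.nla_len = NLA_HDRLEN + sizeof(value);
	nla.nla_type = attr_type;

	/* memcpy keeps this safe for buffers of any alignment */
	memset(buf, 0, len);
	memcpy(buf, &nlh, sizeof(nlh));
	memcpy((char *)buf + NLMSG_HDRLEN, &nla, sizeof(nla));
	memcpy((char *)buf + NLMSG_HDRLEN + NLA_HDRLEN, &value, sizeof(value));
	return len;
}

/*
 * Hypothetical helper: walk the attributes of a received message and
 * extract the u32 MSG_STATUS value. Returns 0 if found, -1 otherwise.
 */
static int hs_parse_status(const void *buf, size_t len, uint32_t *status)
{
	struct nlmsghdr nlh;
	struct nlattr nla;
	size_t off = NLMSG_HDRLEN, end;

	if (len < sizeof(nlh))
		return -1;
	memcpy(&nlh, buf, sizeof(nlh));
	if (nlh.nlmsg_len < sizeof(nlh) || nlh.nlmsg_len > len)
		return -1;
	end = nlh.nlmsg_len;

	while (off + NLA_HDRLEN <= end) {
		memcpy(&nla, (const char *)buf + off, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || off + nla.nla_len > end)
			return -1;
		if (nla.nla_type == HANDSHAKE_NL_ATTR_MSG_STATUS &&
		    nla.nla_len == NLA_HDRLEN + sizeof(*status)) {
			memcpy(status, (const char *)buf + off + NLA_HDRLEN,
			       sizeof(*status));
			return 0;
		}
		off += NLA_ALIGN(nla.nla_len);
	}
	return -1;
}
```

An agent would presumably open socket(AF_NETLINK, SOCK_RAW, 23), where
23 is the NETLINK_HANDSHAKE value added by this patch, build an ACCEPT
request with hs_build_u32_msg(buf, sizeof(buf), HANDSHAKE_NL_MSG_ACCEPT,
HANDSHAKE_NL_ATTR_PROTOCOL, 0, seq), send it, and run hs_parse_status()
on the reply. The only protocol value this patch defines is
HANDSHAKE_NL_PROTO_UNSPEC (0), and the receive path requires
CAP_NET_ADMIN, so an unprivileged sender should expect -EPERM.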