[v5,1/2] net/handshake: Create a NETLINK service for handling handshake requests

Message ID: 167726635921.5428.7879951165266317921.stgit@91.116.238.104.host.secureserver.net
State: Superseded
Delegated to: Netdev Maintainers
Series: Another crack at a handshake upcall mechanism

Checks

Context                         Check    Description
netdev/tree_selection           success  Guessed tree name to be net-next, async
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/subject_prefix           warning  Target tree name not specified in the subject
netdev/cover_letter             success  Series has a cover letter
netdev/patch_count              success  Link
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 4553 this patch: 4553
netdev/cc_maintainers           warning  7 maintainers not CCed: chuck.lever@oracle.com linux-trace-kernel@vger.kernel.org mhiramat@kernel.org corbet@lwn.net linux-doc@vger.kernel.org rostedt@goodmis.org davem@davemloft.net
netdev/build_clang              success  Errors and warnings before: 1067 this patch: 1067
netdev/module_param             success  Was 0 now: 0
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 4765 this patch: 4765
netdev/checkpatch               warning  CHECK: Alignment should match open parenthesis CHECK: Lines should not end with a '(' CHECK: Please don't use multiple blank lines CHECK: extern prototypes should be avoided in .h files WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: networking block comments don't use an empty /* line, use /* Comment...
netdev/kdoc                     fail     Errors and warnings before: 0 this patch: 5
netdev/source_inline            success  Was 0 now: 0

Commit Message

Chuck Lever Feb. 24, 2023, 7:19 p.m. UTC
From: Chuck Lever <chuck.lever@oracle.com>

When a kernel consumer needs a transport layer security session, it
first needs a handshake to negotiate and establish a session. This
negotiation can be done in user space via one of the several
existing library implementations, or it can be done in the kernel.

No in-kernel handshake implementations yet exist. In their absence,
we add a netlink service that can:

a. Notify a user space daemon that a handshake is needed.

b. Once notified, the daemon calls the kernel back via this
   netlink service to get the handshake parameters, including an
   open socket on which to establish the session.

c. Once the handshake is complete, the daemon reports the
   session status and other information via a second netlink
   operation. This operation marks that it is safe for the
   kernel to use the open socket and the security session
   established there.
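
Steps a to c above imply a small per-request lifecycle. The following is
only a user-space sketch of that lifecycle, not code from the patch: the
names hs_state, hs_request, hs_accept(), hs_done(), hs_socket_returned()
and hs_session_established() are all hypothetical, and the patch itself
tracks less state than this.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lifecycle states, one per step in the list above. */
enum hs_state {
	HS_PENDING,	/* step a: daemon notified, request queued */
	HS_ACCEPTED,	/* step b: daemon holds the socket */
	HS_COMPLETED,	/* step c: daemon reported a result */
};

struct hs_request {
	enum hs_state	state;
	int		status;	/* 0 on success, negative errno on failure */
};

/* Step b: hand the request to the daemon. Only a pending request qualifies. */
int hs_accept(struct hs_request *req)
{
	assert(req);
	if (req->state != HS_PENDING)
		return -EBUSY;
	req->state = HS_ACCEPTED;
	return 0;
}

/* Step c: record the daemon's result. Only an accepted request qualifies. */
int hs_done(struct hs_request *req, int status)
{
	assert(req);
	if (req->state != HS_ACCEPTED)
		return -EINVAL;
	req->status = status;
	req->state = HS_COMPLETED;
	return 0;
}

/* The daemon has given the socket back, whether or not it succeeded. */
bool hs_socket_returned(const struct hs_request *req)
{
	return req->state == HS_COMPLETED;
}

/* The socket is back and a session was established on it. */
bool hs_session_established(const struct hs_request *req)
{
	return req->state == HS_COMPLETED && req->status == 0;
}
```

Keeping "socket returned" separate from "session established" reflects
that the final operation both releases the socket and reports status.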

The notification service uses a multicast group. Each handshake
mechanism (e.g., tlshd) adopts its own group number so that the
handshake services are completely independent of one another. The
kernel can then tell via netlink_has_listeners() whether a handshake
service is active and prepared to handle a handshake request.
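
To illustrate the listener check, here is a user-space model of "one
group per handler class". It does not use netlink at all. The bitmap,
HS_MAX_CLASSES, hs_join() and hs_notify() are hypothetical names for this
sketch; only the -ESRCH result for "nobody is listening" is taken from
the notify path in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define HS_MAX_CLASSES 32u	/* hypothetical limit for this sketch */

/* One bit per handler class; a set bit means a daemon joined that group. */
struct hs_listeners {
	uint32_t joined;
};

/* A daemon for @handler_class starts listening. */
void hs_join(struct hs_listeners *l, unsigned int handler_class)
{
	assert(handler_class < HS_MAX_CLASSES);
	l->joined |= UINT32_C(1) << handler_class;
}

/* Notify one class. Without a listener in that group, fail with -ESRCH. */
int hs_notify(const struct hs_listeners *l, unsigned int handler_class)
{
	if (handler_class >= HS_MAX_CLASSES)
		return -EINVAL;
	if (!(l->joined & (UINT32_C(1) << handler_class)))
		return -ESRCH;
	return 0;	/* a real implementation would multicast here */
}
```

Because each class has its own bit, a daemon for one mechanism never
makes a request for a different mechanism look serviceable.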

A new netlink operation, ACCEPT, acts like accept(2) in that it
instantiates a file descriptor in the user space daemon's fd table.
If this operation is successful, the reply carries the fd number,
which can be treated as an open and ready file descriptor.
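
A rough user-space analogy for what ACCEPT hands to the daemon is dup(2):
a second descriptor that refers to the same open socket. The sketch below
assumes a socketpair() stands in for the kernel's socket, and
hs_dup_demo() is a hypothetical name. In the patch the new descriptor is
installed by the kernel in the daemon's fd table, not by the daemon.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 0 if bytes sent to a socket can be read through a duplicate
 * descriptor for that socket, -1 otherwise.
 */
int hs_dup_demo(void)
{
	char buf[8] = { 0 };
	int sv[2], newfd, ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Second descriptor for the same open socket. */
	newfd = dup(sv[0]);
	if (newfd < 0)
		goto out;
	assert(newfd != sv[0]);

	/* The patch requests O_CLOEXEC; with dup() that is a separate step. */
	if (fcntl(newfd, F_SETFD, FD_CLOEXEC) < 0)
		goto out_dup;

	/* Data written by the peer arrives on the duplicate as well. */
	if (write(sv[1], "hello", 5) != 5)
		goto out_dup;
	if (read(newfd, buf, sizeof(buf)) != 5)
		goto out_dup;
	if (memcmp(buf, "hello", 5) == 0)
		ret = 0;

out_dup:
	close(newfd);
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Closing the duplicate does not close the underlying socket, which is why
the original owner can keep using it afterwards.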

While user space is performing the handshake, the kernel keeps its
muddy paws off the open socket. A second new netlink operation,
DONE, indicates that the user space daemon is finished with the
socket and it is safe for the kernel to use again. The operation
also indicates whether a session was established successfully.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 Documentation/netlink/specs/handshake.yaml |  134 +++++++++++
 include/net/handshake.h                    |   45 ++++
 include/net/net_namespace.h                |    5 
 include/net/sock.h                         |    1 
 include/trace/events/handshake.h           |  159 +++++++++++++
 include/uapi/linux/handshake.h             |   63 +++++
 net/Makefile                               |    1 
 net/handshake/Makefile                     |   11 +
 net/handshake/handshake.h                  |   41 +++
 net/handshake/netlink.c                    |  340 ++++++++++++++++++++++++++++
 net/handshake/request.c                    |  246 ++++++++++++++++++++
 net/handshake/trace.c                      |   17 +
 12 files changed, 1063 insertions(+)
 create mode 100644 Documentation/netlink/specs/handshake.yaml
 create mode 100644 include/net/handshake.h
 create mode 100644 include/trace/events/handshake.h
 create mode 100644 include/uapi/linux/handshake.h
 create mode 100644 net/handshake/Makefile
 create mode 100644 net/handshake/handshake.h
 create mode 100644 net/handshake/netlink.c
 create mode 100644 net/handshake/request.c
 create mode 100644 net/handshake/trace.c

Comments

Hannes Reinecke Feb. 27, 2023, 9:24 a.m. UTC | #1
On 2/24/23 20:19, Chuck Lever wrote:
> 
> diff --git a/Documentation/netlink/specs/handshake.yaml b/Documentation/netlink/specs/handshake.yaml
> new file mode 100644
> index 000000000000..683a8f2df0a7
> --- /dev/null
> +++ b/Documentation/netlink/specs/handshake.yaml
> @@ -0,0 +1,134 @@
> +# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
> +#
> +# GENL HANDSHAKE service.
> +#
> +# Author: Chuck Lever <chuck.lever@oracle.com>
> +#
> +# Copyright (c) 2023, Oracle and/or its affiliates.
> +#
> +
> +name: handshake
> +
> +protocol: genetlink-c
> +
> +doc: Netlink protocol to request a transport layer security handshake.
> +
> +uapi-header: linux/net/handshake.h
> +
> +definitions:
> +  -
> +    type: enum
> +    name: handler-class
> +    enum-name:
> +    value-start: 0
> +    entries: [ none ]
> +  -
> +    type: enum
> +    name: msg-type
> +    enum-name:
> +    value-start: 0
> +    entries: [ unspec, clienthello, serverhello ]
> +  -
> +    type: enum
> +    name: auth
> +    enum-name:
> +    value-start: 0
> +    entries: [ unspec, unauth, x509, psk ]
> +
> +attribute-sets:
> +  -
> +    name: accept
> +    attributes:
> +      -
> +        name: status
> +        doc: Status of this accept operation
> +        type: u32
> +        value: 1
> +      -
> +        name: sockfd
> +        doc: File descriptor of socket to use
> +        type: u32
> +      -
> +        name: handler-class
> +        doc: Which type of handler is responding
> +        type: u32
> +        enum: handler-class
> +      -
> +        name: message-type
> +        doc: Handshake message type
> +        type: u32
> +        enum: msg-type
> +      -
> +        name: auth
> +        doc: Authentication mode
> +        type: u32
> +        enum: auth
> +      -
> +        name: gnutls-priorities
> +        doc: GnuTLS priority string
> +        type: string
> +      -
> +        name: my-peerid
> +        doc: Serial no of key containing local identity
> +        type: u32
> +      -
> +        name: my-privkey
> +        doc: Serial no of key containing optional private key
> +        type: u32
> +  -
> +    name: done
> +    attributes:
> +      -
> +        name: status
> +        doc: Session status
> +        type: u32
> +        value: 1
> +      -
> +        name: sockfd
> +        doc: File descriptor of socket that has completed
> +        type: u32
> +      -
> +        name: remote-peerid
> +        doc: Serial no of keys containing identities of remote peer
> +        type: u32
> +
> +operations:
> +  list:
> +    -
> +      name: ready
> +      doc: Notify handlers that a new handshake request is waiting
> +      value: 1
> +      notify: accept
> +    -
> +      name: accept
> +      doc: Handler retrieves next queued handshake request
> +      attribute-set: accept
> +      flags: [ admin-perm ]
> +      do:
> +        request:
> +          attributes:
> +            - handler-class
> +        reply:
> +          attributes:
> +            - status
> +            - sockfd
> +            - message-type
> +            - auth
> +            - gnutls-priorities
> +            - my-peerid
> +            - my-privkey
> +    -
> +      name: done
> +      doc: Handler reports handshake completion
> +      attribute-set: done
> +      do:
> +        request:
> +          attributes:
> +            - status
> +            - sockfd
> +            - remote-peerid
> +
> +mcast-groups:
> +  list:
> +    -
> +      name: none
> diff --git a/include/net/handshake.h b/include/net/handshake.h
> new file mode 100644
> index 000000000000..08f859237936
> --- /dev/null
> +++ b/include/net/handshake.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Generic HANDSHAKE service.
> + *
> + * Author: Chuck Lever <chuck.lever@oracle.com>
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +/*
> + * Data structures and functions that are visible only within the
> + * kernel are declared here.
> + */
> +
> +#ifndef _NET_HANDSHAKE_H
> +#define _NET_HANDSHAKE_H
> +
> +struct handshake_req;
> +
> +/*
> + * Invariants for all handshake requests for one transport layer
> + * security protocol
> + */
> +struct handshake_proto {
> +	int			hp_handler_class;
> +	size_t			hp_privsize;
> +
> +	int			(*hp_accept)(struct handshake_req *req,
> +					     struct genl_info *gi, int fd);
> +	void			(*hp_done)(struct handshake_req *req,
> +					   int status, struct nlattr **tb);
> +	void			(*hp_destroy)(struct handshake_req *req);
> +};
> +
> +extern struct handshake_req *
> +handshake_req_alloc(struct socket *sock, const struct handshake_proto *proto,
> +		    gfp_t flags);
> +extern void *handshake_req_private(struct handshake_req *req);
> +extern int handshake_req_submit(struct handshake_req *req, gfp_t flags);
> +extern int handshake_req_cancel(struct socket *sock);
> +
> +extern struct nlmsghdr *handshake_genl_put(struct sk_buff *msg,
> +					   struct genl_info *gi);
> +
> +#endif /* _NET_HANDSHAKE_H */
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 78beaa765c73..a0ce9de4dab1 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -188,6 +188,11 @@ struct net {
>   #if IS_ENABLED(CONFIG_SMC)
>   	struct netns_smc	smc;
>   #endif
> +
> +	/* transport layer security handshake requests */
> +	spinlock_t		hs_lock;
> +	struct list_head	hs_requests;
> +	int			hs_pending;
>   } __randomize_layout;
>   
>   #include <linux/seq_file_net.h>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 573f2bf7e0de..2a7345ce2540 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -519,6 +519,7 @@ struct sock {
>   
>   	struct socket		*sk_socket;
>   	void			*sk_user_data;
> +	void			*sk_handshake_req;
>   #ifdef CONFIG_SECURITY
>   	void			*sk_security;
>   #endif
> diff --git a/include/trace/events/handshake.h b/include/trace/events/handshake.h
> new file mode 100644
> index 000000000000..feffcd1d6256
> --- /dev/null
> +++ b/include/trace/events/handshake.h
> @@ -0,0 +1,159 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM handshake
> +
> +#if !defined(_TRACE_HANDSHAKE_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_HANDSHAKE_H
> +
> +#include <linux/net.h>
> +#include <linux/tracepoint.h>
> +
> +DECLARE_EVENT_CLASS(handshake_event_class,
> +	TP_PROTO(
> +		const struct net *net,
> +		const struct handshake_req *req,
> +		const struct socket *sock
> +	),
> +	TP_ARGS(net, req, sock),
> +	TP_STRUCT__entry(
> +		__field(const void *, req)
> +		__field(const void *, sock)
> +		__field(unsigned int, netns_ino)
> +	),
> +	TP_fast_assign(
> +		__entry->req = req;
> +		__entry->sock = sock;
> +		__entry->netns_ino = net->ns.inum;
> +	),
> +	TP_printk("req=%p sock=%p",
> +		__entry->req, __entry->sock
> +	)
> +);
> +#define DEFINE_HANDSHAKE_EVENT(name)				\
> +	DEFINE_EVENT(handshake_event_class, name,		\
> +		TP_PROTO(					\
> +			const struct net *net,			\
> +			const struct handshake_req *req,	\
> +			const struct socket *sock		\
> +		),						\
> +		TP_ARGS(net, req, sock))
> +
> +DECLARE_EVENT_CLASS(handshake_fd_class,
> +	TP_PROTO(
> +		const struct net *net,
> +		const struct handshake_req *req,
> +		const struct socket *sock,
> +		int fd
> +	),
> +	TP_ARGS(net, req, sock, fd),
> +	TP_STRUCT__entry(
> +		__field(const void *, req)
> +		__field(const void *, sock)
> +		__field(int, fd)
> +		__field(unsigned int, netns_ino)
> +	),
> +	TP_fast_assign(
> +		__entry->req = req;
> +		__entry->sock = req->hr_sock;
> +		__entry->fd = fd;
> +		__entry->netns_ino = net->ns.inum;
> +	),
> +	TP_printk("req=%p sock=%p fd=%d",
> +		__entry->req, __entry->sock, __entry->fd
> +	)
> +);
> +#define DEFINE_HANDSHAKE_FD_EVENT(name)				\
> +	DEFINE_EVENT(handshake_fd_class, name,			\
> +		TP_PROTO(					\
> +			const struct net *net,			\
> +			const struct handshake_req *req,	\
> +			const struct socket *sock,		\
> +			int fd					\
> +		),						\
> +		TP_ARGS(net, req, sock, fd))
> +
> +DECLARE_EVENT_CLASS(handshake_error_class,
> +	TP_PROTO(
> +		const struct net *net,
> +		const struct handshake_req *req,
> +		const struct socket *sock,
> +		int err
> +	),
> +	TP_ARGS(net, req, sock, err),
> +	TP_STRUCT__entry(
> +		__field(const void *, req)
> +		__field(const void *, sock)
> +		__field(int, err)
> +		__field(unsigned int, netns_ino)
> +	),
> +	TP_fast_assign(
> +		__entry->req = req;
> +		__entry->sock = sock;
> +		__entry->err = err;
> +		__entry->netns_ino = net->ns.inum;
> +	),
> +	TP_printk("req=%p sock=%p err=%d",
> +		__entry->req, __entry->sock, __entry->err
> +	)
> +);
> +#define DEFINE_HANDSHAKE_ERROR(name)				\
> +	DEFINE_EVENT(handshake_error_class, name,		\
> +		TP_PROTO(					\
> +			const struct net *net,			\
> +			const struct handshake_req *req,	\
> +			const struct socket *sock,		\
> +			int err					\
> +		),						\
> +		TP_ARGS(net, req, sock, err))
> +
> +
> +/**
> + ** Request lifetime events
> + **/
> +
> +DEFINE_HANDSHAKE_EVENT(handshake_submit);
> +DEFINE_HANDSHAKE_ERROR(handshake_submit_err);
> +DEFINE_HANDSHAKE_EVENT(handshake_cancel);
> +DEFINE_HANDSHAKE_EVENT(handshake_cancel_none);
> +DEFINE_HANDSHAKE_EVENT(handshake_cancel_busy);
> +DEFINE_HANDSHAKE_EVENT(handshake_destruct);
> +
> +
> +TRACE_EVENT(handshake_complete,
> +	TP_PROTO(
> +		const struct net *net,
> +		const struct handshake_req *req,
> +		const struct socket *sock,
> +		int status
> +	),
> +	TP_ARGS(net, req, sock, status),
> +	TP_STRUCT__entry(
> +		__field(const void *, req)
> +		__field(const void *, sock)
> +		__field(int, status)
> +		__field(unsigned int, netns_ino)
> +	),
> +	TP_fast_assign(
> +		__entry->req = req;
> +		__entry->sock = sock;
> +		__entry->status = status;
> +		__entry->netns_ino = net->ns.inum;
> +	),
> +	TP_printk("req=%p sock=%p status=%d",
> +		__entry->req, __entry->sock, __entry->status
> +	)
> +);
> +
> +/**
> + ** Netlink events
> + **/
> +
> +DEFINE_HANDSHAKE_ERROR(handshake_notify_err);
> +DEFINE_HANDSHAKE_FD_EVENT(handshake_cmd_accept);
> +DEFINE_HANDSHAKE_ERROR(handshake_cmd_accept_err);
> +DEFINE_HANDSHAKE_FD_EVENT(handshake_cmd_done);
> +DEFINE_HANDSHAKE_ERROR(handshake_cmd_done_err);
> +
> +#endif /* _TRACE_HANDSHAKE_H */
> +
> +#include <trace/define_trace.h>
> diff --git a/include/uapi/linux/handshake.h b/include/uapi/linux/handshake.h
> new file mode 100644
> index 000000000000..09fd7c37cba4
> --- /dev/null
> +++ b/include/uapi/linux/handshake.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* Do not edit directly, auto-generated from: */
> +/*	Documentation/netlink/specs/handshake.yaml */
> +/* YNL-GEN uapi header */
> +
> +#ifndef _UAPI_LINUX_HANDSHAKE_H
> +#define _UAPI_LINUX_HANDSHAKE_H
> +
> +#define HANDSHAKE_FAMILY_NAME		"handshake"
> +#define HANDSHAKE_FAMILY_VERSION	1
> +
> +enum {
> +	HANDSHAKE_HANDLER_CLASS_NONE,
> +};
> +
> +enum {
> +	HANDSHAKE_MSG_TYPE_UNSPEC,
> +	HANDSHAKE_MSG_TYPE_CLIENTHELLO,
> +	HANDSHAKE_MSG_TYPE_SERVERHELLO,
> +};
> +
> +enum {
> +	HANDSHAKE_AUTH_UNSPEC,
> +	HANDSHAKE_AUTH_UNAUTH,
> +	HANDSHAKE_AUTH_X509,
> +	HANDSHAKE_AUTH_PSK,
> +};
> +
> +enum {
> +	HANDSHAKE_A_ACCEPT_STATUS = 1,
> +	HANDSHAKE_A_ACCEPT_SOCKFD,
> +	HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
> +	HANDSHAKE_A_ACCEPT_MESSAGE_TYPE,
> +	HANDSHAKE_A_ACCEPT_AUTH,
> +	HANDSHAKE_A_ACCEPT_GNUTLS_PRIORITIES,
> +	HANDSHAKE_A_ACCEPT_MY_PEERID,
> +	HANDSHAKE_A_ACCEPT_MY_PRIVKEY,
> +
> +	__HANDSHAKE_A_ACCEPT_MAX,
> +	HANDSHAKE_A_ACCEPT_MAX = (__HANDSHAKE_A_ACCEPT_MAX - 1)
> +};
> +
> +enum {
> +	HANDSHAKE_A_DONE_STATUS = 1,
> +	HANDSHAKE_A_DONE_SOCKFD,
> +	HANDSHAKE_A_DONE_REMOTE_PEERID,
> +
> +	__HANDSHAKE_A_DONE_MAX,
> +	HANDSHAKE_A_DONE_MAX = (__HANDSHAKE_A_DONE_MAX - 1)
> +};
> +
> +enum {
> +	HANDSHAKE_CMD_READY = 1,
> +	HANDSHAKE_CMD_ACCEPT,
> +	HANDSHAKE_CMD_DONE,
> +
> +	__HANDSHAKE_CMD_MAX,
> +	HANDSHAKE_CMD_MAX = (__HANDSHAKE_CMD_MAX - 1)
> +};
> +
> +#define HANDSHAKE_MCGRP_NONE	"none"
> +
> +#endif /* _UAPI_LINUX_HANDSHAKE_H */
> diff --git a/net/Makefile b/net/Makefile
> index 0914bea9c335..adbb64277601 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -79,3 +79,4 @@ obj-$(CONFIG_NET_NCSI)		+= ncsi/
>   obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
>   obj-$(CONFIG_MPTCP)		+= mptcp/
>   obj-$(CONFIG_MCTP)		+= mctp/
> +obj-y				+= handshake/
> diff --git a/net/handshake/Makefile b/net/handshake/Makefile
> new file mode 100644
> index 000000000000..a41b03f4837b
> --- /dev/null
> +++ b/net/handshake/Makefile
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Makefile for the Generic HANDSHAKE service
> +#
> +# Author: Chuck Lever <chuck.lever@oracle.com>
> +#
> +# Copyright (c) 2023, Oracle and/or its affiliates.
> +#
> +
> +obj-y += handshake.o
> +handshake-y := netlink.o request.o trace.o
> diff --git a/net/handshake/handshake.h b/net/handshake/handshake.h
> new file mode 100644
> index 000000000000..366c7659ec09
> --- /dev/null
> +++ b/net/handshake/handshake.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Generic netlink handshake service
> + *
> + * Author: Chuck Lever <chuck.lever@oracle.com>
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +/*
> + * Data structures and functions that are visible only within the
> + * handshake module are declared here.
> + */
> +
> +#ifndef _INTERNAL_HANDSHAKE_H
> +#define _INTERNAL_HANDSHAKE_H
> +
> +/*
> + * One handshake request
> + */
> +struct handshake_req {
> +	struct list_head		hr_list;
> +	unsigned long			hr_flags;
> +	const struct handshake_proto	*hr_proto;
> +	struct socket			*hr_sock;
> +
> +	void				(*hr_saved_destruct)(struct sock *sk);
> +};
> +
> +#define HANDSHAKE_F_COMPLETED	BIT(0)
> +
> +/* netlink.c */
> +extern bool handshake_genl_inited;
> +int handshake_genl_notify(struct net *net, int handler_class, gfp_t flags);
> +
> +/* request.c */
> +void __remove_pending_locked(struct net *net, struct handshake_req *req);
> +void handshake_complete(struct handshake_req *req, int status,
> +			struct nlattr **tb);
> +
> +#endif /* _INTERNAL_HANDSHAKE_H */
> diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
> new file mode 100644
> index 000000000000..581e382236cf
> --- /dev/null
> +++ b/net/handshake/netlink.c
> @@ -0,0 +1,340 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Generic netlink handshake service
> + *
> + * Author: Chuck Lever <chuck.lever@oracle.com>
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/inet.h>
> +
> +#include <net/sock.h>
> +#include <net/genetlink.h>
> +#include <net/handshake.h>
> +
> +#include <uapi/linux/handshake.h>
> +#include <trace/events/handshake.h>
> +#include "handshake.h"
> +
> +static struct genl_family __ro_after_init handshake_genl_family;
> +bool handshake_genl_inited;
> +
> +/**
> + * handshake_genl_notify - Notify handlers that a request is waiting
> + * @net: target network namespace
> + * @handler_class: target handler
> + * @flags: memory allocation control flags
> + *
> + * Returns zero on success or a negative errno if notification failed.
> + */
> +int handshake_genl_notify(struct net *net, int handler_class, gfp_t flags)
> +{
> +	struct sk_buff *msg;
> +	void *hdr;
> +
> +	if (!genl_has_listeners(&handshake_genl_family, net, handler_class))
> +		return -ESRCH;
> +
> +	msg = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +
> +	hdr = genlmsg_put(msg, 0, 0, &handshake_genl_family, 0,
> +			  HANDSHAKE_CMD_READY);
> +	if (!hdr)
> +		goto out_free;
> +
> +	if (nla_put_u32(msg, HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
> +			handler_class) < 0) {
> +		genlmsg_cancel(msg, hdr);
> +		goto out_free;
> +	}
> +
> +	genlmsg_end(msg, hdr);
> +	return genlmsg_multicast_netns(&handshake_genl_family, net, msg,
> +				       0, handler_class, flags);
> +
> +out_free:
> +	nlmsg_free(msg);
> +	return -EMSGSIZE;
> +}
> +
> +/**
> + * handshake_genl_put - Create a generic netlink message header
> + * @msg: buffer in which to create the header
> + * @gi: generic netlink message context
> + *
> + * Returns a ready-to-use header, or NULL.
> + */
> +struct nlmsghdr *handshake_genl_put(struct sk_buff *msg, struct genl_info *gi)
> +{
> +	return genlmsg_put(msg, gi->snd_portid, gi->snd_seq,
> +			   &handshake_genl_family, 0, gi->genlhdr->cmd);
> +}
> +EXPORT_SYMBOL(handshake_genl_put);
> +
> +static int handshake_status_reply(struct sk_buff *skb, struct genl_info *gi,
> +				  int status)
> +{
> +	struct nlmsghdr *hdr;
> +	struct sk_buff *msg;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	msg = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!msg)
> +		goto out;
> +	hdr = handshake_genl_put(msg, gi);
> +	if (!hdr)
> +		goto out_free;
> +
> +	ret = -EMSGSIZE;
> +	ret = nla_put_u32(msg, HANDSHAKE_A_ACCEPT_STATUS, status);
> +	if (ret < 0)
> +		goto out_free;
> +
> +	genlmsg_end(msg, hdr);
> +	return genlmsg_reply(msg, gi);
> +
> +out_free:
> +	genlmsg_cancel(msg, hdr);
> +out:
> +	return ret;
> +}
> +
> +/*
> + * dup() a kernel socket for use as a user space file descriptor
> + * in the current process.
> + *
> + * Implicit argument: "current()"
> + */
> +static int handshake_dup(struct socket *kernsock)
> +{
> +	struct file *file = get_file(kernsock->file);
> +	int newfd;
> +
> +	newfd = get_unused_fd_flags(O_CLOEXEC);
> +	if (newfd < 0) {
> +		fput(file);
> +		return newfd;
> +	}
> +
> +	fd_install(newfd, file);
> +	return newfd;
> +}
> +
> +static const struct nla_policy
> +handshake_accept_nl_policy[HANDSHAKE_A_ACCEPT_HANDLER_CLASS + 1] = {
> +	[HANDSHAKE_A_ACCEPT_HANDLER_CLASS] = { .type = NLA_U32, },
> +};
> +
> +static int handshake_nl_accept_doit(struct sk_buff *skb, struct genl_info *gi)
> +{
> +	struct nlattr *tb[HANDSHAKE_A_ACCEPT_MAX + 1];
> +	struct net *net = sock_net(skb->sk);
> +	struct handshake_req *pos, *req;
> +	int fd, err;
> +
> +	err = -EINVAL;
> +	if (genlmsg_parse(nlmsg_hdr(skb), &handshake_genl_family, tb,
> +			  HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
> +			  handshake_accept_nl_policy, NULL))
> +		goto out_status;
> +	if (!tb[HANDSHAKE_A_ACCEPT_HANDLER_CLASS])
> +		goto out_status;
> +
> +	req = NULL;
> +	spin_lock(&net->hs_lock);
> +	list_for_each_entry(pos, &net->hs_requests, hr_list) {
> +		if (pos->hr_proto->hp_handler_class !=
> +		    nla_get_u32(tb[HANDSHAKE_A_ACCEPT_HANDLER_CLASS]))
> +			continue;
> +		__remove_pending_locked(net, pos);
> +		req = pos;
> +		break;
> +	}
> +	spin_unlock(&net->hs_lock);
> +	if (!req)
> +		goto out_status;
> +
> +	fd = handshake_dup(req->hr_sock);
> +	if (fd < 0) {
> +		err = fd;
> +		goto out_complete;
> +	}
> +	err = req->hr_proto->hp_accept(req, gi, fd);
> +	if (err)
> +		goto out_complete;
> +
> +	trace_handshake_cmd_accept(net, req, req->hr_sock, fd);
> +	return 0;
> +
> +out_complete:
> +	handshake_complete(req, -EIO, NULL);
> +	fput(req->hr_sock->file);
> +out_status:
> +	trace_handshake_cmd_accept_err(net, req, NULL, err);
> +	return handshake_status_reply(skb, gi, err);
> +}
> +
> +static const struct nla_policy
> +handshake_done_nl_policy[HANDSHAKE_A_DONE_MAX + 1] = {
> +	[HANDSHAKE_A_DONE_SOCKFD] = { .type = NLA_U32, },
> +	[HANDSHAKE_A_DONE_STATUS] = { .type = NLA_U32, },
> +	[HANDSHAKE_A_DONE_REMOTE_PEERID] = { .type = NLA_U32, },
> +};
> +
> +static int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *gi)
> +{
> +	struct nlattr *tb[HANDSHAKE_A_DONE_MAX + 1];
> +	struct net *net = sock_net(skb->sk);
> +	struct socket *sock = NULL;
> +	struct handshake_req *req;
> +	int fd, status, err;
> +
> +	err = genlmsg_parse(nlmsg_hdr(skb), &handshake_genl_family, tb,
> +			    HANDSHAKE_A_DONE_MAX, handshake_done_nl_policy,
> +			    NULL);
> +	if (err || !tb[HANDSHAKE_A_DONE_SOCKFD]) {
> +		err = -EINVAL;
> +		goto out_status;
> +	}
> +
> +	fd = nla_get_u32(tb[HANDSHAKE_A_DONE_SOCKFD]);
> +
> +	err = 0;
> +	sock = sockfd_lookup(fd, &err);
> +	if (err) {
> +		err = -EBADF;
> +		goto out_status;
> +	}
> +
> +	req = sock->sk->sk_handshake_req;
> +	if (!req) {
> +		err = -EBUSY;
> +		goto out_status;
> +	}
> +
> +	trace_handshake_cmd_done(net, req, sock, fd);
> +
> +	status = -EIO;
> +	if (tb[HANDSHAKE_A_DONE_STATUS])
> +		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);
> +
And this makes me ever so slightly uneasy.

'status' is a netlink attribute of type u32, so it is unsigned on the
wire. Yet we assume that 'status' is a negative number, leaving us
_technically_ in uncharted territory.

And that is notwithstanding the problem that we haven't even defined
_what_ should be in the status attribute.

Reading the code I assume that it's either '0' for success or a negative
number (i.e. the error code) on failure.
Which implicitly means that we _never_ set a positive number here.
So what would we lose if we declared 'status' to carry the _positive_
error number instead?
It would bring us in line with the actual netlink attribute definition,
and we wouldn't need to worry about possible integer overflows, yadda yadda...

Hmm?
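
To make the two conventions concrete, here is a small user-space sketch.
The helper names and HS_MAX_ERRNO are hypothetical and not part of the
patch; the 4095 bound matches the kernel's MAX_ERRNO. It only shows what
a negative errno looks like once it is stored in a u32, and how a
positive-errno convention could be range-checked on receipt.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define HS_MAX_ERRNO 4095	/* same bound as the kernel's MAX_ERRNO */

/* Current scheme: a negative errno squeezed into a u32 attribute. */
uint32_t hs_status_to_wire_negative(int err)
{
	return (uint32_t)err;	/* -5 travels as 0xfffffffb */
}

/* Suggested scheme: the wire carries 0 or a positive errno. */
uint32_t hs_status_to_wire_positive(int err)
{
	assert(err <= 0 && err >= -HS_MAX_ERRNO);
	return (uint32_t)(-err);
}

/* Receiver: range-check first, then go back to the negative convention. */
int hs_status_from_wire_positive(uint32_t wire)
{
	if (wire > HS_MAX_ERRNO)
		return -EINVAL;
	return -(int)wire;
}
```

With the positive convention, a value such as 0xfffffffb is simply out of
range and can be rejected instead of being reinterpreted as a signed int.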

> +	handshake_complete(req, status, tb);
> +	fput(sock->file);
> +	return 0;
> +
> +out_status:
> +	trace_handshake_cmd_done_err(net, req, sock, err);
> +	return handshake_status_reply(skb, gi, err);
> +}
> +
> +static const struct genl_split_ops handshake_nl_ops[] = {
> +	{
> +		.cmd		= HANDSHAKE_CMD_ACCEPT,
> +		.doit		= handshake_nl_accept_doit,
> +		.policy		= handshake_accept_nl_policy,
> +		.maxattr	= HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> +	},
> +	{
> +		.cmd		= HANDSHAKE_CMD_DONE,
> +		.doit		= handshake_nl_done_doit,
> +		.policy		= handshake_done_nl_policy,
> +		.maxattr	= HANDSHAKE_A_DONE_REMOTE_PEERID,
> +		.flags		= GENL_CMD_CAP_DO,
> +	},
> +};
> +
> +static const struct genl_multicast_group handshake_nl_mcgrps[] = {
> +	[HANDSHAKE_HANDLER_CLASS_NONE] = { .name = HANDSHAKE_MCGRP_NONE, },
> +};
> +
> +static struct genl_family __ro_after_init handshake_genl_family = {
> +	.hdrsize		= 0,
> +	.name			= HANDSHAKE_FAMILY_NAME,
> +	.version		= HANDSHAKE_FAMILY_VERSION,
> +	.netnsok		= true,
> +	.parallel_ops		= true,
> +	.n_mcgrps		= ARRAY_SIZE(handshake_nl_mcgrps),
> +	.n_split_ops		= ARRAY_SIZE(handshake_nl_ops),
> +	.split_ops		= handshake_nl_ops,
> +	.mcgrps			= handshake_nl_mcgrps,
> +	.module			= THIS_MODULE,
> +};
> +
> +static int __net_init handshake_net_init(struct net *net)
> +{
> +	spin_lock_init(&net->hs_lock);
> +	INIT_LIST_HEAD(&net->hs_requests);
> +	net->hs_pending	= 0;
> +	return 0;
> +}
> +
> +static void __net_exit handshake_net_exit(struct net *net)
> +{
> +	struct handshake_req *req;
> +	LIST_HEAD(requests);
> +
> +	/*
> +	 * This drains the net's pending list. Requests that
> +	 * have been accepted and are in progress will be
> +	 * destroyed when the socket is closed.
> +	 */
> +	spin_lock(&net->hs_lock);
> +	list_splice_init(&requests, &net->hs_requests);
> +	spin_unlock(&net->hs_lock);
> +
> +	while (!list_empty(&requests)) {
> +		req = list_first_entry(&requests, struct handshake_req, hr_list);
> +		list_del(&req->hr_list);
> +
> +		/*
> +		 * Requests on this list have not yet been
> +		 * accepted, so they do not have an fd to put.
> +		 */
> +
> +		handshake_complete(req, -ETIMEDOUT, NULL);
> +	}
> +}
> +
> +static struct pernet_operations handshake_genl_net_ops = {
> +	.init		= handshake_net_init,
> +	.exit		= handshake_net_exit,
> +};
> +
> +static int __init handshake_init(void)
> +{
> +	int ret;
> +
> +	ret = genl_register_family(&handshake_genl_family);
> +	if (ret) {
> +		pr_warn("handshake: netlink registration failed (%d)\n", ret);
> +		return ret;
> +	}
> +
> +	ret = register_pernet_subsys(&handshake_genl_net_ops);
> +	if (ret) {
> +		pr_warn("handshake: pernet registration failed (%d)\n", ret);
> +		genl_unregister_family(&handshake_genl_family);
> +	}
> +
> +	handshake_genl_inited = true;
> +	return ret;
> +}
> +
> +static void __exit handshake_exit(void)
> +{
> +	unregister_pernet_subsys(&handshake_genl_net_ops);
> +	genl_unregister_family(&handshake_genl_family);
> +}
> +
> +module_init(handshake_init);
> +module_exit(handshake_exit);
> diff --git a/net/handshake/request.c b/net/handshake/request.c
> new file mode 100644
> index 000000000000..1d3b8e76dd2c
> --- /dev/null
> +++ b/net/handshake/request.c
> @@ -0,0 +1,246 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Handshake request lifetime events
> + *
> + * Author: Chuck Lever <chuck.lever@oracle.com>
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/inet.h>
> +#include <linux/fdtable.h>
> +
> +#include <net/sock.h>
> +#include <net/genetlink.h>
> +#include <net/handshake.h>
> +
> +#include <uapi/linux/handshake.h>
> +#include <trace/events/handshake.h>
> +#include "handshake.h"
> +
> +/*
> + * This limit is to prevent slow remotes from causing denial of service.
> + * A ulimit-style tunable might be used instead.
> + */
> +#define HANDSHAKE_PENDING_MAX (10)
> +
> +static void __add_pending_locked(struct net *net, struct handshake_req *req)
> +{
> +	net->hs_pending++;
> +	list_add_tail(&req->hr_list, &net->hs_requests);
> +}
> +
> +void __remove_pending_locked(struct net *net, struct handshake_req *req)
> +{
> +	net->hs_pending--;
> +	list_del_init(&req->hr_list);
> +}
> +
> +/*
> + * Return values:
> + *   %true - the request was found on @net's pending list
> + *   %false - the request was not found on @net's pending list
> + *
> + * If @req was on a pending list, it has not yet been accepted.
> + */
> +static bool remove_pending(struct net *net, struct handshake_req *req)
> +{
> +	bool ret;
> +
> +	ret = false;
> +
> +	spin_lock(&net->hs_lock);
> +	if (!list_empty(&req->hr_list)) {
> +		__remove_pending_locked(net, req);
> +		ret = true;
> +	}
> +	spin_unlock(&net->hs_lock);
> +
> +	return ret;
> +}
> +
> +static void handshake_req_destroy(struct handshake_req *req, struct sock *sk)
> +{
> +	req->hr_proto->hp_destroy(req);
> +	sk->sk_handshake_req = NULL;
> +	kfree(req);
> +}
> +
> +static void handshake_sk_destruct(struct sock *sk)
> +{
> +	struct handshake_req *req = sk->sk_handshake_req;
> +
> +	if (req) {
> +		trace_handshake_destruct(sock_net(sk), req, req->hr_sock);
> +		handshake_req_destroy(req, sk);
> +	}
> +}
> +
> +/**
> + * handshake_req_alloc - consumer API to allocate a request
> + * @sock: open socket on which to perform a handshake
> + * @proto: security protocol
> + * @flags: memory allocation flags
> + *
> + * Returns an initialized handshake_req or NULL.
> + */
> +struct handshake_req *handshake_req_alloc(struct socket *sock,
> +					  const struct handshake_proto *proto,
> +					  gfp_t flags)
> +{
> +	struct handshake_req *req;
> +
> +	/* Avoid accessing uninitialized global variables later on */
> +	if (!handshake_genl_inited)
> +		return NULL;
> +
> +	req = kzalloc(sizeof(*req) + proto->hp_privsize, flags);
> +	if (!req)
> +		return NULL;
> +
> +	sock_hold(sock->sk);
> +
> +	INIT_LIST_HEAD(&req->hr_list);
> +	req->hr_sock = sock;
> +	req->hr_proto = proto;
> +	return req;
> +}
> +EXPORT_SYMBOL(handshake_req_alloc);
> +
> +/**
> + * handshake_req_private - consumer API to return per-handshake private data
> + * @req: handshake arguments
> + *
> + */
> +void *handshake_req_private(struct handshake_req *req)
> +{
> +	return (void *)(req + 1);
> +}
> +EXPORT_SYMBOL(handshake_req_private);
> +
> +/**
> + * handshake_req_submit - consumer API to submit a handshake request
> + * @req: handshake arguments
> + * @flags: memory allocation flags
> + *
> + * Return values:
> + *   %0: Request queued
> + *   %-EBUSY: A handshake is already under way for this socket
> + *   %-ESRCH: No handshake agent is available
> + *   %-EAGAIN: Too many pending handshake requests
> + *   %-ENOMEM: Failed to allocate memory
> + *   %-EMSGSIZE: Failed to construct notification message
> + *
> + * A zero return value from handshake_request() means that
> + * exactly one subsequent completion callback is guaranteed.
> + *
> + * A negative return value from handshake_request() means that
> + * no completion callback will be done and that @req is
> + * destroyed.
> + */
> +int handshake_req_submit(struct handshake_req *req, gfp_t flags)
> +{
> +	struct socket *sock = req->hr_sock;
> +	struct sock *sk = sock->sk;
> +	struct net *net = sock_net(sk);
> +	int ret;
> +
> +	ret = -EAGAIN;
> +	if (READ_ONCE(net->hs_pending) >= HANDSHAKE_PENDING_MAX)
> +		goto out_err;
> +
> +	ret = -EBUSY;
> +	spin_lock(&net->hs_lock);
> +	if (sk->sk_handshake_req || !list_empty(&req->hr_list)) {
> +		spin_unlock(&net->hs_lock);
> +		goto out_err;
> +	}
> +	req->hr_saved_destruct = sk->sk_destruct;
> +	sk->sk_destruct = handshake_sk_destruct;
> +	sk->sk_handshake_req = req;
> +	__add_pending_locked(net, req);
> +	spin_unlock(&net->hs_lock);
> +
> +	ret = handshake_genl_notify(net, req->hr_proto->hp_handler_class,
> +				    flags);
> +	if (ret) {
> +		trace_handshake_notify_err(net, req, sock, ret);
> +		if (remove_pending(net, req))
> +			goto out_err;
> +	}
> +
> +	trace_handshake_submit(net, req, sock);
> +	return 0;
> +
> +out_err:
> +	trace_handshake_submit_err(net, req, sock, ret);
> +	handshake_req_destroy(req, sk);
> +	return ret;
> +}
> +EXPORT_SYMBOL(handshake_req_submit);
> +
> +void handshake_complete(struct handshake_req *req, int status,
> +			struct nlattr **tb)
> +{
> +	struct socket *sock = req->hr_sock;
> +	struct net *net = sock_net(sock->sk);
> +
> +	if (!test_and_set_bit(HANDSHAKE_F_COMPLETED, &req->hr_flags)) {
> +		trace_handshake_complete(net, req, sock, status);
> +		req->hr_proto->hp_done(req, status, tb);
> +		__sock_put(sock->sk);
> +	}
> +}
> +
> +/**
> + * handshake_req_cancel - consumer API to cancel an in-progress handshake
> + * @sock: socket on which there is an ongoing handshake
> + *
> + * XXX: Perhaps killing the user space agent might also be necessary?

I thought we had agreed that we would be sending a signal to the
userspace process?
Ideally we would send a SIGHUP, wait some time for the userspace
process to respond with a 'done' message, and send a SIGKILL if we
haven't received one.

Note: Sending a KILL signal would imply that userspace is able to cope
with children dying, which pretty much excludes pthreads, I would think.

Guess I'll have to consult Stevens :-)
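
For illustration only, the SIGHUP-then-KILL escalation described above can
be sketched from the point of view of a user space supervisor that forked
the handshake agent. This is not the kernel-side mechanism under
discussion; stop_agent() and its grace_ms parameter are hypothetical names,
and the 10 ms polling interval is an arbitrary assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask a child agent to stop with SIGHUP, poll for
 * up to grace_ms milliseconds, then fall back to SIGKILL.
 *
 * Returns true if the child exited within the grace period, false if
 * SIGKILL had to be sent.
 */
static bool stop_agent(pid_t pid, int grace_ms)
{
	const struct timespec tick = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int status;

	/* pid <= 0 would signal a whole process group, so refuse it */
	assert(pid > 0);

	kill(pid, SIGHUP);
	for (int waited = 0; waited < grace_ms; waited += 10) {
		if (waitpid(pid, &status, WNOHANG) == pid)
			return true;
		nanosleep(&tick, NULL);
	}

	/* No response in time: escalate and reap the child */
	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return false;
}
```

The sketch relies on the agent being a separate process that the
supervisor can reap, which is exactly the constraint noted above for a
threaded agent.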

> + *
> + * Request cancellation races with request completion. To determine
> + * who won, callers examine the return value from this function.
> + *
> + * Return values:
> + *   %0 - Uncompleted handshake request was canceled or not found
> + *   %-EBUSY - Handshake request already completed

EBUSY? Wouldn't EAGAIN be more appropriate?
After all, the request is everything _but_ busy...
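
For reference, the cancel/complete race that this return value reports can
be modelled in user space with a C11 atomic_flag standing in for the
HANDSHAKE_F_COMPLETED bit. This is only a sketch: struct req_sketch,
sketch_complete() and sketch_cancel() are hypothetical names, and the
pending-list handling in the patch is left out.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the HANDSHAKE_F_COMPLETED bit in hr_flags */
struct req_sketch {
	atomic_flag completed;
};

/*
 * Completion path: returns 1 if this call won the race (the point at
 * which the patch invokes hp_done), 0 if the flag was already set.
 */
static int sketch_complete(struct req_sketch *req)
{
	if (atomic_flag_test_and_set(&req->completed))
		return 0;
	return 1;
}

/*
 * Cancel path: returns 0 if the request was canceled before completion,
 * or -EBUSY if completion (or an earlier cancel) set the flag first.
 */
static int sketch_cancel(struct req_sketch *req)
{
	if (atomic_flag_test_and_set(&req->completed))
		return -EBUSY;
	return 0;
}
```

Whichever path sets the flag first wins; the loser sees the flag already
set, so the error code is purely a "you lost the race" indication.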

> + */
> +int handshake_req_cancel(struct socket *sock)
> +{
> +	struct handshake_req *req;
> +	struct sock *sk;
> +	struct net *net;
> +
> +	if (!sock)
> +		return 0;
> +
> +	sk = sock->sk;
> +	req = sk->sk_handshake_req;
> +	net = sock_net(sk);
> +
> +	if (!req) {
> +		trace_handshake_cancel_none(net, req, sock);
> +		return 0;
> +	}
> +
> +	if (remove_pending(net, req)) {
> +		/* Request hadn't been accepted */
> +		trace_handshake_cancel(net, req, sock);
> +		return 0;
> +	}
> +	if (test_and_set_bit(HANDSHAKE_F_COMPLETED, &req->hr_flags)) {
> +		/* Request already completed */
> +		trace_handshake_cancel_busy(net, req, sock);
> +		return -EBUSY;
> +	}
> +
> +	__sock_put(sk);
> +	trace_handshake_cancel(net, req, sock);
> +	return 0;
> +}
> +EXPORT_SYMBOL(handshake_req_cancel);
> diff --git a/net/handshake/trace.c b/net/handshake/trace.c
> new file mode 100644
> index 000000000000..3a5b6f29a2b8
> --- /dev/null
> +++ b/net/handshake/trace.c
> @@ -0,0 +1,17 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Trace points for transport security layer handshakes.
> + *
> + * Author: Chuck Lever <chuck.lever@oracle.com>
> + *
> + * Copyright (c) 2023, Oracle and/or its affiliates.
> + */
> +
> +#include <linux/types.h>
> +#include <net/sock.h>
> +
> +#include "handshake.h"
> +
> +#define CREATE_TRACE_POINTS
> +
> +#include <trace/events/handshake.h>
> 
Cheers,

Hannes
Chuck Lever III Feb. 27, 2023, 2:59 p.m. UTC | #2
> On Feb 27, 2023, at 4:24 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 2/24/23 20:19, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>> When a kernel consumer needs a transport layer security session, it
>> first needs a handshake to negotiate and establish a session. This
>> negotiation can be done in user space via one of the several
>> existing library implementations, or it can be done in the kernel.
>> No in-kernel handshake implementations yet exist. In their absence,
>> we add a netlink service that can:
>> a. Notify a user space daemon that a handshake is needed.
>> b. Once notified, the daemon calls the kernel back via this
>>    netlink service to get the handshake parameters, including an
>>    open socket on which to establish the session.
>> c. Once the handshake is complete, the daemon reports the
>>    session status and other information via a second netlink
>>    operation. This operation marks that it is safe for the
>>    kernel to use the open socket and the security session
>>    established there.
>> The notification service uses a multicast group. Each handshake
>> mechanism (eg, tlshd) adopts its own group number so that the
>> handshake services are completely independent of one another. The
>> kernel can then tell via netlink_has_listeners() whether a handshake
>> service is active and prepared to handle a handshake request.
>> A new netlink operation, ACCEPT, acts like accept(2) in that it
>> instantiates a file descriptor in the user space daemon's fd table.
>> If this operation is successful, the reply carries the fd number,
>> which can be treated as an open and ready file descriptor.
>> While user space is performing the handshake, the kernel keeps its
>> muddy paws off the open socket. A second new netlink operation,
>> DONE, indicates that the user space daemon is finished with the
>> socket and it is safe for the kernel to use again. The operation
>> also indicates whether a session was established successfully.
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  Documentation/netlink/specs/handshake.yaml |  134 +++++++++++
>>  include/net/handshake.h                    |   45 ++++
>>  include/net/net_namespace.h                |    5
>>  include/net/sock.h                         |    1
>>  include/trace/events/handshake.h           |  159 +++++++++++++
>>  include/uapi/linux/handshake.h             |   63 +++++
>>  net/Makefile                               |    1
>>  net/handshake/Makefile                     |   11 +
>>  net/handshake/handshake.h                  |   41 +++
>>  net/handshake/netlink.c                    |  340 ++++++++++++++++++++++++++++
>>  net/handshake/request.c                    |  246 ++++++++++++++++++++
>>  net/handshake/trace.c                      |   17 +
>>  12 files changed, 1063 insertions(+)
>>  create mode 100644 Documentation/netlink/specs/handshake.yaml
>>  create mode 100644 include/net/handshake.h
>>  create mode 100644 include/trace/events/handshake.h
>>  create mode 100644 include/uapi/linux/handshake.h
>>  create mode 100644 net/handshake/Makefile
>>  create mode 100644 net/handshake/handshake.h
>>  create mode 100644 net/handshake/netlink.c
>>  create mode 100644 net/handshake/request.c
>>  create mode 100644 net/handshake/trace.c
>> diff --git a/Documentation/netlink/specs/handshake.yaml b/Documentation/netlink/specs/handshake.yaml
>> new file mode 100644
>> index 000000000000..683a8f2df0a7
>> --- /dev/null
>> +++ b/Documentation/netlink/specs/handshake.yaml
>> @@ -0,0 +1,134 @@
>> +# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
>> +#
>> +# GENL HANDSHAKE service.
>> +#
>> +# Author: Chuck Lever <chuck.lever@oracle.com>
>> +#
>> +# Copyright (c) 2023, Oracle and/or its affiliates.
>> +#
>> +
>> +name: handshake
>> +
>> +protocol: genetlink-c
>> +
>> +doc: Netlink protocol to request a transport layer security handshake.
>> +
>> +uapi-header: linux/net/handshake.h
>> +
>> +definitions:
>> +  -
>> +    type: enum
>> +    name: handler-class
>> +    enum-name:
>> +    value-start: 0
>> +    entries: [ none ]
>> +  -
>> +    type: enum
>> +    name: msg-type
>> +    enum-name:
>> +    value-start: 0
>> +    entries: [ unspec, clienthello, serverhello ]
>> +  -
>> +    type: enum
>> +    name: auth
>> +    enum-name:
>> +    value-start: 0
>> +    entries: [ unspec, unauth, x509, psk ]
>> +
>> +attribute-sets:
>> +  -
>> +    name: accept
>> +    attributes:
>> +      -
>> +        name: status
>> +        doc: Status of this accept operation
>> +        type: u32
>> +        value: 1
>> +      -
>> +        name: sockfd
>> +        doc: File descriptor of socket to use
>> +        type: u32
>> +      -
>> +        name: handler-class
>> +        doc: Which type of handler is responding
>> +        type: u32
>> +        enum: handler-class
>> +      -
>> +        name: message-type
>> +        doc: Handshake message type
>> +        type: u32
>> +        enum: msg-type
>> +      -
>> +        name: auth
>> +        doc: Authentication mode
>> +        type: u32
>> +        enum: auth
>> +      -
>> +        name: gnutls-priorities
>> +        doc: GnuTLS priority string
>> +        type: string
>> +      -
>> +        name: my-peerid
>> +        doc: Serial no of key containing local identity
>> +        type: u32
>> +      -
>> +        name: my-privkey
>> +        doc: Serial no of key containing optional private key
>> +        type: u32
>> +  -
>> +    name: done
>> +    attributes:
>> +      -
>> +        name: status
>> +        doc: Session status
>> +        type: u32
>> +        value: 1
>> +      -
>> +        name: sockfd
>> +        doc: File descriptor of socket that has completed
>> +        type: u32
>> +      -
>> +        name: remote-peerid
>> +        doc: Serial no of keys containing identities of remote peer
>> +        type: u32
>> +
>> +operations:
>> +  list:
>> +    -
>> +      name: ready
>> +      doc: Notify handlers that a new handshake request is waiting
>> +      value: 1
>> +      notify: accept
>> +    -
>> +      name: accept
>> +      doc: Handler retrieves next queued handshake request
>> +      attribute-set: accept
>> +      flags: [ admin-perm ]
>> +      do:
>> +        request:
>> +          attributes:
>> +            - handler-class
>> +        reply:
>> +          attributes:
>> +            - status
>> +            - sockfd
>> +            - message-type
>> +            - auth
>> +            - gnutls-priorities
>> +            - my-peerid
>> +            - my-privkey
>> +    -
>> +      name: done
>> +      doc: Handler reports handshake completion
>> +      attribute-set: done
>> +      do:
>> +        request:
>> +          attributes:
>> +            - status
>> +            - sockfd
>> +            - remote-peerid
>> +
>> +mcast-groups:
>> +  list:
>> +    -
>> +      name: none
>> diff --git a/include/net/handshake.h b/include/net/handshake.h
>> new file mode 100644
>> index 000000000000..08f859237936
>> --- /dev/null
>> +++ b/include/net/handshake.h
>> @@ -0,0 +1,45 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * Generic HANDSHAKE service.
>> + *
>> + * Author: Chuck Lever <chuck.lever@oracle.com>
>> + *
>> + * Copyright (c) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +/*
>> + * Data structures and functions that are visible only within the
>> + * kernel are declared here.
>> + */
>> +
>> +#ifndef _NET_HANDSHAKE_H
>> +#define _NET_HANDSHAKE_H
>> +
>> +struct handshake_req;
>> +
>> +/*
>> + * Invariants for all handshake requests for one transport layer
>> + * security protocol
>> + */
>> +struct handshake_proto {
>> +	int			hp_handler_class;
>> +	size_t			hp_privsize;
>> +
>> +	int			(*hp_accept)(struct handshake_req *req,
>> +					     struct genl_info *gi, int fd);
>> +	void			(*hp_done)(struct handshake_req *req,
>> +					   int status, struct nlattr **tb);
>> +	void			(*hp_destroy)(struct handshake_req *req);
>> +};
>> +
>> +extern struct handshake_req *
>> +handshake_req_alloc(struct socket *sock, const struct handshake_proto *proto,
>> +		    gfp_t flags);
>> +extern void *handshake_req_private(struct handshake_req *req);
>> +extern int handshake_req_submit(struct handshake_req *req, gfp_t flags);
>> +extern int handshake_req_cancel(struct socket *sock);
>> +
>> +extern struct nlmsghdr *handshake_genl_put(struct sk_buff *msg,
>> +					   struct genl_info *gi);
>> +
>> +#endif /* _NET_HANDSHAKE_H */
>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>> index 78beaa765c73..a0ce9de4dab1 100644
>> --- a/include/net/net_namespace.h
>> +++ b/include/net/net_namespace.h
>> @@ -188,6 +188,11 @@ struct net {
>>  #if IS_ENABLED(CONFIG_SMC)
>>  	struct netns_smc	smc;
>>  #endif
>> +
>> +	/* transport layer security handshake requests */
>> +	spinlock_t		hs_lock;
>> +	struct list_head	hs_requests;
>> +	int			hs_pending;
>>  } __randomize_layout;
>>    #include <linux/seq_file_net.h>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 573f2bf7e0de..2a7345ce2540 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -519,6 +519,7 @@ struct sock {
>>    	struct socket		*sk_socket;
>>  	void			*sk_user_data;
>> +	void			*sk_handshake_req;
>>  #ifdef CONFIG_SECURITY
>>  	void			*sk_security;
>>  #endif
>> diff --git a/include/trace/events/handshake.h b/include/trace/events/handshake.h
>> new file mode 100644
>> index 000000000000..feffcd1d6256
>> --- /dev/null
>> +++ b/include/trace/events/handshake.h
>> @@ -0,0 +1,159 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM handshake
>> +
>> +#if !defined(_TRACE_HANDSHAKE_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_HANDSHAKE_H
>> +
>> +#include <linux/net.h>
>> +#include <linux/tracepoint.h>
>> +
>> +DECLARE_EVENT_CLASS(handshake_event_class,
>> +	TP_PROTO(
>> +		const struct net *net,
>> +		const struct handshake_req *req,
>> +		const struct socket *sock
>> +	),
>> +	TP_ARGS(net, req, sock),
>> +	TP_STRUCT__entry(
>> +		__field(const void *, req)
>> +		__field(const void *, sock)
>> +		__field(unsigned int, netns_ino)
>> +	),
>> +	TP_fast_assign(
>> +		__entry->req = req;
>> +		__entry->sock = sock;
>> +		__entry->netns_ino = net->ns.inum;
>> +	),
>> +	TP_printk("req=%p sock=%p",
>> +		__entry->req, __entry->sock
>> +	)
>> +);
>> +#define DEFINE_HANDSHAKE_EVENT(name)				\
>> +	DEFINE_EVENT(handshake_event_class, name,		\
>> +		TP_PROTO(					\
>> +			const struct net *net,			\
>> +			const struct handshake_req *req,	\
>> +			const struct socket *sock		\
>> +		),						\
>> +		TP_ARGS(net, req, sock))
>> +
>> +DECLARE_EVENT_CLASS(handshake_fd_class,
>> +	TP_PROTO(
>> +		const struct net *net,
>> +		const struct handshake_req *req,
>> +		const struct socket *sock,
>> +		int fd
>> +	),
>> +	TP_ARGS(net, req, sock, fd),
>> +	TP_STRUCT__entry(
>> +		__field(const void *, req)
>> +		__field(const void *, sock)
>> +		__field(int, fd)
>> +		__field(unsigned int, netns_ino)
>> +	),
>> +	TP_fast_assign(
>> +		__entry->req = req;
>> +		__entry->sock = req->hr_sock;
>> +		__entry->fd = fd;
>> +		__entry->netns_ino = net->ns.inum;
>> +	),
>> +	TP_printk("req=%p sock=%p fd=%d",
>> +		__entry->req, __entry->sock, __entry->fd
>> +	)
>> +);
>> +#define DEFINE_HANDSHAKE_FD_EVENT(name)				\
>> +	DEFINE_EVENT(handshake_fd_class, name,			\
>> +		TP_PROTO(					\
>> +			const struct net *net,			\
>> +			const struct handshake_req *req,	\
>> +			const struct socket *sock,		\
>> +			int fd					\
>> +		),						\
>> +		TP_ARGS(net, req, sock, fd))
>> +
>> +DECLARE_EVENT_CLASS(handshake_error_class,
>> +	TP_PROTO(
>> +		const struct net *net,
>> +		const struct handshake_req *req,
>> +		const struct socket *sock,
>> +		int err
>> +	),
>> +	TP_ARGS(net, req, sock, err),
>> +	TP_STRUCT__entry(
>> +		__field(const void *, req)
>> +		__field(const void *, sock)
>> +		__field(int, err)
>> +		__field(unsigned int, netns_ino)
>> +	),
>> +	TP_fast_assign(
>> +		__entry->req = req;
>> +		__entry->sock = sock;
>> +		__entry->err = err;
>> +		__entry->netns_ino = net->ns.inum;
>> +	),
>> +	TP_printk("req=%p sock=%p err=%d",
>> +		__entry->req, __entry->sock, __entry->err
>> +	)
>> +);
>> +#define DEFINE_HANDSHAKE_ERROR(name)				\
>> +	DEFINE_EVENT(handshake_error_class, name,		\
>> +		TP_PROTO(					\
>> +			const struct net *net,			\
>> +			const struct handshake_req *req,	\
>> +			const struct socket *sock,		\
>> +			int err					\
>> +		),						\
>> +		TP_ARGS(net, req, sock, err))
>> +
>> +
>> +/**
>> + ** Request lifetime events
>> + **/
>> +
>> +DEFINE_HANDSHAKE_EVENT(handshake_submit);
>> +DEFINE_HANDSHAKE_ERROR(handshake_submit_err);
>> +DEFINE_HANDSHAKE_EVENT(handshake_cancel);
>> +DEFINE_HANDSHAKE_EVENT(handshake_cancel_none);
>> +DEFINE_HANDSHAKE_EVENT(handshake_cancel_busy);
>> +DEFINE_HANDSHAKE_EVENT(handshake_destruct);
>> +
>> +
>> +TRACE_EVENT(handshake_complete,
>> +	TP_PROTO(
>> +		const struct net *net,
>> +		const struct handshake_req *req,
>> +		const struct socket *sock,
>> +		int status
>> +	),
>> +	TP_ARGS(net, req, sock, status),
>> +	TP_STRUCT__entry(
>> +		__field(const void *, req)
>> +		__field(const void *, sock)
>> +		__field(int, status)
>> +		__field(unsigned int, netns_ino)
>> +	),
>> +	TP_fast_assign(
>> +		__entry->req = req;
>> +		__entry->sock = sock;
>> +		__entry->status = status;
>> +		__entry->netns_ino = net->ns.inum;
>> +	),
>> +	TP_printk("req=%p sock=%p status=%d",
>> +		__entry->req, __entry->sock, __entry->status
>> +	)
>> +);
>> +
>> +/**
>> + ** Netlink events
>> + **/
>> +
>> +DEFINE_HANDSHAKE_ERROR(handshake_notify_err);
>> +DEFINE_HANDSHAKE_FD_EVENT(handshake_cmd_accept);
>> +DEFINE_HANDSHAKE_ERROR(handshake_cmd_accept_err);
>> +DEFINE_HANDSHAKE_FD_EVENT(handshake_cmd_done);
>> +DEFINE_HANDSHAKE_ERROR(handshake_cmd_done_err);
>> +
>> +#endif /* _TRACE_HANDSHAKE_H */
>> +
>> +#include <trace/define_trace.h>
>> diff --git a/include/uapi/linux/handshake.h b/include/uapi/linux/handshake.h
>> new file mode 100644
>> index 000000000000..09fd7c37cba4
>> --- /dev/null
>> +++ b/include/uapi/linux/handshake.h
>> @@ -0,0 +1,63 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/* Do not edit directly, auto-generated from: */
>> +/*	Documentation/netlink/specs/handshake.yaml */
>> +/* YNL-GEN uapi header */
>> +
>> +#ifndef _UAPI_LINUX_HANDSHAKE_H
>> +#define _UAPI_LINUX_HANDSHAKE_H
>> +
>> +#define HANDSHAKE_FAMILY_NAME		"handshake"
>> +#define HANDSHAKE_FAMILY_VERSION	1
>> +
>> +enum {
>> +	HANDSHAKE_HANDLER_CLASS_NONE,
>> +};
>> +
>> +enum {
>> +	HANDSHAKE_MSG_TYPE_UNSPEC,
>> +	HANDSHAKE_MSG_TYPE_CLIENTHELLO,
>> +	HANDSHAKE_MSG_TYPE_SERVERHELLO,
>> +};
>> +
>> +enum {
>> +	HANDSHAKE_AUTH_UNSPEC,
>> +	HANDSHAKE_AUTH_UNAUTH,
>> +	HANDSHAKE_AUTH_X509,
>> +	HANDSHAKE_AUTH_PSK,
>> +};
>> +
>> +enum {
>> +	HANDSHAKE_A_ACCEPT_STATUS = 1,
>> +	HANDSHAKE_A_ACCEPT_SOCKFD,
>> +	HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
>> +	HANDSHAKE_A_ACCEPT_MESSAGE_TYPE,
>> +	HANDSHAKE_A_ACCEPT_AUTH,
>> +	HANDSHAKE_A_ACCEPT_GNUTLS_PRIORITIES,
>> +	HANDSHAKE_A_ACCEPT_MY_PEERID,
>> +	HANDSHAKE_A_ACCEPT_MY_PRIVKEY,
>> +
>> +	__HANDSHAKE_A_ACCEPT_MAX,
>> +	HANDSHAKE_A_ACCEPT_MAX = (__HANDSHAKE_A_ACCEPT_MAX - 1)
>> +};
>> +
>> +enum {
>> +	HANDSHAKE_A_DONE_STATUS = 1,
>> +	HANDSHAKE_A_DONE_SOCKFD,
>> +	HANDSHAKE_A_DONE_REMOTE_PEERID,
>> +
>> +	__HANDSHAKE_A_DONE_MAX,
>> +	HANDSHAKE_A_DONE_MAX = (__HANDSHAKE_A_DONE_MAX - 1)
>> +};
>> +
>> +enum {
>> +	HANDSHAKE_CMD_READY = 1,
>> +	HANDSHAKE_CMD_ACCEPT,
>> +	HANDSHAKE_CMD_DONE,
>> +
>> +	__HANDSHAKE_CMD_MAX,
>> +	HANDSHAKE_CMD_MAX = (__HANDSHAKE_CMD_MAX - 1)
>> +};
>> +
>> +#define HANDSHAKE_MCGRP_NONE	"none"
>> +
>> +#endif /* _UAPI_LINUX_HANDSHAKE_H */
>> diff --git a/net/Makefile b/net/Makefile
>> index 0914bea9c335..adbb64277601 100644
>> --- a/net/Makefile
>> +++ b/net/Makefile
>> @@ -79,3 +79,4 @@ obj-$(CONFIG_NET_NCSI)		+= ncsi/
>>  obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
>>  obj-$(CONFIG_MPTCP)		+= mptcp/
>>  obj-$(CONFIG_MCTP)		+= mctp/
>> +obj-y				+= handshake/
>> diff --git a/net/handshake/Makefile b/net/handshake/Makefile
>> new file mode 100644
>> index 000000000000..a41b03f4837b
>> --- /dev/null
>> +++ b/net/handshake/Makefile
>> @@ -0,0 +1,11 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +# Makefile for the Generic HANDSHAKE service
>> +#
>> +# Author: Chuck Lever <chuck.lever@oracle.com>
>> +#
>> +# Copyright (c) 2023, Oracle and/or its affiliates.
>> +#
>> +
>> +obj-y += handshake.o
>> +handshake-y := netlink.o request.o trace.o
>> diff --git a/net/handshake/handshake.h b/net/handshake/handshake.h
>> new file mode 100644
>> index 000000000000..366c7659ec09
>> --- /dev/null
>> +++ b/net/handshake/handshake.h
>> @@ -0,0 +1,41 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * Generic netlink handshake service
>> + *
>> + * Author: Chuck Lever <chuck.lever@oracle.com>
>> + *
>> + * Copyright (c) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +/*
>> + * Data structures and functions that are visible only within the
>> + * handshake module are declared here.
>> + */
>> +
>> +#ifndef _INTERNAL_HANDSHAKE_H
>> +#define _INTERNAL_HANDSHAKE_H
>> +
>> +/*
>> + * One handshake request
>> + */
>> +struct handshake_req {
>> +	struct list_head		hr_list;
>> +	unsigned long			hr_flags;
>> +	const struct handshake_proto	*hr_proto;
>> +	struct socket			*hr_sock;
>> +
>> +	void				(*hr_saved_destruct)(struct sock *sk);
>> +};
>> +
>> +#define HANDSHAKE_F_COMPLETED	BIT(0)
>> +
>> +/* netlink.c */
>> +extern bool handshake_genl_inited;
>> +int handshake_genl_notify(struct net *net, int handler_class, gfp_t flags);
>> +
>> +/* request.c */
>> +void __remove_pending_locked(struct net *net, struct handshake_req *req);
>> +void handshake_complete(struct handshake_req *req, int status,
>> +			struct nlattr **tb);
>> +
>> +#endif /* _INTERNAL_HANDSHAKE_H */
>> diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
>> new file mode 100644
>> index 000000000000..581e382236cf
>> --- /dev/null
>> +++ b/net/handshake/netlink.c
>> @@ -0,0 +1,340 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Generic netlink handshake service
>> + *
>> + * Author: Chuck Lever <chuck.lever@oracle.com>
>> + *
>> + * Copyright (c) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#include <linux/types.h>
>> +#include <linux/socket.h>
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/skbuff.h>
>> +#include <linux/inet.h>
>> +
>> +#include <net/sock.h>
>> +#include <net/genetlink.h>
>> +#include <net/handshake.h>
>> +
>> +#include <uapi/linux/handshake.h>
>> +#include <trace/events/handshake.h>
>> +#include "handshake.h"
>> +
>> +static struct genl_family __ro_after_init handshake_genl_family;
>> +bool handshake_genl_inited;
>> +
>> +/**
>> + * handshake_genl_notify - Notify handlers that a request is waiting
>> + * @net: target network namespace
>> + * @handler_class: target handler
>> + * @flags: memory allocation control flags
>> + *
>> + * Returns zero on success or a negative errno if notification failed.
>> + */
>> +int handshake_genl_notify(struct net *net, int handler_class, gfp_t flags)
>> +{
>> +	struct sk_buff *msg;
>> +	void *hdr;
>> +
>> +	if (!genl_has_listeners(&handshake_genl_family, net, handler_class))
>> +		return -ESRCH;
>> +
>> +	msg = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
>> +	if (!msg)
>> +		return -ENOMEM;
>> +
>> +	hdr = genlmsg_put(msg, 0, 0, &handshake_genl_family, 0,
>> +			  HANDSHAKE_CMD_READY);
>> +	if (!hdr)
>> +		goto out_free;
>> +
>> +	if (nla_put_u32(msg, HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
>> +			handler_class) < 0) {
>> +		genlmsg_cancel(msg, hdr);
>> +		goto out_free;
>> +	}
>> +
>> +	genlmsg_end(msg, hdr);
>> +	return genlmsg_multicast_netns(&handshake_genl_family, net, msg,
>> +				       0, handler_class, flags);
>> +
>> +out_free:
>> +	nlmsg_free(msg);
>> +	return -EMSGSIZE;
>> +}
>> +
>> +/**
>> + * handshake_genl_put - Create a generic netlink message header
>> + * @msg: buffer in which to create the header
>> + * @gi: generic netlink message context
>> + *
>> + * Returns a ready-to-use header, or NULL.
>> + */
>> +struct nlmsghdr *handshake_genl_put(struct sk_buff *msg, struct genl_info *gi)
>> +{
>> +	return genlmsg_put(msg, gi->snd_portid, gi->snd_seq,
>> +			   &handshake_genl_family, 0, gi->genlhdr->cmd);
>> +}
>> +EXPORT_SYMBOL(handshake_genl_put);
>> +
>> +static int handshake_status_reply(struct sk_buff *skb, struct genl_info *gi,
>> +				  int status)
>> +{
>> +	struct nlmsghdr *hdr;
>> +	struct sk_buff *msg;
>> +	int ret;
>> +
>> +	ret = -ENOMEM;
>> +	msg = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
>> +	if (!msg)
>> +		goto out;
>> +	hdr = handshake_genl_put(msg, gi);
>> +	if (!hdr)
>> +		goto out_free;
>> +
>> +	ret = -EMSGSIZE;
>> +	ret = nla_put_u32(msg, HANDSHAKE_A_ACCEPT_STATUS, status);
>> +	if (ret < 0)
>> +		goto out_free;
>> +
>> +	genlmsg_end(msg, hdr);
>> +	return genlmsg_reply(msg, gi);
>> +
>> +out_free:
>> +	genlmsg_cancel(msg, hdr);
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * dup() a kernel socket for use as a user space file descriptor
>> + * in the current process.
>> + *
>> + * Implicit argument: "current()"
>> + */
>> +static int handshake_dup(struct socket *kernsock)
>> +{
>> +	struct file *file = get_file(kernsock->file);
>> +	int newfd;
>> +
>> +	newfd = get_unused_fd_flags(O_CLOEXEC);
>> +	if (newfd < 0) {
>> +		fput(file);
>> +		return newfd;
>> +	}
>> +
>> +	fd_install(newfd, file);
>> +	return newfd;
>> +}
>> +
>> +static const struct nla_policy
>> +handshake_accept_nl_policy[HANDSHAKE_A_ACCEPT_HANDLER_CLASS + 1] = {
>> +	[HANDSHAKE_A_ACCEPT_HANDLER_CLASS] = { .type = NLA_U32, },
>> +};
>> +
>> +static int handshake_nl_accept_doit(struct sk_buff *skb, struct genl_info *gi)
>> +{
>> +	struct nlattr *tb[HANDSHAKE_A_ACCEPT_MAX + 1];
>> +	struct net *net = sock_net(skb->sk);
>> +	struct handshake_req *pos, *req;
>> +	int fd, err;
>> +
>> +	err = -EINVAL;
>> +	if (genlmsg_parse(nlmsg_hdr(skb), &handshake_genl_family, tb,
>> +			  HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
>> +			  handshake_accept_nl_policy, NULL))
>> +		goto out_status;
>> +	if (!tb[HANDSHAKE_A_ACCEPT_HANDLER_CLASS])
>> +		goto out_status;
>> +
>> +	req = NULL;
>> +	spin_lock(&net->hs_lock);
>> +	list_for_each_entry(pos, &net->hs_requests, hr_list) {
>> +		if (pos->hr_proto->hp_handler_class !=
>> +		    nla_get_u32(tb[HANDSHAKE_A_ACCEPT_HANDLER_CLASS]))
>> +			continue;
>> +		__remove_pending_locked(net, pos);
>> +		req = pos;
>> +		break;
>> +	}
>> +	spin_unlock(&net->hs_lock);
>> +	if (!req)
>> +		goto out_status;
>> +
>> +	fd = handshake_dup(req->hr_sock);
>> +	if (fd < 0) {
>> +		err = fd;
>> +		goto out_complete;
>> +	}
>> +	err = req->hr_proto->hp_accept(req, gi, fd);
>> +	if (err)
>> +		goto out_complete;
>> +
>> +	trace_handshake_cmd_accept(net, req, req->hr_sock, fd);
>> +	return 0;
>> +
>> +out_complete:
>> +	handshake_complete(req, -EIO, NULL);
>> +	fput(req->hr_sock->file);
>> +out_status:
>> +	trace_handshake_cmd_accept_err(net, req, NULL, err);
>> +	return handshake_status_reply(skb, gi, err);
>> +}
>> +
>> +static const struct nla_policy
>> +handshake_done_nl_policy[HANDSHAKE_A_DONE_MAX + 1] = {
>> +	[HANDSHAKE_A_DONE_SOCKFD] = { .type = NLA_U32, },
>> +	[HANDSHAKE_A_DONE_STATUS] = { .type = NLA_U32, },
>> +	[HANDSHAKE_A_DONE_REMOTE_PEERID] = { .type = NLA_U32, },
>> +};
>> +
>> +static int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *gi)
>> +{
>> +	struct nlattr *tb[HANDSHAKE_A_DONE_MAX + 1];
>> +	struct net *net = sock_net(skb->sk);
>> +	struct socket *sock = NULL;
>> +	struct handshake_req *req;
>> +	int fd, status, err;
>> +
>> +	err = genlmsg_parse(nlmsg_hdr(skb), &handshake_genl_family, tb,
>> +			    HANDSHAKE_A_DONE_MAX, handshake_done_nl_policy,
>> +			    NULL);
>> +	if (err || !tb[HANDSHAKE_A_DONE_SOCKFD]) {
>> +		err = -EINVAL;
>> +		goto out_status;
>> +	}
>> +
>> +	fd = nla_get_u32(tb[HANDSHAKE_A_DONE_SOCKFD]);
>> +
>> +	err = 0;
>> +	sock = sockfd_lookup(fd, &err);
>> +	if (err) {
>> +		err = -EBADF;
>> +		goto out_status;
>> +	}
>> +
>> +	req = sock->sk->sk_handshake_req;
>> +	if (!req) {
>> +		err = -EBUSY;
>> +		goto out_status;
>> +	}
>> +
>> +	trace_handshake_cmd_done(net, req, sock, fd);
>> +
>> +	status = -EIO;
>> +	if (tb[HANDSHAKE_A_DONE_STATUS])
>> +		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);
>> +
> And this makes me ever so slightly uneasy.
> 
> As 'status' is a netlink attribute it's inevitably defined as 'unsigned'.
> Yet we assume that 'status' is a negative number, leaving us _technically_ in uncharted territory.

Ah, that's an oversight.


> And that is notwithstanding the problem that we haven't even defined _what_ should be in the status attribute.

It's now an errno value.


> Reading the code I assume that it's either '0' for success or a negative number (ie the error code) on failure.
> Which implicitly means that we _never_ set a positive number here.
> So what would we lose if we declare 'status' to carry the _positive_ error number instead?
> It would bring us in-line with the actual netlink attribute definition, we wouldn't need to worry about possible integer overflows, yadda yadda...
> 
> Hmm?

It can also be argued that errnos in user space are positive-valued,
therefore, this user space visible protocol should use a positive
errno.
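
As a rough illustration of that convention (a sketch only, not code from
this series): the attribute would carry 0 or a positive errno, and the
kernel side would convert it to its usual negative form. The helper name
and the HS_MAX_ERRNO bound are made up for this example; the bound is
assumed to mirror the kernel's MAX_ERRNO.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bound, assumed to mirror the kernel's MAX_ERRNO. */
#define HS_MAX_ERRNO 4095u

/*
 * Convert an unsigned status attribute (0 on success, otherwise a
 * positive errno) into the 0 / negative-errno convention used inside
 * the kernel. Values that cannot be an errno map to -EIO.
 */
static int hs_status_to_errno(uint32_t status)
{
	if (status > HS_MAX_ERRNO)
		return -EIO;	/* out of range: report a generic failure */
	return -(int)status;
}
```

With this shape the u32 attribute never has to be reinterpreted as a
signed value, so the overflow concern goes away.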


>> +	handshake_complete(req, status, tb);
>> +	fput(sock->file);
>> +	return 0;
>> +
>> +out_status:
>> +	trace_handshake_cmd_done_err(net, req, sock, err);
>> +	return handshake_status_reply(skb, gi, err);
>> +}
>> +
>> +static const struct genl_split_ops handshake_nl_ops[] = {
>> +	{
>> +		.cmd		= HANDSHAKE_CMD_ACCEPT,
>> +		.doit		= handshake_nl_accept_doit,
>> +		.policy		= handshake_accept_nl_policy,
>> +		.maxattr	= HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
>> +		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>> +	},
>> +	{
>> +		.cmd		= HANDSHAKE_CMD_DONE,
>> +		.doit		= handshake_nl_done_doit,
>> +		.policy		= handshake_done_nl_policy,
>> +		.maxattr	= HANDSHAKE_A_DONE_REMOTE_PEERID,
>> +		.flags		= GENL_CMD_CAP_DO,
>> +	},
>> +};
>> +
>> +static const struct genl_multicast_group handshake_nl_mcgrps[] = {
>> +	[HANDSHAKE_HANDLER_CLASS_NONE] = { .name = HANDSHAKE_MCGRP_NONE, },
>> +};
>> +
>> +static struct genl_family __ro_after_init handshake_genl_family = {
>> +	.hdrsize		= 0,
>> +	.name			= HANDSHAKE_FAMILY_NAME,
>> +	.version		= HANDSHAKE_FAMILY_VERSION,
>> +	.netnsok		= true,
>> +	.parallel_ops		= true,
>> +	.n_mcgrps		= ARRAY_SIZE(handshake_nl_mcgrps),
>> +	.n_split_ops		= ARRAY_SIZE(handshake_nl_ops),
>> +	.split_ops		= handshake_nl_ops,
>> +	.mcgrps			= handshake_nl_mcgrps,
>> +	.module			= THIS_MODULE,
>> +};
>> +
>> +static int __net_init handshake_net_init(struct net *net)
>> +{
>> +	spin_lock_init(&net->hs_lock);
>> +	INIT_LIST_HEAD(&net->hs_requests);
>> +	net->hs_pending	= 0;
>> +	return 0;
>> +}
>> +
>> +static void __net_exit handshake_net_exit(struct net *net)
>> +{
>> +	struct handshake_req *req;
>> +	LIST_HEAD(requests);
>> +
>> +	/*
>> +	 * This drains the net's pending list. Requests that
>> +	 * have been accepted and are in progress will be
>> +	 * destroyed when the socket is closed.
>> +	 */
>> +	spin_lock(&net->hs_lock);
>> +	list_splice_init(&requests, &net->hs_requests);
>> +	spin_unlock(&net->hs_lock);
>> +
>> +	while (!list_empty(&requests)) {
>> +		req = list_first_entry(&requests, struct handshake_req, hr_list);
>> +		list_del(&req->hr_list);
>> +
>> +		/*
>> +		 * Requests on this list have not yet been
>> +		 * accepted, so they do not have an fd to put.
>> +		 */
>> +
>> +		handshake_complete(req, -ETIMEDOUT, NULL);
>> +	}
>> +}
>> +
>> +static struct pernet_operations handshake_genl_net_ops = {
>> +	.init		= handshake_net_init,
>> +	.exit		= handshake_net_exit,
>> +};
>> +
>> +static int __init handshake_init(void)
>> +{
>> +	int ret;
>> +
>> +	ret = genl_register_family(&handshake_genl_family);
>> +	if (ret) {
>> +		pr_warn("handshake: netlink registration failed (%d)\n", ret);
>> +		return ret;
>> +	}
>> +
>> +	ret = register_pernet_subsys(&handshake_genl_net_ops);
>> +	if (ret) {
>> +		pr_warn("handshake: pernet registration failed (%d)\n", ret);
>> +		genl_unregister_family(&handshake_genl_family);
>> +	}
>> +
>> +	handshake_genl_inited = true;
>> +	return ret;
>> +}
>> +
>> +static void __exit handshake_exit(void)
>> +{
>> +	unregister_pernet_subsys(&handshake_genl_net_ops);
>> +	genl_unregister_family(&handshake_genl_family);
>> +}
>> +
>> +module_init(handshake_init);
>> +module_exit(handshake_exit);
>> diff --git a/net/handshake/request.c b/net/handshake/request.c
>> new file mode 100644
>> index 000000000000..1d3b8e76dd2c
>> --- /dev/null
>> +++ b/net/handshake/request.c
>> @@ -0,0 +1,246 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Handshake request lifetime events
>> + *
>> + * Author: Chuck Lever <chuck.lever@oracle.com>
>> + *
>> + * Copyright (c) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#include <linux/types.h>
>> +#include <linux/socket.h>
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/skbuff.h>
>> +#include <linux/inet.h>
>> +#include <linux/fdtable.h>
>> +
>> +#include <net/sock.h>
>> +#include <net/genetlink.h>
>> +#include <net/handshake.h>
>> +
>> +#include <uapi/linux/handshake.h>
>> +#include <trace/events/handshake.h>
>> +#include "handshake.h"
>> +
>> +/*
>> + * This limit is to prevent slow remotes from causing denial of service.
>> + * A ulimit-style tunable might be used instead.
>> + */
>> +#define HANDSHAKE_PENDING_MAX (10)
>> +
>> +static void __add_pending_locked(struct net *net, struct handshake_req *req)
>> +{
>> +	net->hs_pending++;
>> +	list_add_tail(&req->hr_list, &net->hs_requests);
>> +}
>> +
>> +void __remove_pending_locked(struct net *net, struct handshake_req *req)
>> +{
>> +	net->hs_pending--;
>> +	list_del_init(&req->hr_list);
>> +}
>> +
>> +/*
>> + * Return values:
>> + *   %true - the request was found on @net's pending list
>> + *   %false - the request was not found on @net's pending list
>> + *
>> + * If @req was on a pending list, it has not yet been accepted.
>> + */
>> +static bool remove_pending(struct net *net, struct handshake_req *req)
>> +{
>> +	bool ret;
>> +
>> +	ret = false;
>> +
>> +	spin_lock(&net->hs_lock);
>> +	if (!list_empty(&req->hr_list)) {
>> +		__remove_pending_locked(net, req);
>> +		ret = true;
>> +	}
>> +	spin_unlock(&net->hs_lock);
>> +
>> +	return ret;
>> +}
>> +
>> +static void handshake_req_destroy(struct handshake_req *req, struct sock *sk)
>> +{
>> +	req->hr_proto->hp_destroy(req);
>> +	sk->sk_handshake_req = NULL;
>> +	kfree(req);
>> +}
>> +
>> +static void handshake_sk_destruct(struct sock *sk)
>> +{
>> +	struct handshake_req *req = sk->sk_handshake_req;
>> +
>> +	if (req) {
>> +		trace_handshake_destruct(sock_net(sk), req, req->hr_sock);
>> +		handshake_req_destroy(req, sk);
>> +	}
>> +}
>> +
>> +/**
>> + * handshake_req_alloc - consumer API to allocate a request
>> + * @sock: open socket on which to perform a handshake
>> + * @proto: security protocol
>> + * @flags: memory allocation flags
>> + *
>> + * Returns an initialized handshake_req or NULL.
>> + */
>> +struct handshake_req *handshake_req_alloc(struct socket *sock,
>> +					  const struct handshake_proto *proto,
>> +					  gfp_t flags)
>> +{
>> +	struct handshake_req *req;
>> +
>> +	/* Avoid accessing uninitialized global variables later on */
>> +	if (!handshake_genl_inited)
>> +		return NULL;
>> +
>> +	req = kzalloc(sizeof(*req) + proto->hp_privsize, flags);
>> +	if (!req)
>> +		return NULL;
>> +
>> +	sock_hold(sock->sk);
>> +
>> +	INIT_LIST_HEAD(&req->hr_list);
>> +	req->hr_sock = sock;
>> +	req->hr_proto = proto;
>> +	return req;
>> +}
>> +EXPORT_SYMBOL(handshake_req_alloc);
>> +
>> +/**
>> + * handshake_req_private - consumer API to return per-handshake private data
>> + * @req: handshake arguments
>> + *
>> + */
>> +void *handshake_req_private(struct handshake_req *req)
>> +{
>> +	return (void *)(req + 1);
>> +}
>> +EXPORT_SYMBOL(handshake_req_private);
>> +
>> +/**
>> + * handshake_req_submit - consumer API to submit a handshake request
>> + * @req: handshake arguments
>> + * @flags: memory allocation flags
>> + *
>> + * Return values:
>> + *   %0: Request queued
>> + *   %-EBUSY: A handshake is already under way for this socket
>> + *   %-ESRCH: No handshake agent is available
>> + *   %-EAGAIN: Too many pending handshake requests
>> + *   %-ENOMEM: Failed to allocate memory
>> + *   %-EMSGSIZE: Failed to construct notification message
>> + *
>> + * A zero return value from handshake_request() means that
>> + * exactly one subsequent completion callback is guaranteed.
>> + *
>> + * A negative return value from handshake_request() means that
>> + * no completion callback will be done and that @req is
>> + * destroyed.
>> + */
>> +int handshake_req_submit(struct handshake_req *req, gfp_t flags)
>> +{
>> +	struct socket *sock = req->hr_sock;
>> +	struct sock *sk = sock->sk;
>> +	struct net *net = sock_net(sk);
>> +	int ret;
>> +
>> +	ret = -EAGAIN;
>> +	if (READ_ONCE(net->hs_pending) >= HANDSHAKE_PENDING_MAX)
>> +		goto out_err;
>> +
>> +	ret = -EBUSY;
>> +	spin_lock(&net->hs_lock);
>> +	if (sk->sk_handshake_req || !list_empty(&req->hr_list)) {
>> +		spin_unlock(&net->hs_lock);
>> +		goto out_err;
>> +	}
>> +	req->hr_saved_destruct = sk->sk_destruct;
>> +	sk->sk_destruct = handshake_sk_destruct;
>> +	sk->sk_handshake_req = req;
>> +	__add_pending_locked(net, req);
>> +	spin_unlock(&net->hs_lock);
>> +
>> +	ret = handshake_genl_notify(net, req->hr_proto->hp_handler_class,
>> +				    flags);
>> +	if (ret) {
>> +		trace_handshake_notify_err(net, req, sock, ret);
>> +		if (remove_pending(net, req))
>> +			goto out_err;
>> +	}
>> +
>> +	trace_handshake_submit(net, req, sock);
>> +	return 0;
>> +
>> +out_err:
>> +	trace_handshake_submit_err(net, req, sock, ret);
>> +	handshake_req_destroy(req, sk);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(handshake_req_submit);
>> +
>> +void handshake_complete(struct handshake_req *req, int status,
>> +			struct nlattr **tb)
>> +{
>> +	struct socket *sock = req->hr_sock;
>> +	struct net *net = sock_net(sock->sk);
>> +
>> +	if (!test_and_set_bit(HANDSHAKE_F_COMPLETED, &req->hr_flags)) {
>> +		trace_handshake_complete(net, req, sock, status);
>> +		req->hr_proto->hp_done(req, status, tb);
>> +		__sock_put(sock->sk);
>> +	}
>> +}
>> +
>> +/**
>> + * handshake_req_cancel - consumer API to cancel an in-progress handshake
>> + * @sock: socket on which there is an ongoing handshake
>> + *
>> + * XXX: Perhaps killing the user space agent might also be necessary?
> 
> I thought we had agreed that we would be sending a signal to the userspace process?

We had discussed killing the handler, but I don't think it's necessary.
I'd rather not do something that drastic unless we have no other choice.
So far my testing hasn't shown a need for killing the child process.

I'm also concerned that the kernel could reuse the handler's process ID.
handshake_req_cancel would kill something that is not a handshake agent.


> Ideally we would be sending a SIGHUP, wait for some time on the userspace process to respond with a 'done' message, and send a 'KILL' signal if we haven't received one.
> 
> Obs: Sending a KILL signal would imply that userspace is able to cope with children dying. Which pretty much excludes pthreads, I would think.
> 
> Guess I'll have to consult Stevens :-)

Basically what cancel does is atomically disarm the "done" callback.

The socket belongs to the kernel, so it will live until the kernel is
good and through with it.


>> + *
>> + * Request cancellation races with request completion. To determine
>> + * who won, callers examine the return value from this function.
>> + *
>> + * Return values:
>> + *   %0 - Uncompleted handshake request was canceled or not found
>> + *   %-EBUSY - Handshake request already completed
> 
> EBUSY? Wouldn't EAGAIN be more appropriate?

I don't think EAGAIN would be appropriate at all. The situation
is that the handshake completed, so there's no need to call cancel
again. Its synonym, EWOULDBLOCK, is also not a good semantic fit.


> After all, the request is everything _but_ busy...

I'm open to suggestion.

One option is to use a boolean return value instead of an errno.
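
A minimal user-space sketch of what the boolean variant could look like,
using C11 atomic_flag in place of the kernel's test_and_set_bit(). The
struct and function names are hypothetical, and the pending-list handling
in handshake_req_cancel() is left out; only the "who won the race" part
is modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct handshake_req; only the flag matters. */
struct hs_req_sketch {
	atomic_flag completed;
};

/*
 * Returns true if the caller won the race and the "done" callback
 * will not run; false if the handshake had already completed.
 */
static bool hs_req_cancel_sketch(struct hs_req_sketch *req)
{
	if (!req)
		return true;	/* nothing to cancel */
	return !atomic_flag_test_and_set(&req->completed);
}

/* Completion side: true only for the first caller to set the flag. */
static bool hs_req_complete_sketch(struct hs_req_sketch *req)
{
	return !atomic_flag_test_and_set(&req->completed);
}
```

Whichever side sets the flag first wins, so a caller can tell from the
return value alone whether its completion callback may still have run.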


>> + */
>> +int handshake_req_cancel(struct socket *sock)
>> +{
>> +	struct handshake_req *req;
>> +	struct sock *sk;
>> +	struct net *net;
>> +
>> +	if (!sock)
>> +		return 0;
>> +
>> +	sk = sock->sk;
>> +	req = sk->sk_handshake_req;
>> +	net = sock_net(sk);
>> +
>> +	if (!req) {
>> +		trace_handshake_cancel_none(net, req, sock);
>> +		return 0;
>> +	}
>> +
>> +	if (remove_pending(net, req)) {
>> +		/* Request hadn't been accepted */
>> +		trace_handshake_cancel(net, req, sock);
>> +		return 0;
>> +	}
>> +	if (test_and_set_bit(HANDSHAKE_F_COMPLETED, &req->hr_flags)) {
>> +		/* Request already completed */
>> +		trace_handshake_cancel_busy(net, req, sock);
>> +		return -EBUSY;
>> +	}
>> +
>> +	__sock_put(sk);
>> +	trace_handshake_cancel(net, req, sock);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(handshake_req_cancel);
>> diff --git a/net/handshake/trace.c b/net/handshake/trace.c
>> new file mode 100644
>> index 000000000000..3a5b6f29a2b8
>> --- /dev/null
>> +++ b/net/handshake/trace.c
>> @@ -0,0 +1,17 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Trace points for transport security layer handshakes.
>> + *
>> + * Author: Chuck Lever <chuck.lever@oracle.com>
>> + *
>> + * Copyright (c) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#include <linux/types.h>
>> +#include <net/sock.h>
>> +
>> +#include "handshake.h"
>> +
>> +#define CREATE_TRACE_POINTS
>> +
>> +#include <trace/events/handshake.h>
> Cheers,
> 
> Hannes
> 
> 
> 

--
Chuck Lever
Hannes Reinecke Feb. 27, 2023, 3:14 p.m. UTC | #3
On 2/27/23 15:59, Chuck Lever III wrote:
> 
> 
>> On Feb 27, 2023, at 4:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 2/24/23 20:19, Chuck Lever wrote:
[ .. ]
>>> +	req = sock->sk->sk_handshake_req;
>>> +	if (!req) {
>>> +		err = -EBUSY;
>>> +		goto out_status;
>>> +	}
>>> +
>>> +	trace_handshake_cmd_done(net, req, sock, fd);
>>> +
>>> +	status = -EIO;
>>> +	if (tb[HANDSHAKE_A_DONE_STATUS])
>>> +		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);
>>> +
>> And this makes me ever so slightly uneasy.
>>
>> As 'status' is a netlink attribute it's inevitably defined as 'unsigned'.
>> Yet we assume that 'status' is a negative number, leaving us _technically_ in uncharted territory.
> 
> Ah, that's an oversight.
> 
> 
>> And that is notwithstanding the problem that we haven't even defined _what_ should be in the status attribute.
> 
> It's now an errno value.
> 
> 
>> Reading the code I assume that it's either '0' for success or a negative number (ie the error code) on failure.
>> Which implicitly means that we _never_ set a positive number here.
>> So what would we lose if we declare 'status' to carry the _positive_ error number instead?
>> It would bring us in-line with the actual netlink attribute definition, we wouldn't need
>> to worry about possible integer overflows, yadda yadda...
>>
>> Hmm?
> 
> It can also be argued that errnos in user space are positive-valued,
> therefore, this user space visible protocol should use a positive
> errno.
> 
> 
Thanks.

[ .. ]
>>> +
>>> +/**
>>> + * handshake_req_cancel - consumer API to cancel an in-progress handshake
>>> + * @sock: socket on which there is an ongoing handshake
>>> + *
>>> + * XXX: Perhaps killing the user space agent might also be necessary?
>>
>> I thought we had agreed that we would be sending a signal to the userspace process?
> 
> We had discussed killing the handler, but I don't think it's necessary.
> I'd rather not do something that drastic unless we have no other choice.
> So far my testing hasn't shown a need for killing the child process.
> 
> I'm also concerned that the kernel could reuse the handler's process ID.
> handshake_req_cancel would kill something that is not a handshake agent.
> 
Hmm? If that were the case, wouldn't we be sending the netlink message
to the wrong process, too?

And in the absence of any timeout handler: what do we do if userspace is 
stuck / doesn't make forward progress?
At one point TCP will timeout, and the client will close the connection.
Leaving us with (potentially) broken / stuck processes. Surely we would
need to initiate some cleanup here, no?

>> Ideally we would be sending a SIGHUP, wait for some time on the userspace
>> process to respond with a 'done' message, and send a 'KILL' signal if we
>> haven't received one.
>>
>> Obs: Sending a KILL signal would imply that userspace is able to cope with
>> children dying. Which pretty much excludes pthreads, I would think.
>>
>> Guess I'll have to consult Stevens :-)
> 
> Basically what cancel does is atomically disarm the "done" callback.
> 
> The socket belongs to the kernel, so it will live until the kernel is
> good and through with it.
> 
Oh, the socket does. But the process handling the socket does not.
So even if we close the socket from the kernel there's no guarantee that 
userspace will react to it.

Problem here is with using different key materials.
As the current handshake can only deal with one key at a time the only 
chance we have for several possible keys is to retry the handshake with 
the next key.
But out of necessity we have to use the _same_ connection (as tlshd 
doesn't control the socket). So we cannot close the socket, and hence we 
can't notify userspace to give up the handshake attempt.
Being able to send a signal would be simple; sending SIGHUP to 
userspace, and wait for the 'done' call.
If it doesn't come we can terminate all attempts.
But if we get the 'done' call we know it's safe to start with the next 
attempt.

> 
>>> + *
>>> + * Request cancellation races with request completion. To determine
>>> + * who won, callers examine the return value from this function.
>>> + *
>>> + * Return values:
>>> + *   %0 - Uncompleted handshake request was canceled or not found
>>> + *   %-EBUSY - Handshake request already completed
>>
>> EBUSY? Wouldn't EAGAIN be more appropriate?
> 
> I don't think EAGAIN would be appropriate at all. The situation
> is that the handshake completed, so there's no need to call cancel
> again. Its synonym, EWOULDBLOCK, is also not a good semantic fit.
> 
> 
>> After all, the request is everything _but_ busy...
> 
> I'm open to suggestion.
> 
> One option is to use a boolean return value instead of an errno.
> 
> 
Yeah, that's probably better.

BTW: thanks for the tracepoints!

Cheers,

Hannes
Chuck Lever III Feb. 27, 2023, 3:39 p.m. UTC | #4
> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 2/27/23 15:59, Chuck Lever III wrote:
>>> On Feb 27, 2023, at 4:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>> 
>>> On 2/24/23 20:19, Chuck Lever wrote:
> [ .. ]
>>>> +	req = sock->sk->sk_handshake_req;
>>>> +	if (!req) {
>>>> +		err = -EBUSY;
>>>> +		goto out_status;
>>>> +	}
>>>> +
>>>> +	trace_handshake_cmd_done(net, req, sock, fd);
>>>> +
>>>> +	status = -EIO;
>>>> +	if (tb[HANDSHAKE_A_DONE_STATUS])
>>>> +		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);
>>>> +
>>> And this makes me ever so slightly uneasy.
>>> 
>>> As 'status' is a netlink attribute it's inevitably defined as 'unsigned'.
>>> Yet we assume that 'status' is a negative number, leaving us _technically_ in uncharted territory.
>> Ah, that's an oversight.
>>> And that is notwithstanding the problem that we haven't even defined _what_ should be in the status attribute.
>> It's now an errno value.
>>> Reading the code I assume that it's either '0' for success or a negative number (ie the error code) on failure.
>>> Which implicitly means that we _never_ set a positive number here.
>>> So what would we lose if we declare 'status' to carry the _positive_ error number instead?
>>> It would bring us in-line with the actual netlink attribute definition, we wouldn't need
>>> to worry about possible integer overflows, yadda yadda...
>>> 
>>> Hmm?
>> It can also be argued that errnos in user space are positive-valued,
>> therefore, this user space visible protocol should use a positive
>> errno.
> Thanks.
> 
> [ .. ]
>>>> +
>>>> +/**
>>>> + * handshake_req_cancel - consumer API to cancel an in-progress handshake
>>>> + * @sock: socket on which there is an ongoing handshake
>>>> + *
>>>> + * XXX: Perhaps killing the user space agent might also be necessary?
>>> 
>>> I thought we had agreed that we would be sending a signal to the userspace process?
>> We had discussed killing the handler, but I don't think it's necessary.
>> I'd rather not do something that drastic unless we have no other choice.
>> So far my testing hasn't shown a need for killing the child process.
>> I'm also concerned that the kernel could reuse the handler's process ID.
>> handshake_req_cancel would kill something that is not a handshake agent.
> Hmm? If that were the case, wouldn't we be sending the netlink message to the
> wrong process, too?

Notifications go to anyone who is listening for handshake requests
and contain nothing but the handler class number, which says who is
to respond to the notification. It is up to those processes to send
an ACCEPT to the kernel, and then later a DONE.

So... listeners have to register to get notifications, and the
registration goes away as soon as the netlink socket is closed. That
is what the long-lived parent tlshd process does.

After notification, the handshake is driven entirely by the handshake
agent (the tlshd child process). The kernel is not otherwise sending
unsolicited netlink messages to anyone.

If you're concerned about the response messages that the kernel
sends back to the handshake agent... any new process would have to
have a netlink socket open, resolved to the HANDSHAKE family, and
it would have to recognize the message sequence ID in the response
message. Very very unlikely that all that would happen.


> And in the absence of any timeout handler: what do we do if userspace is stuck / doesn't make forward progress?
> At one point TCP will timeout, and the client will close the connection.
> Leaving us with (potentially) broken / stuck processes. Surely we would need to initiate some cleanup here, no?

I'm not sure. Test and see.

In my experience, one peer or the other closes the socket, and the
other follows suit. The handshake agent hits an error when it tries
to use the socket, and exits.


>>> Ideally we would be sending a SIGHUP, wait for some time on the userspace
>>> process to respond with a 'done' message, and send a 'KILL' signal if we
>>> haven't received one.
>>> 
>>> Obs: Sending a KILL signal would imply that userspace is able to cope with
>>> children dying. Which pretty much excludes pthreads, I would think.
>>> 
>>> Guess I'll have to consult Stevens :-)
>> Basically what cancel does is atomically disarm the "done" callback.
>> The socket belongs to the kernel, so it will live until the kernel is
>> good and through with it.
> Oh, the socket does. But the process handling the socket does not.
> So even if we close the socket from the kernel there's no guarantee that userspace will react to it.

If the kernel finishes first (ie, cancels and closes the socket,
as it is supposed to) the user space endpoint is dead. I don't
think it matters what the handshake agent does at that point,
although if this happens frequently, it might amount to a
resource leak.


> Problem here is with using different key materials.
> As the current handshake can only deal with one key at a time the only chance we have for several possible keys is to retry the handshake with the next key.
> But out of necessity we have to use the _same_ connection (as tlshd doesn't control the socket). So we cannot close the socket, and hence we can't notify userspace to give up the handshake attempt.
> Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call.
> If it doesn't come we can terminate all attempts.
> But if we get the 'done' call we know it's safe to start with the next attempt.

We solve this problem by enabling the kernel to provide all those
materials to tlshd in one go.

I don't think there's a "retry" situation here. Once the handshake
has failed, the client peer has to know to try again. That would
mean retrying would have to be part of the upper layer protocol.
Does an NVMe initiator know it has to drive another handshake if
the first one fails, or does it rely on the handshake itself to
try all available identities?

We don't have a choice but to provide all the keys at once and
let the handshake negotiation deal with it.

I'm working on DONE passing multiple remote peer IDs back to the
kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
the other way.

Note that currently the handshake upcall mechanism supports only
one handshake per socket lifetime, as the handshake_req is
released by the socket's sk_destruct callback.


>>>> + *
>>>> + * Request cancellation races with request completion. To determine
>>>> + * who won, callers examine the return value from this function.
>>>> + *
>>>> + * Return values:
>>>> + *   %0 - Uncompleted handshake request was canceled or not found
>>>> + *   %-EBUSY - Handshake request already completed
>>> 
>>> EBUSY? Wouldn't EAGAIN be more appropriate?
>> I don't think EAGAIN would be appropriate at all. The situation
>> is that the handshake completed, so there's no need to call cancel
>> again. Its synonym, EWOULDBLOCK, is also not a good semantic fit.
>>> After all, the request is everything _but_ busy...
>> I'm open to suggestion.
>> One option is to use a boolean return value instead of an errno.
> Yeah, that's probably better.
> 
> BTW: thanks for the tracepoints!
> 
> Cheers,
> 
> Hannes
> 

--
Chuck Lever
Hannes Reinecke Feb. 27, 2023, 5:21 p.m. UTC | #5
On 2/27/23 16:39, Chuck Lever III wrote:
> 
> 
>> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 2/27/23 15:59, Chuck Lever III wrote:
>>>> On Feb 27, 2023, at 4:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> On 2/24/23 20:19, Chuck Lever wrote:
>> [ .. ]
>>>>> +	req = sock->sk->sk_handshake_req;
>>>>> +	if (!req) {
>>>>> +		err = -EBUSY;
>>>>> +		goto out_status;
>>>>> +	}
>>>>> +
>>>>> +	trace_handshake_cmd_done(net, req, sock, fd);
>>>>> +
>>>>> +	status = -EIO;
>>>>> +	if (tb[HANDSHAKE_A_DONE_STATUS])
>>>>> +		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);
>>>>> +
>>>> And this makes me ever so slightly uneasy.
>>>>
>>>> As 'status' is a netlink attribute it's inevitably defined as 'unsigned'.
>>>> Yet we assume that 'status' is a negative number, leaving us _technically_ in uncharted territory.
>>> Ah, that's an oversight.
>>>> And that is notwithstanding the problem that we haven't even defined _what_ should be in the status attribute.
>>> It's now an errno value.
>>>> Reading the code I assume that it's either '0' for success or a negative number (ie the error code) on failure.
>>>> Which implicitly means that we _never_ set a positive number here.
>>>> So what would we lose if we declare 'status' to carry the _positive_ error number instead?
>>>> It would bring us in-line with the actual netlink attribute definition, we wouldn't need
>>>> to worry about possible integer overflows, yadda yadda...
>>>>
>>>> Hmm?
>>> It can also be argued that errnos in user space are positive-valued,
>>> therefore, this user space visible protocol should use a positive
>>> errno.
>> Thanks.
>>
>> [ .. ]
>>>>> +
>>>>> +/**
>>>>> + * handshake_req_cancel - consumer API to cancel an in-progress handshake
>>>>> + * @sock: socket on which there is an ongoing handshake
>>>>> + *
>>>>> + * XXX: Perhaps killing the user space agent might also be necessary?
>>>>
>>>> I thought we had agreed that we would be sending a signal to the userspace process?
>>> We had discussed killing the handler, but I don't think it's necessary.
>>> I'd rather not do something that drastic unless we have no other choice.
>>> So far my testing hasn't shown a need for killing the child process.
>>> I'm also concerned that the kernel could reuse the handler's process ID.
>>> handshake_req_cancel would kill something that is not a handshake agent.
>> Hmm? If that were the case, wouldn't we be sending the netlink message to the
>> wrong process, too?
> 
> Notifications go to anyone who is listening for handshake requests
> and contain nothing but the handler class number, which says who is
> to respond to the notification. It is up to those processes to send
> an ACCEPT to the kernel, and then later a DONE.
> 
> So... listeners have to register to get notifications, and the
> registration goes away as soon as the netlink socket is closed. That
> is what the long-lived parent tlshd process does.
> 
> After notification, the handshake is driven entirely by the handshake
> agent (the tlshd child process). The kernel is not otherwise sending
> unsolicited netlink messages to anyone.
> 
> If you're concerned about the response messages that the kernel
> sends back to the handshake agent... any new process would have to
> have a netlink socket open, resolved to the HANDSHAKE family, and
> it would have to recognize the message sequence ID in the response
> message. Very very unlikely that all that would happen.
> 
> 
Yes, agree.

>> And in the absence of any timeout handler: what do we do if userspace is stuck / doesn't make forward progress?
>> At one point TCP will timeout, and the client will close the connection.
>> Leaving us with (potentially) broken / stuck processes. Surely we would need to initiate some cleanup here, no?
> 
> I'm not sure. Test and see.
> 
> In my experience, one peer or the other closes the socket, and the
> other follows suit. The handshake agent hits an error when it tries
> to use the socket, and exits.
> 
> 
Hmm. Yes, if the other side closes the socket we'll have to follow suit.
I'm not sure, though, if a TLS timeout necessarily induces a connection
close. But okay, let's see how things pan out.

>>>> Ideally we would be sending a SIGHUP, wait for some time on the userspace
>>>> process to respond with a 'done' message, and send a 'KILL' signal if we
>>>> haven't received one.
>>>>
>>>> Obs: Sending a KILL signal would imply that userspace is able to cope with
>>>> children dying. Which pretty much excludes pthreads, I would think.
>>>>
>>>> Guess I'll have to consult Stevens :-)
>>> Basically what cancel does is atomically disarm the "done" callback.
>>> The socket belongs to the kernel, so it will live until the kernel is
>>> good and through with it.
>> Oh, the socket does. But the process handling the socket does not.
>> So even if we close the socket from the kernel there's no guarantee that userspace will react to it.
> 
> If the kernel finishes first (ie, cancels and closes the socket,
> as it is supposed to) the user space endpoint is dead. I don't
> think it matters what the handshake agent does at that point,
> although if this happens frequently, it might amount to a
> resource leak.
> 
> 
>> Problem here is with using different key materials.
>> As the current handshake can only deal with one key at a time
>> the only chance we have for several possible keys is to retry
>> the handshake with the next key.
>> But out of necessity we have to use the _same_ connection
>> (as tlshd doesn't control the socket). So we cannot close
>> the socket, and hence we can't notify userspace to give up the handshake attempt.
>> Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call.
>> If it doesn't come we can terminate all attempts.
>> But if we get the 'done' call we know it's safe to start with the next attempt.
> 
> We solve this problem by enabling the kernel to provide all those
> materials to tlshd in one go.
> 
Ah. Right, that would work, too; provide all possible keys to the 
'accept' call and let the userspace agent figure out what to do with 
them. That makes life certainly easier for the kernel side.

> I don't think there's a "retry" situation here. Once the handshake
> has failed, the client peer has to know to try again. That would
> mean retrying would have to be part of the upper layer protocol.
> Does an NVMe initiator know it has to drive another handshake if
> the first one fails, or does it rely on the handshake itself to
> try all available identities?
> 
> We don't have a choice but to provide all the keys at once and
> let the handshake negotiation deal with it.
> 
> I'm working on DONE passing multiple remote peer IDs back to the
> kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
> the other way.
> 
Nope. That's not required.
DONE can only ever have one peer id (TLS 1.3 specifies that the client 
sends a list of identities, the server picks one, and sends that one 
back to the client). So for DONE we will only ever have 1 peer ID.
If we allow for several peer IDs to be present in the client ACCEPT 
message then we'd need to include the resulting peer ID in the client 
DONE, too; otherwise we'll need it for the server DONE only.

So all in all I think we should be going with the multiple IDs in the 
ACCEPT call (ie move the key id from being part of the message into an 
attribute), and have a peer id present in the DONE call for both 
versions, server and client.

> Note that currently the handshake upcall mechanism supports only
> one handshake per socket lifetime, as the handshake_req is
> released by the socket's sk_destruct callback.
> 
Oh, that's fine; we'll have one socket per (nvme) connection anyway.

Cheers,

Hannes
Chuck Lever III Feb. 27, 2023, 6:10 p.m. UTC | #6
> On Feb 27, 2023, at 12:21 PM, Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 2/27/23 16:39, Chuck Lever III wrote:
>>> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@suse.de> wrote:
>>> 
>>> Problem here is with using different key materials.
>>> As the current handshake can only deal with one key at a time
>>> the only chance we have for several possible keys is to retry
>>> the handshake with the next key.
>>> But out of necessity we have to use the _same_ connection
>>> (as tlshd doesn't control the socket). So we cannot close
>>> the socket, and hence we can't notify userspace to give up the handshake attempt.
>>> Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call.
>>> If it doesn't come we can terminate all attempts.
>>> But if we get the 'done' call we know it's safe to start with the next attempt.
>> We solve this problem by enabling the kernel to provide all those
>> materials to tlshd in one go.
> Ah. Right, that would work, too; provide all possible keys to the 'accept' call and let the userspace agent figure out what to do with them. That makes life certainly easier for the kernel side.
> 
>> I don't think there's a "retry" situation here. Once the handshake
>> has failed, the client peer has to know to try again. That would
>> mean retrying would have to be part of the upper layer protocol.
>> Does an NVMe initiator know it has to drive another handshake if
>> the first one fails, or does it rely on the handshake itself to
>> try all available identities?
>> We don't have a choice but to provide all the keys at once and
>> let the handshake negotiation deal with it.
>> I'm working on DONE passing multiple remote peer IDs back to the
>> kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
>> the other way.
> Nope. That's not required.
> DONE can only ever have one peer id (TLS 1.3 specifies that the client sends a list of identities, the server picks one, and sends that one back to the client). So for DONE we will only ever have 1 peer ID.
> If we allow for several peer IDs to be present in the client ACCEPT message then we'd need to include the resulting peer ID in the client DONE, too; otherwise we'll need it for the server DONE only.
> 
> So all in all I think we should be going with the multiple IDs in the ACCEPT call (ie move the key id from being part of the message into an attribute), and have a peer id present in the DONE all for both versions, server and client.

To summarize:

---

The ACCEPT request (from tlshd) would have just the handler class
"Which handler is responding". The kernel uses that to find a
handshake request waiting for that type of handler. In our case,
"tlshd".

The ACCEPT response (from the kernel) would have the socket fd,
the handshake parameters, and zero or more peer ID key serial
numbers. (Today, just zero or one peer IDs).

There is also an errno status in the ACCEPT response, which
the kernel can use to indicate things like "no requests in that
class were found" or that the request was otherwise improperly
formed.

---

The DONE request (from tlshd) would have the socket fd (and
implicitly, the handler's PID), the session status, and zero
or one remote peer ID key serial numbers.

The DONE response (from the kernel) is an ACK. (Today it's
more than that, but that's broken and will be removed).

---

For the DONE request, the session status is one of:

0: session established -- see @peerid for authentication status
EIO: local error
EACCES: handshake rejected

For server handshake completion:

@peerid contains the remote peer ID if the session was
authenticated, or TLS_NO_PEERID if the session was not
authenticated.

status == EACCES if authentication material was present from
both peers but verification failed.

For client handshake completion:

@peerid contains the remote peer ID if authentication was
requested and the session was authenticated

status == EACCES if authentication was requested and the
session was not authenticated, or if verification failed.

(Maybe client could work like the server side, and the
kernel consumer would need to figure out if it cares
whether there was authentication).
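
The DONE status rules summarized above could be sketched on the consumer
side roughly as follows. This is only an illustration: done_classify() and
the done_outcome enum are hypothetical names, and the value given to
TLS_NO_PEERID is a placeholder, since its actual value is not stated in
this discussion. The sketch follows the server-side convention, where a
status of 0 together with @peerid tells whether the session was
authenticated.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder value: the real TLS_NO_PEERID is not given in this thread. */
#define TLS_NO_PEERID 0u

/* Hypothetical outcome codes for a kernel consumer of the DONE request. */
enum done_outcome {
	DONE_AUTHENTICATED,	/* status 0, remote peer ID present */
	DONE_UNAUTHENTICATED,	/* status 0, no remote peer ID */
	DONE_REJECTED,		/* EACCES: handshake rejected */
	DONE_LOCAL_ERROR,	/* EIO, or any status outside the agreed set */
};

/* Map the DONE session status and @peerid onto one outcome. */
static enum done_outcome done_classify(int status, unsigned int peerid)
{
	switch (status) {
	case 0:
		return peerid != TLS_NO_PEERID ? DONE_AUTHENTICATED
					       : DONE_UNAUTHENTICATED;
	case EACCES:
		return DONE_REJECTED;
	case EIO:
	default:
		/* Unknown values are treated like a local error here. */
		return DONE_LOCAL_ERROR;
	}
}
```

With this shape, a consumer that requires authentication would treat
DONE_UNAUTHENTICATED as a failure, while one that does not care could
accept it.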


Is that adequate?


--
Chuck Lever
Hannes Reinecke Feb. 28, 2023, 6:58 a.m. UTC | #7
On 2/27/23 19:10, Chuck Lever III wrote:
> 
> 
>> On Feb 27, 2023, at 12:21 PM, Hannes Reinecke <hare@suse.de> wrote:
>>
>>> On 2/27/23 16:39, Chuck Lever III wrote:
>>>> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> Problem here is with using different key materials.
>>>> As the current handshake can only deal with one key at a time
>>>> the only chance we have for several possible keys is to retry
>>>> the handshake with the next key.
>>>> But out of necessity we have to use the _same_ connection
>>>> (as tlshd doesn't control the socket). So we cannot close
>>>> the socket, and hence we can't notify userspace to give up the handshake attempt.
>>>> Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call.
>>>> If it doesn't come we can terminate all attempts.
>>>> But if we get the 'done' call we know it's safe to start with the next attempt.
>>> We solve this problem by enabling the kernel to provide all those
>>> materials to tlshd in one go.
>> Ah. Right, that would work, too; provide all possible keys to the
>> 'accept' call and let the userspace agent figure out what to do with
>> them. That makes life certainly easier for the kernel side.
>>
>>> I don't think there's a "retry" situation here. Once the handshake
>>> has failed, the client peer has to know to try again. That would
>>> mean retrying would have to be part of the upper layer protocol.
>>> Does an NVMe initiator know it has to drive another handshake if
>>> the first one fails, or does it rely on the handshake itself to
>>> try all available identities?
>>> We don't have a choice but to provide all the keys at once and
>>> let the handshake negotiation deal with it.
>>> I'm working on DONE passing multiple remote peer IDs back to the
>>> kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
>>> the other way.
>> Nope. That's not required.
>> DONE can only ever have one peer id (TLS 1.3 specifies that the client
>> sends a list of identities, the server picks one, and sends that one back
>> to the client). So for DONE we will only ever have 1 peer ID.
>> If we allow for several peer IDs to be present in the client ACCEPT message
>> then we'd need to include the resulting peer ID in the client DONE, too;
>> otherwise we'll need it for the server DONE only.
>>
>> So all in all I think we should be going with the multiple IDs in the
>> ACCEPT call (ie move the key id from being part of the message into an
>> attribute), and have a peer id present in the DONE all for both versions,
>> server and client.
> 
> To summarize:
> 
> ---
> 
> The ACCEPT request (from tlshd) would have just the handler class
> "Which handler is responding". The kernel uses that to find a
> handshake request waiting for that type of handler. In our case,
> "tlshd".
> 
> The ACCEPT response (from the kernel) would have the socket fd,
> the handshake parameters, and zero or more peer ID key serial
> numbers. (Today, just zero or one peer IDs).
> 
> There is also an errno status in the ACCEPT response, which
> the kernel can use to indicate things like "no requests in that
> class were found" or that the request was otherwise improperly
> formed.
> 
> ---
> 
> The DONE request (from tlshd) would have the socket fd (and
> implicitly, the handler's PID), the session status, and zero
> or one remote peer ID key serial numbers.
> 
> The DONE response (from the kernel) is an ACK. (Today it's
> more than that, but that's broken and will be removed).
> 
> ---
> 
> For the DONE request, the session status is one of:
> 
> 0: session established -- see @peerid for authentication status
> EIO: local error
> EACCES: handshake rejected
> 
> For server handshake completion:
> 
> @peerid contains the remote peer ID if the session was
> authenticated, or TLS_NO_PEERID if the session was not
> authenticated.
> 
> status == EACCES if authentication material was present from
> both peers but verification failed.
> 
> For client handshake completion:
> 
> @peerid contains the remote peer ID if authentication was
> requested and the session was authenticated
> 
> status == EACCES if authentication was requested and the
> session was not authenticated, or if verification failed.
> 
> (Maybe client could work like the server side, and the
> kernel consumer would need to figure out if it cares
> whether there was authentication).
> 
Yes, that would be my preference. Always return @peerid
for DONE if the TLS session was established.
We might also consider returning @peerid with EACCES
to indicate the offending ID.

> 
> Is that adequate?
> 
Yes, it is.

So the only bone of contention is the timeout; as we won't
be implementing signals I still think that we should have
a 'timeout' attribute, if only to feed the TLS timeout
parameter for gnutls ...

Cheers,

Hannes
Chuck Lever III Feb. 28, 2023, 2:28 p.m. UTC | #8
> On Feb 28, 2023, at 1:58 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 2/27/23 19:10, Chuck Lever III wrote:
>>> On Feb 27, 2023, at 12:21 PM, Hannes Reinecke <hare@suse.de> wrote:
>>> 
>>>> On 2/27/23 16:39, Chuck Lever III wrote:
>>>>> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>> 
>>>>> Problem here is with using different key materials.
>>>>> As the current handshake can only deal with one key at a time
>>>>> the only chance we have for several possible keys is to retry
>>>>> the handshake with the next key.
>>>>> But out of necessity we have to use the _same_ connection
>>>>> (as tlshd doesn't control the socket). So we cannot close
>>>>> the socket, and hence we can't notify userspace to give up the handshake attempt.
>>>>> Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call.
>>>>> If it doesn't come we can terminate all attempts.
>>>>> But if we get the 'done' call we know it's safe to start with the next attempt.
>>>> We solve this problem by enabling the kernel to provide all those
>>>> materials to tlshd in one go.
>>> Ah. Right, that would work, too; provide all possible keys to the
>>> 'accept' call and let the userspace agent figure out what to do with
>>> them. That makes life certainly easier for the kernel side.
>>> 
>>>> I don't think there's a "retry" situation here. Once the handshake
>>>> has failed, the client peer has to know to try again. That would
>>>> mean retrying would have to be part of the upper layer protocol.
>>>> Does an NVMe initiator know it has to drive another handshake if
>>>> the first one fails, or does it rely on the handshake itself to
>>>> try all available identities?
>>>> We don't have a choice but to provide all the keys at once and
>>>> let the handshake negotiation deal with it.
>>>> I'm working on DONE passing multiple remote peer IDs back to the
>>>> kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
>>>> the other way.
>>> Nope. That's not required.
>>> DONE can only ever have one peer id (TLS 1.3 specifies that the client
>>> sends a list of identities, the server picks one, and sends that one back
>>> to the client). So for DONE we will only ever have 1 peer ID.
>>> If we allow for several peer IDs to be present in the client ACCEPT message
>>> then we'd need to include the resulting peer ID in the client DONE, too;
>>> otherwise we'll need it for the server DONE only.
>>> 
>>> So all in all I think we should be going with the multiple IDs in the
>>> ACCEPT call (ie move the key id from being part of the message into an
>>> attribute), and have a peer id present in the DONE all for both versions,
>>> server and client.
>> To summarize:
>> ---
>> The ACCEPT request (from tlshd) would have just the handler class
>> "Which handler is responding". The kernel uses that to find a
>> handshake request waiting for that type of handler. In our case,
>> "tlshd".
>> The ACCEPT response (from the kernel) would have the socket fd,
>> the handshake parameters, and zero or more peer ID key serial
>> numbers. (Today, just zero or one peer IDs).
>> There is also an errno status in the ACCEPT response, which
>> the kernel can use to indicate things like "no requests in that
>> class were found" or that the request was otherwise improperly
>> formed.
>> ---
>> The DONE request (from tlshd) would have the socket fd (and
>> implicitly, the handler's PID), the session status, and zero
>> or one remote peer ID key serial numbers.
>> The DONE response (from the kernel) is an ACK. (Today it's
>> more than that, but that's broken and will be removed).
>> ---
>> For the DONE request, the session status is one of:
>> 0: session established -- see @peerid for authentication status
>> EIO: local error
>> EACCES: handshake rejected
>> For server handshake completion:
>> @peerid contains the remote peer ID if the session was
>> authenticated, or TLS_NO_PEERID if the session was not
>> authenticated.
>> status == EACCES if authentication material was present from
>> both peers but verification failed.
>> For client handshake completion:
>> @peerid contains the remote peer ID if authentication was
>> requested and the session was authenticated
>> status == EACCES if authentication was requested and the
>> session was not authenticated, or if verification failed.
>> (Maybe client could work like the server side, and the
>> kernel consumer would need to figure out if it cares
>> whether there was authentication).
> Yes, that would be my preference. Always return @peerid
> for DONE if the TLS session was established.

You mean if the TLS session was authenticated. The server
won't receive a remote peer identity if the client peer
doesn't authenticate.


> We might also consider returning @peerid with EACCES
> to indicate the offending ID.

I'll look into that.


>> Is that adequate?
> Yes, it is.

What about the narrow set of DONE status values? You've
recently wanted to add ENOMEM, ENOKEY, and EINVAL to
this set. My experience is that these status values are
nearly always obscured before they can get back to the
requesting user.

Can the kernel make use of ENOMEM, for example? It might
be able to retry, I suppose... retrying is not sensible
for the server side.


> So the only bone of contention is the timeout; as we won't
> be implementing signals I still think that we should have
> a 'timeout' attribute. And if only to feed the TLS timeout
> parameter for gnutls ...

I'm still not seeing the case for making it an individual
parameter for each handshake request. Maybe a config
parameter, if a short timeout is actually needed... even
then, maybe a built-in timeout is preferable to yet another
tuning knob that can be abused.

I'd like to see some testing results to determine that a
short timeout is the only way to handle corner cases.


--
Chuck Lever
Hannes Reinecke Feb. 28, 2023, 3:48 p.m. UTC | #9
On 2/28/23 15:28, Chuck Lever III wrote:
> 
> 
>> On Feb 28, 2023, at 1:58 AM, Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 2/27/23 19:10, Chuck Lever III wrote:
>>>> On Feb 27, 2023, at 12:21 PM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>>> On 2/27/23 16:39, Chuck Lever III wrote:
>>>>>> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>>>
>>>>>> Problem here is with using different key materials.
>>>>>> As the current handshake can only deal with one key at a time
>>>>>> the only chance we have for several possible keys is to retry
>>>>>> the handshake with the next key.
>>>>>> But out of necessity we have to use the _same_ connection
>>>>>> (as tlshd doesn't control the socket). So we cannot close
>>>>>> the socket, and hence we can't notify userspace to give up the handshake attempt.
>>>>>> Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call.
>>>>>> If it doesn't come we can terminate all attempts.
>>>>>> But if we get the 'done' call we know it's safe to start with the next attempt.
>>>>> We solve this problem by enabling the kernel to provide all those
>>>>> materials to tlshd in one go.
>>>> Ah. Right, that would work, too; provide all possible keys to the
>>>> 'accept' call and let the userspace agent figure out what to do with
>>>> them. That makes life certainly easier for the kernel side.
>>>>
>>>>> I don't think there's a "retry" situation here. Once the handshake
>>>>> has failed, the client peer has to know to try again. That would
>>>>> mean retrying would have to be part of the upper layer protocol.
>>>>> Does an NVMe initiator know it has to drive another handshake if
>>>>> the first one fails, or does it rely on the handshake itself to
>>>>> try all available identities?
>>>>> We don't have a choice but to provide all the keys at once and
>>>>> let the handshake negotiation deal with it.
>>>>> I'm working on DONE passing multiple remote peer IDs back to the
>>>>> kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
>>>>> the other way.
>>>> Nope. That's not required.
>>>> DONE can only ever have one peer id (TLS 1.3 specifies that the client
>>>> sends a list of identities, the server picks one, and sends that one back
>>>> to the client). So for DONE we will only ever have 1 peer ID.
>>>> If we allow for several peer IDs to be present in the client ACCEPT message
>>>> then we'd need to include the resulting peer ID in the client DONE, too;
>>>> otherwise we'll need it for the server DONE only.
>>>>
>>>> So all in all I think we should be going with the multiple IDs in the
>>>> ACCEPT call (ie move the key id from being part of the message into an
>>>> attribute), and have a peer id present in the DONE all for both versions,
>>>> server and client.
>>> To summarize:
>>> ---
>>> The ACCEPT request (from tlshd) would have just the handler class
>>> "Which handler is responding". The kernel uses that to find a
>>> handshake request waiting for that type of handler. In our case,
>>> "tlshd".
>>> The ACCEPT response (from the kernel) would have the socket fd,
>>> the handshake parameters, and zero or more peer ID key serial
>>> numbers. (Today, just zero or one peer IDs).
>>> There is also an errno status in the ACCEPT response, which
>>> the kernel can use to indicate things like "no requests in that
>>> class were found" or that the request was otherwise improperly
>>> formed.
>>> ---
>>> The DONE request (from tlshd) would have the socket fd (and
>>> implicitly, the handler's PID), the session status, and zero
>>> or one remote peer ID key serial numbers.
>>> The DONE response (from the kernel) is an ACK. (Today it's
>>> more than that, but that's broken and will be removed).
>>> ---
>>> For the DONE request, the session status is one of:
>>> 0: session established -- see @peerid for authentication status
>>> EIO: local error
>>> EACCES: handshake rejected
>>> For server handshake completion:
>>> @peerid contains the remote peer ID if the session was
>>> authenticated, or TLS_NO_PEERID if the session was not
>>> authenticated.
>>> status == EACCES if authentication material was present from
>>> both peers but verification failed.
>>> For client handshake completion:
>>> @peerid contains the remote peer ID if authentication was
>>> requested and the session was authenticated
>>> status == EACCES if authentication was requested and the
>>> session was not authenticated, or if verification failed.
>>> (Maybe client could work like the server side, and the
>>> kernel consumer would need to figure out if it cares
>>> whether there was authentication).
>> Yes, that would be my preference. Always return @peerid
>> for DONE if the TLS session was established.
> 
> You mean if the TLS session was authenticated. The server
> won't receive a remote peer identity if the client peer
> doesn't authenticate.
> 
Ah, yes, forgot about that.
(PSK always 'authenticates', as the identity is what is used to
find the appropriate PSK ...)

> 
>> We might also consider returning @peerid with EACCES
>> to indicate the offending ID.
> 
> I'll look into that.
> 
> 
>>> Is that adequate?
>> Yes, it is.
> 
> What about the narrow set of DONE status values? You've
> recently wanted to add ENOMEM, ENOKEY, and EINVAL to
> this set. My experience is that these status values are
> nearly always obscured before they can get back to the
> requesting user.
> 
> Can the kernel make use of ENOMEM, for example? It might
> be able to retry, I suppose... retrying is not sensible
> for the server side.
> 
The usual problem: Retry or no retry.
Sadly, error numbers are not a good indicator of that.
Maybe we should take the NVMe approach and add a _different_
attribute indicating whether this particular error status
should be retried.

> 
>> So the only bone of contention is the timeout; as we won't
>> be implementing signals I still think that we should have
>> a 'timeout' attribute. And if only to feed the TLS timeout
>> parameter for gnutls ...
> 
> I'm still not seeing the case for making it an individual
> parameter for each handshake request. Maybe a config
> parameter, if a short timeout is actually needed... even
> then, maybe a built-in timeout is preferable to yet another
> tuning knob that can be abused.
> 
The problem I see is that the kernel-side needs to make forward
progress eventually, and calling into userspace is a good recipe
for violating that principle.
Sending a timeout value as a netlink parameter has the advantage
that both sides are aware that there _is_ a timeout.
The alternative would be an unconditional wait in the kernel,
and a very real possibility of a stuck process.

> I'd like to see some testing results to determine that a
> short timeout is the only way to handle corner cases.
> 
Short timeouts are especially useful for testing and debugging;
timeout handlers are prone to issues, and hence need a really good
bashing to hash out issues.
And not having a timeout is also not a good idea, see above.

But yeah, in theory we could use a configuration timeout in tlshd.

In the end, it's _just_ another netlink attribute, which might
(or might not) be present. Which replaces a built-in value.
I hadn't thought this to be such an issue ...

Cheers,

Hannes
Chuck Lever III Feb. 28, 2023, 4:01 p.m. UTC | #10
> On Feb 28, 2023, at 10:48 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 2/28/23 15:28, Chuck Lever III wrote:
>>> On Feb 28, 2023, at 1:58 AM, Hannes Reinecke <hare@suse.de> wrote:
>>> 
>>> On 2/27/23 19:10, Chuck Lever III wrote:
>>> 
>> What about the narrow set of DONE status values? You've
>> recently wanted to add ENOMEM, ENOKEY, and EINVAL to
>> this set. My experience is that these status values are
>> nearly always obscured before they can get back to the
>> requesting user.
>> Can the kernel make use of ENOMEM, for example? It might
>> be able to retry, I suppose... retrying is not sensible
>> for the server side.
> The usual problem: Retry or no retry.
> Sadly error numbers are no good indicator to that.
> Maybe we should take the NVMe approach and add a _different_
> attribute indicating whether this particular error status
> should be retried.

ENOMEM is obviously temporary. The others are permanent
errors. This is handled simply via a tiny protocol
specification, which I can add near tls_handshake_done().
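
As a rough sketch of what such a tiny specification might boil down to:
ENOMEM is treated as temporary and every other error as permanent. The
helper names below are hypothetical, and the client-only retry policy is
an assumption drawn from the earlier remark that retrying is not sensible
for the server side.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* ENOMEM is temporary; all other DONE error statuses are permanent. */
static bool handshake_status_is_temporary(int status)
{
	return status == ENOMEM;
}

/*
 * Hypothetical policy helper: only a client-side consumer retries,
 * and only when the reported error is temporary.
 */
static bool handshake_should_retry(int status, bool is_client)
{
	return is_client && handshake_status_is_temporary(status);
}
```

A separate "retry" attribute would not be needed under this scheme, since
the status value alone determines the answer.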


>>> So the only bone of contention is the timeout; as we won't
>>> be implementing signals I still think that we should have
>>> a 'timeout' attribute. And if only to feed the TLS timeout
>>> parameter for gnutls ...
>> I'm still not seeing the case for making it an individual
>> parameter for each handshake request. Maybe a config
>> parameter, if a short timeout is actually needed... even
>> then, maybe a built-in timeout is preferable to yet another
>> tuning knob that can be abused.
> The problem I see is that the kernel-side needs to make forward
> progress eventually, and calling into userspace is a good recipe
> of violating that principle.

That's why RPC-with-TLS uses wait-interruptible-timeout.


> Sending a timeout value as a netlink parameter has the advantage
> the both sides are aware that there _is_ a timeout.
> The alternative would be an unconditional wait in the kernel,
> and a very real possibility of a stuck process.

I'm not following you. Why isn't wait-interruptible-timeout
in the kernel adequate?


>> I'd like to see some testing results to determine that a
>> short timeout is the only way to handle corner cases.
> Short timeouts are especially useful for testing and debugging;
> timeout handlers are prone to issues, and hence need a really good
> bashing to hash out issues.
> And not having a timeout is also not a good idea, see above.

RPC-with-TLS has a timeout. The kernel is in complete control
of it. After a few seconds, the kernel abandons the handshake
attempt and closes the socket. It doesn't care what the handler
agent does at that point.
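
As a userspace analogy only (this is not the kernel's wait API, and
wait_for_done() is a hypothetical name), the bounded wait described above
behaves roughly like a poll() with a timeout: the waiter gives up after a
fixed interval regardless of what the other side does, and a signal also
ends the wait early.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for a completion byte to arrive on done_fd.
 * Returns 0 when data is ready, -ETIMEDOUT when the peer never answered,
 * -EPIPE when the peer went away without answering, and -errno on other
 * failures (including -EINTR when a signal interrupts the wait).
 */
static int wait_for_done(int done_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = done_fd, .events = POLLIN };
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -errno;
	if (ret == 0)
		return -ETIMEDOUT;
	if (pfd.revents & POLLIN)
		return 0;
	return -EPIPE;
}
```

In this analogy the timeout lives entirely on the waiting side; the other
end does not need to know its value for the waiter to make forward
progress.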


> But yeah, in theory we could use a configuration timeout in tlshd.
> 
> In the end, it's _just_ another netlink attribute, which might
> (or might not) be present. Which replaces a built-in value.
> I hadn't thought this to be such an issue ...

It's an issue because you have not identified a particular
corner case (via reproducer) where user and kernel have to
agree on exactly the same timeout value, and it might be
different per-request.

Show me one, and I will agree to add it. So far, I haven't
seen sufficient justification.


--
Chuck Lever
Patch

diff --git a/Documentation/netlink/specs/handshake.yaml b/Documentation/netlink/specs/handshake.yaml
new file mode 100644
index 000000000000..683a8f2df0a7
--- /dev/null
+++ b/Documentation/netlink/specs/handshake.yaml
@@ -0,0 +1,134 @@ 
+# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+#
+# GENL HANDSHAKE service.
+#
+# Author: Chuck Lever <chuck.lever@oracle.com>
+#
+# Copyright (c) 2023, Oracle and/or its affiliates.
+#
+
+name: handshake
+
+protocol: genetlink-c
+
+doc: Netlink protocol to request a transport layer security handshake.
+
+uapi-header: linux/net/handshake.h
+
+definitions:
+  -
+    type: enum
+    name: handler-class
+    enum-name:
+    value-start: 0
+    entries: [ none ]
+  -
+    type: enum
+    name: msg-type
+    enum-name:
+    value-start: 0
+    entries: [ unspec, clienthello, serverhello ]
+  -
+    type: enum
+    name: auth
+    enum-name:
+    value-start: 0
+    entries: [ unspec, unauth, x509, psk ]
+
+attribute-sets:
+  -
+    name: accept
+    attributes:
+      -
+        name: status
+        doc: Status of this accept operation
+        type: u32
+        value: 1
+      -
+        name: sockfd
+        doc: File descriptor of socket to use
+        type: u32
+      -
+        name: handler-class
+        doc: Which type of handler is responding
+        type: u32
+        enum: handler-class
+      -
+        name: message-type
+        doc: Handshake message type
+        type: u32
+        enum: msg-type
+      -
+        name: auth
+        doc: Authentication mode
+        type: u32
+        enum: auth
+      -
+        name: gnutls-priorities
+        doc: GnuTLS priority string
+        type: string
+      -
+        name: my-peerid
+        doc: Serial no of key containing local identity
+        type: u32
+      -
+        name: my-privkey
+        doc: Serial no of key containing optional private key
+        type: u32
+  -
+    name: done
+    attributes:
+      -
+        name: status
+        doc: Session status
+        type: u32
+        value: 1
+      -
+        name: sockfd
+        doc: File descriptor of socket that has completed
+        type: u32
+      -
+        name: remote-peerid
+        doc: Serial no of keys containing identities of remote peer
+        type: u32
+
+operations:
+  list:
+    -
+      name: ready
+      doc: Notify handlers that a new handshake request is waiting
+      value: 1
+      notify: accept
+    -
+      name: accept
+      doc: Handler retrieves next queued handshake request
+      attribute-set: accept
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - handler-class
+        reply:
+          attributes:
+            - status
+            - sockfd
+            - message-type
+            - auth
+            - gnutls-priorities
+            - my-peerid
+            - my-privkey
+    -
+      name: done
+      doc: Handler reports handshake completion
+      attribute-set: done
+      do:
+        request:
+          attributes:
+            - status
+            - sockfd
+            - remote-peerid
+
+mcast-groups:
+  list:
+    -
+      name: none
diff --git a/include/net/handshake.h b/include/net/handshake.h
new file mode 100644
index 000000000000..08f859237936
--- /dev/null
+++ b/include/net/handshake.h
@@ -0,0 +1,45 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic HANDSHAKE service.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are visible only within the
+ * kernel are declared here.
+ */
+
+#ifndef _NET_HANDSHAKE_H
+#define _NET_HANDSHAKE_H
+
+struct handshake_req;
+
+/*
+ * Invariants for all handshake requests for one transport layer
+ * security protocol
+ */
+struct handshake_proto {
+	int			hp_handler_class;
+	size_t			hp_privsize;
+
+	int			(*hp_accept)(struct handshake_req *req,
+					     struct genl_info *gi, int fd);
+	void			(*hp_done)(struct handshake_req *req,
+					   int status, struct nlattr **tb);
+	void			(*hp_destroy)(struct handshake_req *req);
+};
+
+struct handshake_req *
+handshake_req_alloc(struct socket *sock, const struct handshake_proto *proto,
+		    gfp_t flags);
+void *handshake_req_private(struct handshake_req *req);
+int handshake_req_submit(struct handshake_req *req, gfp_t flags);
+int handshake_req_cancel(struct socket *sock);
+
+struct nlmsghdr *handshake_genl_put(struct sk_buff *msg,
+				    struct genl_info *gi);
+
+#endif /* _NET_HANDSHAKE_H */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 78beaa765c73..a0ce9de4dab1 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -188,6 +188,11 @@  struct net {
 #if IS_ENABLED(CONFIG_SMC)
 	struct netns_smc	smc;
 #endif
+
+	/* transport layer security handshake requests */
+	spinlock_t		hs_lock;
+	struct list_head	hs_requests;
+	int			hs_pending;
 } __randomize_layout;
 
 #include <linux/seq_file_net.h>
diff --git a/include/net/sock.h b/include/net/sock.h
index 573f2bf7e0de..2a7345ce2540 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -519,6 +519,7 @@  struct sock {
 
 	struct socket		*sk_socket;
 	void			*sk_user_data;
+	void			*sk_handshake_req;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
 #endif
diff --git a/include/trace/events/handshake.h b/include/trace/events/handshake.h
new file mode 100644
index 000000000000..feffcd1d6256
--- /dev/null
+++ b/include/trace/events/handshake.h
@@ -0,0 +1,159 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM handshake
+
+#if !defined(_TRACE_HANDSHAKE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HANDSHAKE_H
+
+#include <linux/net.h>
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(handshake_event_class,
+	TP_PROTO(
+		const struct net *net,
+		const struct handshake_req *req,
+		const struct socket *sock
+	),
+	TP_ARGS(net, req, sock),
+	TP_STRUCT__entry(
+		__field(const void *, req)
+		__field(const void *, sock)
+		__field(unsigned int, netns_ino)
+	),
+	TP_fast_assign(
+		__entry->req = req;
+		__entry->sock = sock;
+		__entry->netns_ino = net->ns.inum;
+	),
+	TP_printk("req=%p sock=%p",
+		__entry->req, __entry->sock
+	)
+);
+#define DEFINE_HANDSHAKE_EVENT(name)				\
+	DEFINE_EVENT(handshake_event_class, name,		\
+		TP_PROTO(					\
+			const struct net *net,			\
+			const struct handshake_req *req,	\
+			const struct socket *sock		\
+		),						\
+		TP_ARGS(net, req, sock))
+
+DECLARE_EVENT_CLASS(handshake_fd_class,
+	TP_PROTO(
+		const struct net *net,
+		const struct handshake_req *req,
+		const struct socket *sock,
+		int fd
+	),
+	TP_ARGS(net, req, sock, fd),
+	TP_STRUCT__entry(
+		__field(const void *, req)
+		__field(const void *, sock)
+		__field(int, fd)
+		__field(unsigned int, netns_ino)
+	),
+	TP_fast_assign(
+		__entry->req = req;
+		__entry->sock = sock;
+		__entry->fd = fd;
+		__entry->netns_ino = net->ns.inum;
+	),
+	TP_printk("req=%p sock=%p fd=%d",
+		__entry->req, __entry->sock, __entry->fd
+	)
+);
+#define DEFINE_HANDSHAKE_FD_EVENT(name)				\
+	DEFINE_EVENT(handshake_fd_class, name,			\
+		TP_PROTO(					\
+			const struct net *net,			\
+			const struct handshake_req *req,	\
+			const struct socket *sock,		\
+			int fd					\
+		),						\
+		TP_ARGS(net, req, sock, fd))
+
+DECLARE_EVENT_CLASS(handshake_error_class,
+	TP_PROTO(
+		const struct net *net,
+		const struct handshake_req *req,
+		const struct socket *sock,
+		int err
+	),
+	TP_ARGS(net, req, sock, err),
+	TP_STRUCT__entry(
+		__field(const void *, req)
+		__field(const void *, sock)
+		__field(int, err)
+		__field(unsigned int, netns_ino)
+	),
+	TP_fast_assign(
+		__entry->req = req;
+		__entry->sock = sock;
+		__entry->err = err;
+		__entry->netns_ino = net->ns.inum;
+	),
+	TP_printk("req=%p sock=%p err=%d",
+		__entry->req, __entry->sock, __entry->err
+	)
+);
+#define DEFINE_HANDSHAKE_ERROR(name)				\
+	DEFINE_EVENT(handshake_error_class, name,		\
+		TP_PROTO(					\
+			const struct net *net,			\
+			const struct handshake_req *req,	\
+			const struct socket *sock,		\
+			int err					\
+		),						\
+		TP_ARGS(net, req, sock, err))
+
+
+/*
+ * Request lifetime events
+ */
+
+DEFINE_HANDSHAKE_EVENT(handshake_submit);
+DEFINE_HANDSHAKE_ERROR(handshake_submit_err);
+DEFINE_HANDSHAKE_EVENT(handshake_cancel);
+DEFINE_HANDSHAKE_EVENT(handshake_cancel_none);
+DEFINE_HANDSHAKE_EVENT(handshake_cancel_busy);
+DEFINE_HANDSHAKE_EVENT(handshake_destruct);
+
+
+TRACE_EVENT(handshake_complete,
+	TP_PROTO(
+		const struct net *net,
+		const struct handshake_req *req,
+		const struct socket *sock,
+		int status
+	),
+	TP_ARGS(net, req, sock, status),
+	TP_STRUCT__entry(
+		__field(const void *, req)
+		__field(const void *, sock)
+		__field(int, status)
+		__field(unsigned int, netns_ino)
+	),
+	TP_fast_assign(
+		__entry->req = req;
+		__entry->sock = sock;
+		__entry->status = status;
+		__entry->netns_ino = net->ns.inum;
+	),
+	TP_printk("req=%p sock=%p status=%d",
+		__entry->req, __entry->sock, __entry->status
+	)
+);
+
+/*
+ * Netlink events
+ */
+
+DEFINE_HANDSHAKE_ERROR(handshake_notify_err);
+DEFINE_HANDSHAKE_FD_EVENT(handshake_cmd_accept);
+DEFINE_HANDSHAKE_ERROR(handshake_cmd_accept_err);
+DEFINE_HANDSHAKE_FD_EVENT(handshake_cmd_done);
+DEFINE_HANDSHAKE_ERROR(handshake_cmd_done_err);
+
+#endif /* _TRACE_HANDSHAKE_H */
+
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/handshake.h b/include/uapi/linux/handshake.h
new file mode 100644
index 000000000000..09fd7c37cba4
--- /dev/null
+++ b/include/uapi/linux/handshake.h
@@ -0,0 +1,63 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/handshake.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_HANDSHAKE_H
+#define _UAPI_LINUX_HANDSHAKE_H
+
+#define HANDSHAKE_FAMILY_NAME		"handshake"
+#define HANDSHAKE_FAMILY_VERSION	1
+
+enum {
+	HANDSHAKE_HANDLER_CLASS_NONE,
+};
+
+enum {
+	HANDSHAKE_MSG_TYPE_UNSPEC,
+	HANDSHAKE_MSG_TYPE_CLIENTHELLO,
+	HANDSHAKE_MSG_TYPE_SERVERHELLO,
+};
+
+enum {
+	HANDSHAKE_AUTH_UNSPEC,
+	HANDSHAKE_AUTH_UNAUTH,
+	HANDSHAKE_AUTH_X509,
+	HANDSHAKE_AUTH_PSK,
+};
+
+enum {
+	HANDSHAKE_A_ACCEPT_STATUS = 1,
+	HANDSHAKE_A_ACCEPT_SOCKFD,
+	HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
+	HANDSHAKE_A_ACCEPT_MESSAGE_TYPE,
+	HANDSHAKE_A_ACCEPT_AUTH,
+	HANDSHAKE_A_ACCEPT_GNUTLS_PRIORITIES,
+	HANDSHAKE_A_ACCEPT_MY_PEERID,
+	HANDSHAKE_A_ACCEPT_MY_PRIVKEY,
+
+	__HANDSHAKE_A_ACCEPT_MAX,
+	HANDSHAKE_A_ACCEPT_MAX = (__HANDSHAKE_A_ACCEPT_MAX - 1)
+};
+
+enum {
+	HANDSHAKE_A_DONE_STATUS = 1,
+	HANDSHAKE_A_DONE_SOCKFD,
+	HANDSHAKE_A_DONE_REMOTE_PEERID,
+
+	__HANDSHAKE_A_DONE_MAX,
+	HANDSHAKE_A_DONE_MAX = (__HANDSHAKE_A_DONE_MAX - 1)
+};
+
+enum {
+	HANDSHAKE_CMD_READY = 1,
+	HANDSHAKE_CMD_ACCEPT,
+	HANDSHAKE_CMD_DONE,
+
+	__HANDSHAKE_CMD_MAX,
+	HANDSHAKE_CMD_MAX = (__HANDSHAKE_CMD_MAX - 1)
+};
+
+#define HANDSHAKE_MCGRP_NONE	"none"
+
+#endif /* _UAPI_LINUX_HANDSHAKE_H */
diff --git a/net/Makefile b/net/Makefile
index 0914bea9c335..adbb64277601 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -79,3 +79,4 @@  obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
 obj-$(CONFIG_MPTCP)		+= mptcp/
 obj-$(CONFIG_MCTP)		+= mctp/
+obj-y				+= handshake/
diff --git a/net/handshake/Makefile b/net/handshake/Makefile
new file mode 100644
index 000000000000..a41b03f4837b
--- /dev/null
+++ b/net/handshake/Makefile
@@ -0,0 +1,11 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for the Generic HANDSHAKE service
+#
+# Author: Chuck Lever <chuck.lever@oracle.com>
+#
+# Copyright (c) 2023, Oracle and/or its affiliates.
+#
+
+obj-y += handshake.o
+handshake-y := netlink.o request.o trace.o
diff --git a/net/handshake/handshake.h b/net/handshake/handshake.h
new file mode 100644
index 000000000000..366c7659ec09
--- /dev/null
+++ b/net/handshake/handshake.h
@@ -0,0 +1,41 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic netlink handshake service
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+/*
+ * Data structures and functions that are visible only within the
+ * handshake module are declared here.
+ */
+
+#ifndef _INTERNAL_HANDSHAKE_H
+#define _INTERNAL_HANDSHAKE_H
+
+/*
+ * One handshake request
+ */
+struct handshake_req {
+	struct list_head		hr_list;
+	unsigned long			hr_flags;
+	const struct handshake_proto	*hr_proto;
+	struct socket			*hr_sock;
+
+	void				(*hr_saved_destruct)(struct sock *sk);
+};
+
+#define HANDSHAKE_F_COMPLETED	BIT(0)
+
+/* netlink.c */
+extern bool handshake_genl_inited;
+int handshake_genl_notify(struct net *net, int handler_class, gfp_t flags);
+
+/* request.c */
+void __remove_pending_locked(struct net *net, struct handshake_req *req);
+void handshake_complete(struct handshake_req *req, int status,
+			struct nlattr **tb);
+
+#endif /* _INTERNAL_HANDSHAKE_H */
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
new file mode 100644
index 000000000000..581e382236cf
--- /dev/null
+++ b/net/handshake/netlink.c
@@ -0,0 +1,340 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Generic netlink handshake service
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/inet.h>
+
+#include <net/sock.h>
+#include <net/genetlink.h>
+#include <net/handshake.h>
+
+#include <uapi/linux/handshake.h>
+#include <trace/events/handshake.h>
+#include "handshake.h"
+
+static struct genl_family __ro_after_init handshake_genl_family;
+bool handshake_genl_inited;
+
+/**
+ * handshake_genl_notify - Notify handlers that a request is waiting
+ * @net: target network namespace
+ * @handler_class: target handler
+ * @flags: memory allocation control flags
+ *
+ * Returns zero on success or a negative errno if notification failed.
+ */
+int handshake_genl_notify(struct net *net, int handler_class, gfp_t flags)
+{
+	struct sk_buff *msg;
+	void *hdr;
+
+	if (!genl_has_listeners(&handshake_genl_family, net, handler_class))
+		return -ESRCH;
+
+	msg = genlmsg_new(GENLMSG_DEFAULT_SIZE, flags);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_put(msg, 0, 0, &handshake_genl_family, 0,
+			  HANDSHAKE_CMD_READY);
+	if (!hdr)
+		goto out_free;
+
+	if (nla_put_u32(msg, HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
+			handler_class) < 0) {
+		genlmsg_cancel(msg, hdr);
+		goto out_free;
+	}
+
+	genlmsg_end(msg, hdr);
+	return genlmsg_multicast_netns(&handshake_genl_family, net, msg,
+				       0, handler_class, flags);
+
+out_free:
+	nlmsg_free(msg);
+	return -EMSGSIZE;
+}
+
+/**
+ * handshake_genl_put - Create a generic netlink message header
+ * @msg: buffer in which to create the header
+ * @gi: generic netlink message context
+ *
+ * Returns a ready-to-use header, or NULL.
+ */
+struct nlmsghdr *handshake_genl_put(struct sk_buff *msg, struct genl_info *gi)
+{
+	return genlmsg_put(msg, gi->snd_portid, gi->snd_seq,
+			   &handshake_genl_family, 0, gi->genlhdr->cmd);
+}
+EXPORT_SYMBOL(handshake_genl_put);
+
+static int handshake_status_reply(struct sk_buff *skb, struct genl_info *gi,
+				  int status)
+{
+	struct nlmsghdr *hdr;
+	struct sk_buff *msg;
+	int ret;
+
+	ret = -ENOMEM;
+	msg = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		goto out;
+	ret = -EMSGSIZE;
+	hdr = handshake_genl_put(msg, gi);
+	if (!hdr)
+		goto out_free;
+
+	ret = nla_put_u32(msg, HANDSHAKE_A_ACCEPT_STATUS, status);
+	if (ret < 0)
+		goto out_free;
+
+	genlmsg_end(msg, hdr);
+	return genlmsg_reply(msg, gi);
+
+out_free:
+	nlmsg_free(msg);
+out:
+	return ret;
+}
+
+/*
+ * dup() a kernel socket for use as a user space file descriptor
+ * in the current process.
+ *
+ * Implicit argument: "current"
+ */
+static int handshake_dup(struct socket *kernsock)
+{
+	struct file *file = get_file(kernsock->file);
+	int newfd;
+
+	newfd = get_unused_fd_flags(O_CLOEXEC);
+	if (newfd < 0) {
+		fput(file);
+		return newfd;
+	}
+
+	fd_install(newfd, file);
+	return newfd;
+}
+
+static const struct nla_policy
+handshake_accept_nl_policy[HANDSHAKE_A_ACCEPT_HANDLER_CLASS + 1] = {
+	[HANDSHAKE_A_ACCEPT_HANDLER_CLASS] = { .type = NLA_U32, },
+};
+
+static int handshake_nl_accept_doit(struct sk_buff *skb, struct genl_info *gi)
+{
+	struct nlattr *tb[HANDSHAKE_A_ACCEPT_MAX + 1];
+	struct net *net = sock_net(skb->sk);
+	struct handshake_req *pos, *req = NULL;
+	int fd, err;
+
+	err = -EINVAL;
+	if (genlmsg_parse(nlmsg_hdr(skb), &handshake_genl_family, tb,
+			  HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
+			  handshake_accept_nl_policy, NULL))
+		goto out_status;
+	if (!tb[HANDSHAKE_A_ACCEPT_HANDLER_CLASS])
+		goto out_status;
+
+	req = NULL;
+	spin_lock(&net->hs_lock);
+	list_for_each_entry(pos, &net->hs_requests, hr_list) {
+		if (pos->hr_proto->hp_handler_class !=
+		    nla_get_u32(tb[HANDSHAKE_A_ACCEPT_HANDLER_CLASS]))
+			continue;
+		__remove_pending_locked(net, pos);
+		req = pos;
+		break;
+	}
+	spin_unlock(&net->hs_lock);
+	if (!req)
+		goto out_status;
+
+	fd = handshake_dup(req->hr_sock);
+	if (fd < 0) {
+		err = fd;
+		goto out_complete;
+	}
+	err = req->hr_proto->hp_accept(req, gi, fd);
+	if (err)
+		goto out_complete;
+
+	trace_handshake_cmd_accept(net, req, req->hr_sock, fd);
+	return 0;
+
+out_complete:
+	handshake_complete(req, -EIO, NULL);
+	fput(req->hr_sock->file);
+out_status:
+	trace_handshake_cmd_accept_err(net, req, NULL, err);
+	return handshake_status_reply(skb, gi, err);
+}
+
+static const struct nla_policy
+handshake_done_nl_policy[HANDSHAKE_A_DONE_MAX + 1] = {
+	[HANDSHAKE_A_DONE_SOCKFD] = { .type = NLA_U32, },
+	[HANDSHAKE_A_DONE_STATUS] = { .type = NLA_U32, },
+	[HANDSHAKE_A_DONE_REMOTE_PEERID] = { .type = NLA_U32, },
+};
+
+static int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *gi)
+{
+	struct nlattr *tb[HANDSHAKE_A_DONE_MAX + 1];
+	struct net *net = sock_net(skb->sk);
+	struct socket *sock = NULL;
+	struct handshake_req *req = NULL;
+	int fd, status, err;
+
+	err = genlmsg_parse(nlmsg_hdr(skb), &handshake_genl_family, tb,
+			    HANDSHAKE_A_DONE_MAX, handshake_done_nl_policy,
+			    NULL);
+	if (err || !tb[HANDSHAKE_A_DONE_SOCKFD]) {
+		err = -EINVAL;
+		goto out_status;
+	}
+
+	fd = nla_get_u32(tb[HANDSHAKE_A_DONE_SOCKFD]);
+
+	err = 0;
+	sock = sockfd_lookup(fd, &err);
+	if (err) {
+		err = -EBADF;
+		goto out_status;
+	}
+
+	req = sock->sk->sk_handshake_req;
+	if (!req) {
+		err = -EBUSY;
+		goto out_status;
+	}
+
+	trace_handshake_cmd_done(net, req, sock, fd);
+
+	status = -EIO;
+	if (tb[HANDSHAKE_A_DONE_STATUS])
+		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);
+
+	handshake_complete(req, status, tb);
+	fput(sock->file);
+	return 0;
+
+out_status:
+	trace_handshake_cmd_done_err(net, req, sock, err);
+	return handshake_status_reply(skb, gi, err);
+}
+
+static const struct genl_split_ops handshake_nl_ops[] = {
+	{
+		.cmd		= HANDSHAKE_CMD_ACCEPT,
+		.doit		= handshake_nl_accept_doit,
+		.policy		= handshake_accept_nl_policy,
+		.maxattr	= HANDSHAKE_A_ACCEPT_HANDLER_CLASS,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
+	{
+		.cmd		= HANDSHAKE_CMD_DONE,
+		.doit		= handshake_nl_done_doit,
+		.policy		= handshake_done_nl_policy,
+		.maxattr	= HANDSHAKE_A_DONE_REMOTE_PEERID,
+		.flags		= GENL_CMD_CAP_DO,
+	},
+};
+
+static const struct genl_multicast_group handshake_nl_mcgrps[] = {
+	[HANDSHAKE_HANDLER_CLASS_NONE] = { .name = HANDSHAKE_MCGRP_NONE, },
+};
+
+static struct genl_family __ro_after_init handshake_genl_family = {
+	.hdrsize		= 0,
+	.name			= HANDSHAKE_FAMILY_NAME,
+	.version		= HANDSHAKE_FAMILY_VERSION,
+	.netnsok		= true,
+	.parallel_ops		= true,
+	.n_mcgrps		= ARRAY_SIZE(handshake_nl_mcgrps),
+	.n_split_ops		= ARRAY_SIZE(handshake_nl_ops),
+	.split_ops		= handshake_nl_ops,
+	.mcgrps			= handshake_nl_mcgrps,
+	.module			= THIS_MODULE,
+};
+
+static int __net_init handshake_net_init(struct net *net)
+{
+	spin_lock_init(&net->hs_lock);
+	INIT_LIST_HEAD(&net->hs_requests);
+	net->hs_pending	= 0;
+	return 0;
+}
+
+static void __net_exit handshake_net_exit(struct net *net)
+{
+	struct handshake_req *req;
+	LIST_HEAD(requests);
+
+	/*
+	 * This drains the net's pending list. Requests that
+	 * have been accepted and are in progress will be
+	 * destroyed when the socket is closed.
+	 */
+	spin_lock(&net->hs_lock);
+	list_splice_init(&net->hs_requests, &requests);
+	spin_unlock(&net->hs_lock);
+
+	while (!list_empty(&requests)) {
+		req = list_first_entry(&requests, struct handshake_req, hr_list);
+		list_del(&req->hr_list);
+
+		/*
+		 * Requests on this list have not yet been
+		 * accepted, so they do not have an fd to put.
+		 */
+
+		handshake_complete(req, -ETIMEDOUT, NULL);
+	}
+}
+
+static struct pernet_operations handshake_genl_net_ops = {
+	.init		= handshake_net_init,
+	.exit		= handshake_net_exit,
+};
+
+static int __init handshake_init(void)
+{
+	int ret;
+
+	ret = genl_register_family(&handshake_genl_family);
+	if (ret) {
+		pr_warn("handshake: netlink registration failed (%d)\n", ret);
+		return ret;
+	}
+
+	ret = register_pernet_subsys(&handshake_genl_net_ops);
+	if (ret) {
+		pr_warn("handshake: pernet registration failed (%d)\n", ret);
+		genl_unregister_family(&handshake_genl_family);
+		return ret;
+	}
+	handshake_genl_inited = true;
+	return ret;
+}
+
+static void __exit handshake_exit(void)
+{
+	unregister_pernet_subsys(&handshake_genl_net_ops);
+	genl_unregister_family(&handshake_genl_family);
+}
+
+module_init(handshake_init);
+module_exit(handshake_exit);
diff --git a/net/handshake/request.c b/net/handshake/request.c
new file mode 100644
index 000000000000..1d3b8e76dd2c
--- /dev/null
+++ b/net/handshake/request.c
@@ -0,0 +1,246 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Handshake request lifetime events
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/inet.h>
+#include <linux/fdtable.h>
+
+#include <net/sock.h>
+#include <net/genetlink.h>
+#include <net/handshake.h>
+
+#include <uapi/linux/handshake.h>
+#include <trace/events/handshake.h>
+#include "handshake.h"
+
+/*
+ * This limit is to prevent slow remotes from causing denial of service.
+ * A ulimit-style tunable might be used instead.
+ */
+#define HANDSHAKE_PENDING_MAX (10)
+
+static void __add_pending_locked(struct net *net, struct handshake_req *req)
+{
+	net->hs_pending++;
+	list_add_tail(&req->hr_list, &net->hs_requests);
+}
+
+void __remove_pending_locked(struct net *net, struct handshake_req *req)
+{
+	net->hs_pending--;
+	list_del_init(&req->hr_list);
+}
+
+/*
+ * Return values:
+ *   %true - the request was found on @net's pending list
+ *   %false - the request was not found on @net's pending list
+ *
+ * If @req was on a pending list, it has not yet been accepted.
+ */
+static bool remove_pending(struct net *net, struct handshake_req *req)
+{
+	bool ret;
+
+	ret = false;
+
+	spin_lock(&net->hs_lock);
+	if (!list_empty(&req->hr_list)) {
+		__remove_pending_locked(net, req);
+		ret = true;
+	}
+	spin_unlock(&net->hs_lock);
+
+	return ret;
+}
+
+static void handshake_req_destroy(struct handshake_req *req, struct sock *sk)
+{
+	req->hr_proto->hp_destroy(req);
+	sk->sk_handshake_req = NULL;
+	kfree(req);
+}
+
+static void handshake_sk_destruct(struct sock *sk)
+{
+	struct handshake_req *req = sk->sk_handshake_req;
+
+	if (req) {
+		trace_handshake_destruct(sock_net(sk), req, req->hr_sock);
+		handshake_req_destroy(req, sk);
+	}
+}
+
+/**
+ * handshake_req_alloc - consumer API to allocate a request
+ * @sock: open socket on which to perform a handshake
+ * @proto: security protocol
+ * @flags: memory allocation flags
+ *
+ * Returns an initialized handshake_req or NULL.
+ */
+struct handshake_req *handshake_req_alloc(struct socket *sock,
+					  const struct handshake_proto *proto,
+					  gfp_t flags)
+{
+	struct handshake_req *req;
+
+	/* Avoid accessing uninitialized global variables later on */
+	if (!handshake_genl_inited)
+		return NULL;
+
+	req = kzalloc(sizeof(*req) + proto->hp_privsize, flags);
+	if (!req)
+		return NULL;
+
+	sock_hold(sock->sk);
+
+	INIT_LIST_HEAD(&req->hr_list);
+	req->hr_sock = sock;
+	req->hr_proto = proto;
+	return req;
+}
+EXPORT_SYMBOL(handshake_req_alloc);
+
+/**
+ * handshake_req_private - consumer API to return per-handshake private data
+ * @req: handshake arguments
+ * Returns a pointer to the hp_privsize bytes of private data that follow @req.
+ */
+void *handshake_req_private(struct handshake_req *req)
+{
+	return (void *)(req + 1);
+}
+EXPORT_SYMBOL(handshake_req_private);
+
+/**
+ * handshake_req_submit - consumer API to submit a handshake request
+ * @req: handshake arguments
+ * @flags: memory allocation flags
+ *
+ * Return values:
+ *   %0: Request queued
+ *   %-EBUSY: A handshake is already under way for this socket
+ *   %-ESRCH: No handshake agent is available
+ *   %-EAGAIN: Too many pending handshake requests
+ *   %-ENOMEM: Failed to allocate memory
+ *   %-EMSGSIZE: Failed to construct notification message
+ *
+ * A zero return value from handshake_req_submit() means that
+ * exactly one subsequent completion callback is guaranteed.
+ *
+ * A negative return value from handshake_req_submit() means that
+ * no completion callback will be done and that @req is
+ * destroyed.
+ */
+int handshake_req_submit(struct handshake_req *req, gfp_t flags)
+{
+	struct socket *sock = req->hr_sock;
+	struct sock *sk = sock->sk;
+	struct net *net = sock_net(sk);
+	int ret;
+
+	ret = -EAGAIN;
+	if (READ_ONCE(net->hs_pending) >= HANDSHAKE_PENDING_MAX)
+		goto out_err;
+
+	ret = -EBUSY;
+	spin_lock(&net->hs_lock);
+	if (sk->sk_handshake_req || !list_empty(&req->hr_list)) {
+		spin_unlock(&net->hs_lock);
+		goto out_err;
+	}
+	req->hr_saved_destruct = sk->sk_destruct;
+	sk->sk_destruct = handshake_sk_destruct;
+	sk->sk_handshake_req = req;
+	__add_pending_locked(net, req);
+	spin_unlock(&net->hs_lock);
+
+	ret = handshake_genl_notify(net, req->hr_proto->hp_handler_class,
+				    flags);
+	if (ret) {
+		trace_handshake_notify_err(net, req, sock, ret);
+		if (remove_pending(net, req))
+			goto out_err;
+	}
+
+	trace_handshake_submit(net, req, sock);
+	return 0;
+
+out_err:
+	trace_handshake_submit_err(net, req, sock, ret);
+	handshake_req_destroy(req, sk);
+	return ret;
+}
+EXPORT_SYMBOL(handshake_req_submit);
+
+void handshake_complete(struct handshake_req *req, int status,
+			struct nlattr **tb)
+{
+	struct socket *sock = req->hr_sock;
+	struct net *net = sock_net(sock->sk);
+
+	if (!test_and_set_bit(HANDSHAKE_F_COMPLETED, &req->hr_flags)) {
+		trace_handshake_complete(net, req, sock, status);
+		req->hr_proto->hp_done(req, status, tb);
+		__sock_put(sock->sk);
+	}
+}
+
+/**
+ * handshake_req_cancel - consumer API to cancel an in-progress handshake
+ * @sock: socket on which there is an ongoing handshake
+ *
+ * XXX: Perhaps killing the user space agent might also be necessary?
+ *
+ * Request cancellation races with request completion. To determine
+ * who won, callers examine the return value from this function.
+ *
+ * Return values:
+ *   %0 - Uncompleted handshake request was canceled or not found
+ *   %-EBUSY - Handshake request already completed
+ */
+int handshake_req_cancel(struct socket *sock)
+{
+	struct handshake_req *req;
+	struct sock *sk;
+	struct net *net;
+
+	if (!sock)
+		return 0;
+
+	sk = sock->sk;
+	req = sk->sk_handshake_req;
+	net = sock_net(sk);
+
+	if (!req) {
+		trace_handshake_cancel_none(net, req, sock);
+		return 0;
+	}
+
+	if (remove_pending(net, req)) {
+		/* Request hadn't been accepted */
+		trace_handshake_cancel(net, req, sock);
+		return 0;
+	}
+	if (test_and_set_bit(HANDSHAKE_F_COMPLETED, &req->hr_flags)) {
+		/* Request already completed */
+		trace_handshake_cancel_busy(net, req, sock);
+		return -EBUSY;
+	}
+
+	__sock_put(sk);
+	trace_handshake_cancel(net, req, sock);
+	return 0;
+}
+EXPORT_SYMBOL(handshake_req_cancel);
diff --git a/net/handshake/trace.c b/net/handshake/trace.c
new file mode 100644
index 000000000000..3a5b6f29a2b8
--- /dev/null
+++ b/net/handshake/trace.c
@@ -0,0 +1,17 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trace points for transport layer security handshakes.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/types.h>
+#include <net/sock.h>
+
+#include "handshake.h"
+
+#define CREATE_TRACE_POINTS
+
+#include <trace/events/handshake.h>