[RFC,net-next,0/5] net: In-kernel QUIC implementation with Userspace handshake

Message ID cover.1710173427.git.lucien.xin@gmail.com (mailing list archive)

Xin Long March 11, 2024, 4:10 p.m. UTC
Introduction
============

This is an implementation of the QUIC protocol as defined in RFC9000. QUIC
is a UDP-based multiplexed and secure transport protocol that provides
applications with flow-controlled streams for structured communication,
low-latency connection establishment, and network path migration. QUIC
includes security measures that ensure confidentiality, integrity, and
availability in a range of deployment circumstances.

This implementation of QUIC in the kernel space enables users to utilize
the QUIC protocol through common socket APIs in user space. Additionally,
kernel subsystems like SMB and NFS can seamlessly operate over the QUIC
protocol after handshake using net/handshake APIs.

Note that the in-kernel QUIC implementation does NOT target Crypto Offload
support for existing userland QUIC implementations, and Crypto Offload
intended for userland QUIC can NOT be used by kernel consumers such as SMB.
Therefore, there is no conflict between in-kernel QUIC and Crypto
Offload for userland QUIC implementations.

This implementation offers fundamental support for the following RFCs:

- RFC9000 - QUIC: A UDP-Based Multiplexed and Secure Transport
- RFC9001 - Using TLS to Secure QUIC
- RFC9002 - QUIC Loss Detection and Congestion Control
- RFC9221 - An Unreliable Datagram Extension to QUIC
- RFC9287 - Greasing the QUIC Bit
- RFC9368 - Compatible Version Negotiation for QUIC
- RFC9369 - QUIC Version 2
- Handshake APIs for tlshd Use - SMB/NFS over QUIC

Implementation
==============

The central idea is to implement QUIC within the kernel while keeping the
TLS handshake in userspace.

Only the processing and creation of raw TLS Handshake Messages take place
in userspace, with the help of a TLS library such as gnutls. These messages
are exchanged via sendmsg()/recvmsg(), with cryptographic details carried
in the control message (cmsg).
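
As a rough illustration of how a raw TLS message and its crypto level could
travel together in one sendmsg() call, the sketch below attaches an ancillary
record to a msghdr with the standard CMSG_* macros. The EX_* constants and the
struct layout are placeholders assumed for this example only; the
authoritative definitions are those in include/uapi/linux/quic.h from this
series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Placeholder values assumed for illustration; real code uses the uapi header. */
#define EX_SOL_QUIC            288
#define EX_QUIC_HANDSHAKE_INFO 1

/* Hypothetical layout: only the crypto level is carried. */
struct ex_handshake_info {
	uint8_t crypto_level;
};

/* Caller-provided storage so the msghdr stays valid after the call returns. */
struct ex_hs_msg {
	struct msghdr msg;
	struct iovec iov;
	union {
		char buf[CMSG_SPACE(sizeof(struct ex_handshake_info))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} ctl;
};

/* Point the iovec at the TLS bytes and describe them in one cmsg record. */
static void ex_prepare_handshake_msg(struct ex_hs_msg *m, void *tls_msg,
				     size_t len, uint8_t level)
{
	struct ex_handshake_info info = { .crypto_level = level };
	struct cmsghdr *cmsg;

	assert(m && tls_msg);
	memset(m, 0, sizeof(*m));
	m->iov.iov_base = tls_msg;
	m->iov.iov_len = len;
	m->msg.msg_iov = &m->iov;
	m->msg.msg_iovlen = 1;
	m->msg.msg_control = m->ctl.buf;
	m->msg.msg_controllen = sizeof(m->ctl.buf);

	cmsg = CMSG_FIRSTHDR(&m->msg);
	cmsg->cmsg_level = EX_SOL_QUIC;
	cmsg->cmsg_type = EX_QUIC_HANDSHAKE_INFO;
	cmsg->cmsg_len = CMSG_LEN(sizeof(info));
	memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
}
```

The prepared m.msg would then be passed to sendmsg(sockfd, &m.msg, 0); on the
receive side the same macros walk the control buffer filled in by recvmsg().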

The rest of the QUIC protocol, everything except the processing and
creation of TLS Handshake Messages, resides in the kernel. Instead of using
an Upper Layer Protocol (ULP) layer, it establishes a socket of IPPROTO_QUIC
type (similar to IPPROTO_MPTCP) operating over UDP tunnels.

Kernel consumers can initiate a handshake request from kernel to userspace
via the existing net/handshake netlink interface. The userspace component,
tlshd from ktls-utils, processes the QUIC handshake request.

- Handshake Architecture:

      +------+  +------+
      | APP1 |  | APP2 | ...
      +------+  +------+
      +-------------------------------------------------+
      |                libquic (ktls-utils)             |<--------------+
      |      {quic_handshake_server/client/param()}     |               |
      +-------------------------------------------------+      +---------------------+
        {send/recvmsg()}         {set/getsockopt()}            | tlshd (ktls-utils)  |
        [CMSG handshake_info]    [SOCKOPT_CRYPTO_SECRET]       +---------------------+
                                 [SOCKOPT_TRANSPORT_PARAM_EXT]
              | ^                            | ^                        | ^
  Userspace   | |                            | |                        | |
  ------------|-|----------------------------|-|------------------------|-|--------------
  Kernel      | |                            | |                        | |
              v |                            v |                        v |
      +--------------------------------------------------+         +-------------+
      |  socket (IPPROTO_QUIC)   |       protocol        |<----+   | handshake   |
      +--------------------------------------------------+     |   | netlink APIs|
      | inqueue | outqueue | cong | path | connection_id |     |   +-------------+
      +--------------------------------------------------+     |      |      |
      |   packet   |   frame   |   crypto   |   pnmap    |     |   +-----+ +-----+
      +--------------------------------------------------+     |   |     | |     |
      |         input           |       output           |     |---| SMB | | NFS | ...
      +--------------------------------------------------+     |   |     | |     |
      |                   UDP tunnels                    |     |   +-----+ +--+--+
      +--------------------------------------------------+     +--------------|

- Post Handshake Architecture:

      +------+  +------+
      | APP1 |  | APP2 | ...
      +------+  +------+
        {send/recvmsg()}         {set/getsockopt()}
        [CMSG stream_info]       [SOCKOPT_KEY_UPDATE]
                                 [SOCKOPT_CONNECTION_MIGRATION]
                                 [SOCKOPT_STREAM_OPEN/RESET/STOP_SENDING]
                                 [...]
              | ^                            | ^
  Userspace   | |                            | |
  ------------|-|----------------------------|-|----------------
  Kernel      | |                            | |
              v |                            v |
      +--------------------------------------------------+
      |  socket (IPPROTO_QUIC)   |       protocol        |<----+ {kernel_send/recvmsg()}
      +--------------------------------------------------+     | {kernel_set/getsockopt()}
      | inqueue | outqueue | cong | path | connection_id |     |
      +--------------------------------------------------+     |
      |   packet   |   frame   |   crypto   |   pnmap    |     |   +-----+ +-----+
      +--------------------------------------------------+     |   |     | |     |
      |         input           |       output           |     |---| SMB | | NFS | ...
      +--------------------------------------------------+     |   |     | |     |
      |                   UDP tunnels                    |     |   +-----+ +--+--+
      +--------------------------------------------------+     +--------------|

Usage
=====

This implementation maps QUIC onto the sockets API. Similar to TCP and
SCTP, a typical server and client use the following system call sequence
to communicate:

    Client                             Server
    ------------------------------------------------------------------
    sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
    bind(sockfd)                       bind(listenfd)
                                       listen(listenfd)
    connect(sockfd)
    quic_client_handshake(sockfd)
                                       sockfd = accept(listenfd)
                                       quic_server_handshake(sockfd, cert)

    sendmsg(sockfd)                    recvmsg(sockfd)
    close(sockfd)                      close(sockfd)
                                       close(listenfd)
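
The first client-side steps of the table might look roughly like this in C.
This is only a sketch under stated assumptions: the IPPROTO_QUIC value and the
use of SOCK_DGRAM are placeholders (the real constant comes from the patched
uapi headers), the ex_* helper names are invented, and
quic_client_handshake() is only mentioned in a comment because its exact
signature is not given here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_QUIC
#define IPPROTO_QUIC 261	/* assumed placeholder; see the patched <linux/in.h> */
#endif

/* Fill an IPv4 socket address; returns 0, or -EINVAL for a malformed IP string. */
static int ex_make_addr(const char *ip, uint16_t port, struct sockaddr_in *sa)
{
	assert(ip && sa);
	memset(sa, 0, sizeof(*sa));
	sa->sin_family = AF_INET;
	sa->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -EINVAL;
}

/* socket() + connect(); returns the fd or a negative errno value. */
static int ex_quic_connect(const char *ip, uint16_t port)
{
	struct sockaddr_in sa;
	int fd, err;

	err = ex_make_addr(ip, port, &sa);
	if (err)
		return err;

	fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_QUIC);
	if (fd < 0)
		return -errno;	/* expected on kernels without QUIC support */

	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		err = -errno;	/* save errno before close() can change it */
		close(fd);
		return err;
	}

	/* The libquic handshake helper would be called on fd at this point. */
	return fd;
}
```

On a kernel without the QUIC module the socket() call is expected to fail, so
callers should treat a negative return as "QUIC unavailable" rather than as a
fatal error.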

Please note that the quic_client_handshake() and quic_server_handshake()
functions are currently provided by libquic in the lxin/quic repository on
GitHub, and might be integrated into ktls-utils in the future. These functions
are responsible for receiving and processing the raw TLS handshake messages
until the handshake is complete.

For utilization by kernel consumers, it is essential to have the tlshd service
(from ktls-utils) installed and running in userspace. This service receives
and manages kernel handshake requests for kernel sockets. In kernel, the APIs
closely resemble those used in userspace:

    Client                              Server
    ------------------------------------------------------------------------
    __sock_create(IPPROTO_QUIC, &sock)  __sock_create(IPPROTO_QUIC, &sock)
    kernel_bind(sock)                   kernel_bind(sock)
                                        kernel_listen(sock)
    kernel_connect(sock)
    tls_client_hello_x509(args:{sock})
                                        kernel_accept(sock, &newsock)
                                        tls_server_hello_x509(args:{newsock})

    kernel_sendmsg(sock)                kernel_recvmsg(newsock)
    sock_release(sock)                  sock_release(newsock)
                                        sock_release(sock)

Please be aware that tls_client_hello_x509() and tls_server_hello_x509() are
APIs from net/handshake/. They are employed to dispatch the handshake request
to the userspace tlshd service and subsequently block until the handshake
process is completed.

For advanced usage,
see man doc: https://github.com/lxin/quic/wiki/man
and examples: https://github.com/lxin/quic/tree/main/tests

The QUIC module is currently labeled as "EXPERIMENTAL".

Xin Long (5):
  net: define IPPROTO_QUIC and SOL_QUIC constants for QUIC protocol
  net: include quic.h in include/uapi/linux for QUIC protocol
  net: implement QUIC protocol code in net/quic directory
  net: integrate QUIC build configuration into Kconfig and Makefile
  Documentation: introduce quic.rst to provide description of QUIC
    protocol

 Documentation/networking/quic.rst |  160 +++
 include/linux/socket.h            |    1 +
 include/uapi/linux/in.h           |    2 +
 include/uapi/linux/quic.h         |  189 +++
 net/Kconfig                       |    1 +
 net/Makefile                      |    1 +
 net/quic/Kconfig                  |   34 +
 net/quic/Makefile                 |   20 +
 net/quic/cong.c                   |  229 ++++
 net/quic/cong.h                   |   84 ++
 net/quic/connection.c             |  172 +++
 net/quic/connection.h             |  117 ++
 net/quic/crypto.c                 |  979 ++++++++++++++++
 net/quic/crypto.h                 |  140 +++
 net/quic/frame.c                  | 1803 ++++++++++++++++++++++++++++
 net/quic/frame.h                  |  162 +++
 net/quic/hashtable.h              |  125 ++
 net/quic/input.c                  |  693 +++++++++++
 net/quic/input.h                  |  169 +++
 net/quic/number.h                 |  174 +++
 net/quic/output.c                 |  638 ++++++++++
 net/quic/output.h                 |  194 +++
 net/quic/packet.c                 | 1179 +++++++++++++++++++
 net/quic/packet.h                 |   99 ++
 net/quic/path.c                   |  434 +++++++
 net/quic/path.h                   |  131 +++
 net/quic/pnmap.c                  |  217 ++++
 net/quic/pnmap.h                  |  134 +++
 net/quic/protocol.c               |  711 +++++++++++
 net/quic/protocol.h               |   56 +
 net/quic/sample_test.c            |  339 ++++++
 net/quic/socket.c                 | 1823 +++++++++++++++++++++++++++++
 net/quic/socket.h                 |  293 +++++
 net/quic/stream.c                 |  248 ++++
 net/quic/stream.h                 |  147 +++
 net/quic/timer.c                  |  241 ++++
 net/quic/timer.h                  |   29 +
 net/quic/unit_test.c              | 1024 ++++++++++++++++
 38 files changed, 13192 insertions(+)
 create mode 100644 Documentation/networking/quic.rst
 create mode 100644 include/uapi/linux/quic.h
 create mode 100644 net/quic/Kconfig
 create mode 100644 net/quic/Makefile
 create mode 100644 net/quic/cong.c
 create mode 100644 net/quic/cong.h
 create mode 100644 net/quic/connection.c
 create mode 100644 net/quic/connection.h
 create mode 100644 net/quic/crypto.c
 create mode 100644 net/quic/crypto.h
 create mode 100644 net/quic/frame.c
 create mode 100644 net/quic/frame.h
 create mode 100644 net/quic/hashtable.h
 create mode 100644 net/quic/input.c
 create mode 100644 net/quic/input.h
 create mode 100644 net/quic/number.h
 create mode 100644 net/quic/output.c
 create mode 100644 net/quic/output.h
 create mode 100644 net/quic/packet.c
 create mode 100644 net/quic/packet.h
 create mode 100644 net/quic/path.c
 create mode 100644 net/quic/path.h
 create mode 100644 net/quic/pnmap.c
 create mode 100644 net/quic/pnmap.h
 create mode 100644 net/quic/protocol.c
 create mode 100644 net/quic/protocol.h
 create mode 100644 net/quic/sample_test.c
 create mode 100644 net/quic/socket.c
 create mode 100644 net/quic/socket.h
 create mode 100644 net/quic/stream.c
 create mode 100644 net/quic/stream.h
 create mode 100644 net/quic/timer.c
 create mode 100644 net/quic/timer.h
 create mode 100644 net/quic/unit_test.c

Comments

Stefan Metzmacher March 13, 2024, 8:56 a.m. UTC | #1
Hi Xin Long,

first many thanks for working on this topic!

> Usage
> =====
> 
> This implementation supports a mapping of QUIC into sockets APIs. Similar
> to TCP and SCTP, a typical Server and Client use the following system call
> sequence to communicate:
> 
>         Client                    Server
>      ------------------------------------------------------------------
>      sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
>      bind(sockfd)                       bind(listenfd)
>                                         listen(listenfd)
>      connect(sockfd)
>      quic_client_handshake(sockfd)
>                                         sockfd = accept(listenfd)
>                                         quic_server_handshake(sockfd, cert)
> 
>      sendmsg(sockfd)                    recvmsg(sockfd)
>      close(sockfd)                      close(sockfd)
>                                         close(listenfd)
> 
> Please note that quic_client_handshake() and quic_server_handshake() functions
> are currently sourced from libquic in the github lxin/quic repository, and might
> be integrated into ktls-utils in the future. These functions are responsible for
> receiving and processing the raw TLS handshake messages until the completion of
> the handshake process.

I see a problem with this design for the server, as one reason to
have SMB over QUIC is to use UDP port 443 in order to get through
firewalls. As QUIC has the concept of ALPN, it should be possible to
let a consumer listen only on a specific ALPN, so that the SMB server
and a web server on "h3" could both accept connections.

So the server application should have a way to specify the desired
ALPN before or during the bind() call. I'm not sure if the
ALPN is available in cleartext before any crypto is needed,
so if the ALPN is encrypted it might also be necessary to register
a server certificate and key together with the ALPN,
because multiple applications may not want to share the same key.

This needs to work independently of whether the application lives in the
kernel or in userspace.

We may want ksmbd (kernel SMB server) together with apache, or smbd
(Samba's userspace SMB server) together with apache. And maybe even ksmbd
with one certificate for ksmbd.example.com and smbd with a certificate for
smbd.example.com, both on ALPN "smb", while apache uses "h3" with a
certificate for apache.example.com and nginx uses "h3" with a certificate
for nginx.example.com.

But also smbd with "smb" as well as apache with "h3" both using
a certificate for quic.example.com.
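
To make the requested behaviour concrete, here is a purely hypothetical sketch
of the lookup such a design would need: a table keyed by (ALPN, server name)
that selects the consumer. Nothing like this exists in the posted series; the
table contents are the hostnames from the scenario above and the ex_* names
are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One registration: which consumer serves a given ALPN on a given hostname. */
struct ex_route {
	const char *alpn;
	const char *sni;
	const char *consumer;
};

static const struct ex_route ex_routes[] = {
	{ "smb", "ksmbd.example.com",  "ksmbd"  },
	{ "smb", "smbd.example.com",   "smbd"   },
	{ "h3",  "apache.example.com", "apache" },
	{ "h3",  "nginx.example.com",  "nginx"  },
};

/* Returns the consumer registered for (alpn, sni), or NULL if none matches. */
static const char *ex_route_lookup(const char *alpn, const char *sni)
{
	size_t i;

	assert(alpn && sni);
	for (i = 0; i < sizeof(ex_routes) / sizeof(ex_routes[0]); i++) {
		if (strcmp(ex_routes[i].alpn, alpn) == 0 &&
		    strcmp(ex_routes[i].sni, sni) == 0)
			return ex_routes[i].consumer;
	}
	return NULL;
}
```

The shared-certificate case (one hostname, several ALPNs) needs no extra
logic: the entries simply repeat the hostname with different ALPN values.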

I guess TLS Server Name Indication also works for QUIC, correct?

For the client side I guess dynamic udp ports are used and
there's no problem with multiple applications...

metze
Xin Long March 13, 2024, 4:03 p.m. UTC | #2
On Wed, Mar 13, 2024 at 4:56 AM Stefan Metzmacher <metze@samba.org> wrote:
>
> Hi Xin Long,
>
> first many thanks for working on this topic!
>
Hi, Stefan

Thanks for the comment!

>
> I see a problem with this design for the server, as one reason to
> have SMB over QUIC is to use udp port 443 in order to get through
> firewalls. As QUIC has the concept of ALPN it should be possible to
> let a consumer only listen on a specific ALPN, so that the smb server
> and web server on "h3" could both accept connections.
We do provide a sockopt to set ALPN before bind or handshaking:

  https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn

But it is used to verify that the ALPN set on the server matches the
one received from the client, rather than to find the correct server.

So you expect (k)smbd server and web server both to listen on UDP
port 443 on the same host, and which APP server accepts the request
from a client depends on ALPN, right?

Currently, in the kernel, this implementation doesn't process any raw TLS
MSG/EXTs but delivers them to userspace after decryption, and the accept
socket is created before the handshake is processed.

I'm actually curious how userland QUIC handles this, considering
that the UDP sockets ('listening' on the same IP:PORT) are used in
two different servers' processes. I think socket lookup with ALPN
has to be done in kernel space. Do you know of any userland QUIC
implementation that does this?

>
> So the server application should have a way to specify the desired
> ALPN before or during the bind() call. I'm not sure if the
> ALPN is available in cleartext before any crypto is needed,
> so if the ALPN is encrypted it might be needed to also register
> a server certificate and key together with the ALPN.
> Because multiple application may not want to share the same key.
On the send side, the ALPN extension is part of the raw TLS messages created
in userspace, which are passed into the kernel, encoded into a QUIC CRYPTO
frame and then *encrypted* before being sent out.

On the recv side, after decryption, the raw TLS messages are decoded from
the QUIC CRYPTO frame and then delivered to userspace, so userspace
performs certificate validation and also sees the cleartext ALPN.
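
For reference, the ALPN extension in the cleartext ClientHello is a
length-prefixed list of protocol names (RFC 7301). The sketch below checks
whether a given name is offered in such a list. The function name and the
standalone-buffer interface are invented for illustration; in practice the
TLS library does this parsing, not libquic or the application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * ext: body of the ALPN extension, i.e. a 2-byte big-endian list length
 * followed by entries of the form <1-byte length><name bytes>.
 * Returns true if `want` is one of the offered protocol names.
 */
static bool ex_alpn_offered(const uint8_t *ext, size_t ext_len, const char *want)
{
	size_t want_len, list_len, off = 2;

	assert(ext && want);
	want_len = strlen(want);
	if (ext_len < 2)
		return false;
	list_len = ((size_t)ext[0] << 8) | ext[1];
	if (list_len != ext_len - 2)
		return false;	/* length field disagrees with the buffer */

	while (off < ext_len) {
		size_t n = ext[off++];

		if (n == 0 || n > ext_len - off)
			return false;	/* empty or truncated entry */
		if (n == want_len && memcmp(ext + off, want, n) == 0)
			return true;
		off += n;
	}
	return false;
}
```

A server restricted to "smb" would accept a ClientHello only when
ex_alpn_offered(ext, len, "smb") holds.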

Let me know if this is not clear.

>
> This needs to work independent of kernel or userspace application.
>
> We may want ksmbd (kernel smb server) and apache or smbd (Samba's userspace smb server)
> together with apache. And maybe even ksmbd with one certificate for
> ksmbd.example.com and smbd with a certificate for smbd.example.com
> both on ALPN "smb", while apache uses "h3" with a certificate for
> apache.example.com and nginx with "h3" and a certificate for
> nginx.example.com.
>
> But also smbd with "smb" as well as apache with "h3" both using
> a certificate for quic.example.com.
>
> I guess TLS Server Name Indication also works for QUIC, correct?
Yes, QUIC is secured by TLS 1.3, almost all extensions in TLS1.3
are supported in QUIC.

In userspace I believe we can use gnutls_server_name_set() to
set SNI on client and process SNI on server after getting SNI
via gnutls_server_name_get() in .post_client_hello_cb().

I think this would be able to work out the multiple certificates
(with different hostnames) used by one smbd server.
>
> For the client side I guess dynamic udp ports are used and
> there's no problem with multiple applications...
>
Right.

Thanks again for your comment.
Stefan Metzmacher March 13, 2024, 5:28 p.m. UTC | #3
Am 13.03.24 um 17:03 schrieb Xin Long:
> On Wed, Mar 13, 2024 at 4:56 AM Stefan Metzmacher <metze@samba.org> wrote:
>>
>> Hi Xin Long,
>>
>> first many thanks for working on this topic!
>>
> Hi, Stefan
> 
> Thanks for the comment!
> 
>>
>> I see a problem with this design for the server, as one reason to
>> have SMB over QUIC is to use udp port 443 in order to get through
>> firewalls. As QUIC has the concept of ALPN it should be possible to
>> let a consumer only listen on a specific ALPN, so that the smb server
>> and web server on "h3" could both accept connections.
> We do provide a sockopt to set ALPN before bind or handshaking:
> 
>    https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn
> 
> But it's used more like to verify if the ALPN set on the server
> matches the one received from the client, instead of to find
> the correct server.

Ah, ok.

> So you expect (k)smbd server and web server both to listen on UDP
> port 443 on the same host, and which APP server accepts the request
> from a client depends on ALPN, right?

yes.

> Currently, in Kernel, this implementation doesn't process any raw TLS
> MSG/EXTs but deliver them to userspace after decryption, and the accept
> socket is created before processing handshake.
> 
> I'm actually curious how userland QUIC handles this, considering
> that the UDP sockets('listening' on the same IP:PORT) are used in
> two different servers' processes. I think socket lookup with ALPN
> has to be done in Kernel Space. Do you know any userland QUIC
> implementation for this?

I don't know, but I guess QUIC is only used for HTTP so
far, and maybe DNS, but that seems to use port 853.

So there's no strict need for it and the web server
would handle all relevant ALPNs.

>>
>> So the server application should have a way to specify the desired
>> ALPN before or during the bind() call. I'm not sure if the
>> ALPN is available in cleartext before any crypto is needed,
>> so if the ALPN is encrypted it might be needed to also register
>> a server certificate and key together with the ALPN.
>> Because multiple application may not want to share the same key.
> On send side, ALPN extension is in raw TLS messages created in userspace
> and passed into the kernel and encoded into QUIC crypto frame and then
> *encrypted* before sending out.

Ok.

> On recv side, after decryption, the raw TLS messages are decoded from
> the QUIC crypto frame and then delivered to userspace, so in userspace
> it processes certificate validation and also see cleartext ALPN.
> 
> Let me know if I don't make it clear.

But the first "new" QUIC PDU will trigger accept() to return, and
userspace (or the kernel helper function) will do all the crypto?
Or does the first decryption happen in the kernel (before accept() returns)?

Maybe it would be possible to optionally have a socket option to
register ALPNs with certificates so that tls_server_hello_x509()
could be called automatically before accept() returns (even for
userspace consumers).

It may mean the tlshd protocol needs to be extended...

metze
Xin Long March 13, 2024, 7:39 p.m. UTC | #4
On Wed, Mar 13, 2024 at 1:28 PM Stefan Metzmacher <metze@samba.org> wrote:
>
> Am 13.03.24 um 17:03 schrieb Xin Long:
> > On Wed, Mar 13, 2024 at 4:56 AM Stefan Metzmacher <metze@samba.org> wrote:
> >>
> >> Hi Xin Long,
> >>
> >> first many thanks for working on this topic!
> >>
> > Hi, Stefan
> >
> > Thanks for the comment!
> >
> >>
> >> I see a problem with this design for the server, as one reason to
> >> have SMB over QUIC is to use udp port 443 in order to get through
> >> firewalls. As QUIC has the concept of ALPN it should be possible to
> >> let a consumer only listen on a specific ALPN, so that the smb server
> >> and web server on "h3" could both accept connections.
> > We do provide a sockopt to set ALPN before bind or handshaking:
> >
> >    https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn
> >
> > But it's used more like to verify if the ALPN set on the server
> > matches the one received from the client, instead of to find
> > the correct server.
>
> Ah, ok.
Just note that, with a small change to the current libquic, users can still
use ALPN to find the correct function or thread in the *same* process.
Usage would look like:

listenfd = socket(IPPROTO_QUIC);
/* match all during handshake with wildcard ALPN */
setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "*");
bind(listenfd);
listen(listenfd);

while (1) {
  sockfd = accept(listenfd);
  /* the ALPN from the client is set on sockfd during the handshake */
  quic_server_handshake(sockfd, cert);

  getsockopt(sockfd, QUIC_SOCKOPT_ALPN, alpn);

  /* C cannot switch on strings, so compare explicitly */
  if (!strcmp(alpn, "smbd"))
    smbd_thread(sockfd);
  else if (!strcmp(alpn, "h3"))
    h3_thread(sockfd);
  else if (!strcmp(alpn, "ksmbd"))
    ksmbd_thread(sockfd);
}
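
The dispatch step on its own can be written as a small table lookup. The
following is a self-contained sketch, not libquic code: the ALPN strings are
the ones from the loop above, while the ex_* names are invented and the
handlers are stubs that only return a tag instead of serving the connection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*ex_handler_t)(int sockfd);

/* Stub handlers: a real server would hand sockfd to a worker thread. */
static int ex_smbd_thread(int sockfd)  { (void)sockfd; return 1; }
static int ex_h3_thread(int sockfd)    { (void)sockfd; return 2; }
static int ex_ksmbd_thread(int sockfd) { (void)sockfd; return 3; }

static const struct {
	const char *alpn;
	ex_handler_t handler;
} ex_alpn_table[] = {
	{ "smbd",  ex_smbd_thread },
	{ "h3",    ex_h3_thread },
	{ "ksmbd", ex_ksmbd_thread },
};

/* Returns the handler's result, or -1 if no entry matches the negotiated ALPN. */
static int ex_dispatch(const char *alpn, int sockfd)
{
	size_t i;

	assert(alpn);
	for (i = 0; i < sizeof(ex_alpn_table) / sizeof(ex_alpn_table[0]); i++) {
		if (strcmp(alpn, ex_alpn_table[i].alpn) == 0)
			return ex_alpn_table[i].handler(sockfd);
	}
	return -1;
}
```

A table keeps the mapping in one place, and an unknown ALPN falls through to
an explicit error instead of being silently ignored.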

>
> > So you expect (k)smbd server and web server both to listen on UDP
> > port 443 on the same host, and which APP server accepts the request
> > from a client depends on ALPN, right?
>
> yes.
Got you. This can be done by also moving TLS 1.3 message exchange to
kernel where we can get the ALPN before looking up the listening socket.
However, In-kernel TLS 1.3 Handshake had been NACKed by both kernel
netdev maintainers and userland ssl lib developers with good reasons.

>
> > Currently, in Kernel, this implementation doesn't process any raw TLS
> > MSG/EXTs but deliver them to userspace after decryption, and the accept
> > socket is created before processing handshake.
> >
> > I'm actually curious how userland QUIC handles this, considering
> > that the UDP sockets('listening' on the same IP:PORT) are used in
> > two different servers' processes. I think socket lookup with ALPN
> > has to be done in Kernel Space. Do you know any userland QUIC
> > implementation for this?
>
> I don't know, but I guess QUIC is only used for http so
> far and maybe dns, but that seems to use port 853.
>
> So there's no strict need for it and the web server
> would handle all relevant ALPNs.
Honestly, I don't think any userland QUIC can use ALPN to look up
different sockets used by different servers/processes, as such a thing
can only be done in kernel space.

>
> >>
> >> So the server application should have a way to specify the desired
> >> ALPN before or during the bind() call. I'm not sure if the
> >> ALPN is available in cleartext before any crypto is needed,
> >> so if the ALPN is encrypted it might be needed to also register
> >> a server certificate and key together with the ALPN.
> >> Because multiple application may not want to share the same key.
> > On send side, ALPN extension is in raw TLS messages created in userspace
> > and passed into the kernel and encoded into QUIC crypto frame and then
> > *encrypted* before sending out.
>
> Ok.
>
> > On recv side, after decryption, the raw TLS messages are decoded from
> > the QUIC crypto frame and then delivered to userspace, so in userspace
> > it processes certificate validation and also see cleartext ALPN.
> >
> > Let me know if I don't make it clear.
>
> But the first "new" QUIC pdu from will trigger the accept() to
> return and userspace (or the kernel helper function) will to
> all crypto? Or does the first decryption happen in kernel (before accept returns)?
Good question!

The first "new" QUIC PDU causes a 'request sock' (containing only the
4-tuple and connection IDs) to be created and queued on the reqsk list of
the listen sock (if the validate_peer_address param is not set), and the PDU
itself is enqueued on the inq->backlog_list of the listen sock.

When accept() is called, the kernel dequeues the "request sock" from the
reqsk list of the listen sock and creates the accept socket based on this
reqsk. It then processes the PDU for this new accept socket from the
inq->backlog_list of the listen sock, including *decrypting* the QUIC packet
and decoding the CRYPTO frame, and delivers the raw/cleartext TLS message to
the userspace libquic.

Then in Userspace libquic, it handles the received TLS message and creates
a new raw/cleartext TLS message for response via libgnutls, and delivers to
kernel. In kernel, it will encode this message to a CRYPTO frame in a QUIC
packet and then *encrypt* this QUIC packet and send it out.

So as you can see, there's no en/decryption happening in Userspace. In
Userspace libquic, it only does raw/cleartext TLS message exchange. ALL
en/decryption happens in Kernel Space, as these en/decryption are done
against QUIC packets, not directly against the TLS messages.
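
As background for the 'request sock' step: the version and connection IDs of
a long-header packet are readable without any decryption, which is why they
can be recorded before the accept socket exists. The sketch below extracts
them following the long header layout in RFC 9000. It is an illustration with
invented ex_* names, not code from net/quic, and it assumes the QUIC v1 limit
of 20 bytes per connection ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_MAX_CID_LEN 20	/* QUIC v1 limit on connection ID length */

struct ex_cids {
	uint32_t version;
	uint8_t dcid[EX_MAX_CID_LEN];
	uint8_t dcid_len;
	uint8_t scid[EX_MAX_CID_LEN];
	uint8_t scid_len;
};

/* Parse first byte, version, DCID and SCID; returns 0 on success, -1 on error. */
static int ex_parse_long_header(const uint8_t *p, size_t len, struct ex_cids *out)
{
	size_t off = 5;	/* 1 byte of flags + 4 bytes of version */

	assert(p && out);
	/* Smallest long header: flags, version, two zero-length CIDs. */
	if (len < 7 || !(p[0] & 0x80))
		return -1;
	out->version = ((uint32_t)p[1] << 24) | ((uint32_t)p[2] << 16) |
		       ((uint32_t)p[3] << 8) | p[4];

	out->dcid_len = p[off++];
	/* Need the DCID bytes plus the SCID length byte that follows them. */
	if (out->dcid_len > EX_MAX_CID_LEN || len - off < (size_t)out->dcid_len + 1)
		return -1;
	memcpy(out->dcid, p + off, out->dcid_len);
	off += out->dcid_len;

	out->scid_len = p[off++];
	if (out->scid_len > EX_MAX_CID_LEN || len - off < out->scid_len)
		return -1;
	memcpy(out->scid, p + off, out->scid_len);
	return 0;
}
```

Everything after these fields (token, length, packet number and the CRYPTO
frame payload) requires the packet protection keys, which matches the
division of work described above.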

>
> Maybe it would be possible to optionally have socket option to
> register ALPNs with certificates so that tls_server_hello_x509()
> could be called automatically before accept returns (even for
> userspace consumers).
>
> It may mean the tlshd protocol needs to be extended...
>
so that userspace consumers don't need quic_client/server_handshake(), and
accept() returns a socket that already has the handshake done, right?

We didn't do that, as:

1. It's not a good idea for userspace consumers' applications to rely on
   a daemon like tlshd: it is not convenient for users, and it is also a bit
   odd for a userspace app to ask another userspace app to do the handshake.
2. It's too complex to implement, especially if we also want to call
   tls_client_hello_x509() before connect() returns on client side.
3. For Kernel usage, I prefer leaving this to the kernel consumers for
   more flexibility for handshake requests.

As for the ALPNs with certificates, I'm not sure I understand correctly.
But if you want the server to select certificates according to the ALPN
received from the client during the handshake, I think it could be done in
userspace libquic. But yes, the tlshd service may also need to be extended.
Stefan Metzmacher March 14, 2024, 9:21 a.m. UTC | #5
Am 13.03.24 um 20:39 schrieb Xin Long:
> On Wed, Mar 13, 2024 at 1:28 PM Stefan Metzmacher <metze@samba.org> wrote:
>>
>> Am 13.03.24 um 17:03 schrieb Xin Long:
>>> On Wed, Mar 13, 2024 at 4:56 AM Stefan Metzmacher <metze@samba.org> wrote:
>>>>
>>>> Hi Xin Long,
>>>>
>>>> first many thanks for working on this topic!
>>>>
>>> Hi, Stefan
>>>
>>> Thanks for the comment!
>>>
>>>>> Usage
>>>>> =====
>>>>>
>>>>> This implementation supports a mapping of QUIC into sockets APIs. Similar
>>>>> to TCP and SCTP, a typical Server and Client use the following system call
>>>>> sequence to communicate:
>>>>>
>>>>>           Client                    Server
>>>>>        ------------------------------------------------------------------
>>>>>        sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
>>>>>        bind(sockfd)                       bind(listenfd)
>>>>>                                           listen(listenfd)
>>>>>        connect(sockfd)
>>>>>        quic_client_handshake(sockfd)
>>>>>                                           sockfd = accept(listenfd)
>>>>>                                           quic_server_handshake(sockfd, cert)
>>>>>
>>>>>        sendmsg(sockfd)                    recvmsg(sockfd)
>>>>>        close(sockfd)                      close(sockfd)
>>>>>                                           close(listenfd)
>>>>>
>>>>> Please note that quic_client_handshake() and quic_server_handshake() functions
>>>>> are currently sourced from libquic in the github lxin/quic repository, and might
>>>>> be integrated into ktls-utils in the future. These functions are responsible for
>>>>> receiving and processing the raw TLS handshake messages until the completion of
>>>>> the handshake process.
>>>>
>>>> I see a problem with this design for the server, as one reason to
>>>> have SMB over QUIC is to use udp port 443 in order to get through
>>>> firewalls. As QUIC has the concept of ALPN it should be possible
>>>> let a consumer only listen on a specific ALPN, so that the smb server
>>>> and web server on "h3" could both accept connections.
>>> We do provide a sockopt to set ALPN before bind or handshaking:
>>>
>>>     https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn
>>>
>>> But it's used more like to verify if the ALPN set on the server
>>> matches the one received from the client, instead of to find
>>> the correct server.
>>
>> Ah, ok.
> Just note that, with a bit change in the current libquic, it still
> allows users to use ALPN to find the correct function or thread in
> the *same* process, usage be like:
> 
> listenfd = socket(IPPROTO_QUIC);
> /* match all during handshake with wildcard ALPN */
> setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "*");
> bind(listenfd)
> listen(listenfd)
> 
> while (1) {
>    sockfd = accept(listenfd);
>    /* the alpn from client will be set to sockfd during handshake */
>    quic_server_handshake(sockfd, cert);
> 
>    getsockopt(sockfd, QUIC_SOCKOPT_ALPN, alpn);

Would quic_server_handshake() call setsockopt()?

>    switch (alpn) {
>      case "smbd": smbd_thread(sockfd);
>      case "h3": h3_thread(sockfd);
>      case "ksmbd": ksmbd_thread(sockfd);
>    }
> }

Ok, but that would mean all applications need to be aware of each other.
Still, it would be possible, and socket fds could be passed to other
processes.

>>
>>> So you expect (k)smbd server and web server both to listen on UDP
>>> port 443 on the same host, and which APP server accepts the request
>>> from a client depends on ALPN, right?
>>
>> yes.
> Got you. This can be done by also moving TLS 1.3 message exchange to
> kernel where we can get the ALPN before looking up the listening socket.
> However, In-kernel TLS 1.3 Handshake had been NACKed by both kernel
> netdev maintainers and userland ssl lib developers with good reasons.
> 
>>
>>> Currently, in Kernel, this implementation doesn't process any raw TLS
>>> MSG/EXTs but deliver them to userspace after decryption, and the accept
>>> socket is created before processing handshake.
>>>
>>> I'm actually curious how userland QUIC handles this, considering
>>> that the UDP sockets('listening' on the same IP:PORT) are used in
>>> two different servers' processes. I think socket lookup with ALPN
>>> has to be done in Kernel Space. Do you know any userland QUIC
>>> implementation for this?
>>
>> I don't know, but I guess QUIC is only used for http so
>> far and maybe dns, but that seems to use port 853.
>>
>> So there's no strict need for it and the web server
>> would handle all relevant ALPNs.
> Honestly, I don't think any userland QUIC can use ALPN to lookup for
> different sockets used by different servers/processes. As such thing
> can be only done in Kernel Space.
> 
>>
>>>>
>>>> So the server application should have a way to specify the desired
>>>> ALPN before or during the bind() call. I'm not sure if the
>>>> ALPN is available in cleartext before any crypto is needed,
>>>> so if the ALPN is encrypted it might be needed to also register
>>>> a server certificate and key together with the ALPN.
>>>> Because multiple application may not want to share the same key.
>>> On send side, ALPN extension is in raw TLS messages created in userspace
>>> and passed into the kernel and encoded into QUIC crypto frame and then
>>> *encrypted* before sending out.
>>
>> Ok.
>>
>>> On recv side, after decryption, the raw TLS messages are decoded from
>>> the QUIC crypto frame and then delivered to userspace, so in userspace
>>> it processes certificate validation and also see cleartext ALPN.
>>>
>>> Let me know if I don't make it clear.
>>
>> But the first "new" QUIC pdu from will trigger the accept() to
>> return and userspace (or the kernel helper function) will to
>> all crypto? Or does the first decryption happen in kernel (before accept returns)?
> Good question!
> 
> The first "new" QUIC pdu will cause to create a 'request sock' (contains
> 4-tuple and connection IDs only) and queue up to reqsk list of the listen
> sock (if validate_peer_address param is not set), and this pdu is enqueued
> in the inq->backlog_list of the listen sock.
> 
> When accept() is called, in Kernel, it dequeues the "request sock" from the
> reqsk list of the listen sock, and creates the accept socket based on this
> reqsk. Then it processes the pdu for this new accept socket from the
> inq->backlog_list of the listen sock, including *decrypting* QUIC packet
> and decoding CRYPTO frame, then deliver the raw/cleartext TLS message to
> the Userspace libquic.

Ok, when the kernel already decrypts, it could already
find the ALPN. It doesn't mean it should do the full
handshake, but parse enough to find the ALPN.

But I don't yet understand how the kernel gets the key to
do the initial decryption. I'd assume some call before listen()
needs to tell the kernel about the keys.

> Then in Userspace libquic, it handles the received TLS message and creates
> a new raw/cleartext TLS message for response via libgnutls, and delivers to
> kernel. In kernel, it will encode this message to a CRYPTO frame in a QUIC
> packet and then *encrypt* this QUIC packet and send it out.
> 
> So as you can see, there's no en/decryption happening in Userspace. In
> Userspace libquic, it only does raw/cleartext TLS message exchange. ALL
> en/decryption happens in Kernel Space, as these en/decryption are done
> against QUIC packets, not directly against the TLS messages.
> 
>>
>> Maybe it would be possible to optionally have socket option to
>> register ALPNs with certificates so that tls_server_hello_x509()
>> could be called automatically before accept returns (even for
>> userspace consumers).
>>
>> It may mean the tlshd protocol needs to be extended...
>>
> so that userspace consumers don't need quic_client/server_handshake(), and
> accept() returns a socket that already has the handshake done, right?
> 
> We didn't do that, as:
> 
> 1. It's not a good idea for Userspace consumers' applications to rely on
>     a daemon like tlshd, not convenient for users, also a bit weird for
>     userspace app to ask another userspace app to help do the handshake.
> 2. It's too complex to implement, especially if we also want to call
>     tls_client_hello_x509() before connect() returns on client side.
> 3. For Kernel usage, I prefer leaving this to the kernel consumers for
>     more flexibility for handshake requests.
> 
> As for the ALPNs with certificates, not sure if I understand correctly.
> But if you want the server to select certificates according to the ALPN
> received from the client during handshake. I think it could be done in
> userspace libquic. But yes, tlshd service may also need to extend.

I was just brainstorming for ideas...

metze
Xin Long March 14, 2024, 4:21 p.m. UTC | #6
On Thu, Mar 14, 2024 at 5:21 AM Stefan Metzmacher <metze@samba.org> wrote:
>
> Am 13.03.24 um 20:39 schrieb Xin Long:
> > On Wed, Mar 13, 2024 at 1:28 PM Stefan Metzmacher <metze@samba.org> wrote:
> >>
> >> Am 13.03.24 um 17:03 schrieb Xin Long:
> >>> On Wed, Mar 13, 2024 at 4:56 AM Stefan Metzmacher <metze@samba.org> wrote:
> >>>>
> >>>> Hi Xin Long,
> >>>>
> >>>> first many thanks for working on this topic!
> >>>>
> >>> Hi, Stefan
> >>>
> >>> Thanks for the comment!
> >>>
> >>>>> Usage
> >>>>> =====
> >>>>>
> >>>>> This implementation supports a mapping of QUIC into sockets APIs. Similar
> >>>>> to TCP and SCTP, a typical Server and Client use the following system call
> >>>>> sequence to communicate:
> >>>>>
> >>>>>           Client                    Server
> >>>>>        ------------------------------------------------------------------
> >>>>>        sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
> >>>>>        bind(sockfd)                       bind(listenfd)
> >>>>>                                           listen(listenfd)
> >>>>>        connect(sockfd)
> >>>>>        quic_client_handshake(sockfd)
> >>>>>                                           sockfd = accept(listenfd)
> >>>>>                                           quic_server_handshake(sockfd, cert)
> >>>>>
> >>>>>        sendmsg(sockfd)                    recvmsg(sockfd)
> >>>>>        close(sockfd)                      close(sockfd)
> >>>>>                                           close(listenfd)
> >>>>>
> >>>>> Please note that quic_client_handshake() and quic_server_handshake() functions
> >>>>> are currently sourced from libquic in the github lxin/quic repository, and might
> >>>>> be integrated into ktls-utils in the future. These functions are responsible for
> >>>>> receiving and processing the raw TLS handshake messages until the completion of
> >>>>> the handshake process.
> >>>>
> >>>> I see a problem with this design for the server, as one reason to
> >>>> have SMB over QUIC is to use udp port 443 in order to get through
> >>>> firewalls. As QUIC has the concept of ALPN it should be possible
> >>>> let a consumer only listen on a specific ALPN, so that the smb server
> >>>> and web server on "h3" could both accept connections.
> >>> We do provide a sockopt to set ALPN before bind or handshaking:
> >>>
> >>>     https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn
> >>>
> >>> But it's used more like to verify if the ALPN set on the server
> >>> matches the one received from the client, instead of to find
> >>> the correct server.
> >>
> >> Ah, ok.
> > Just note that, with a bit change in the current libquic, it still
> > allows users to use ALPN to find the correct function or thread in
> > the *same* process, usage be like:
> >
> > listenfd = socket(IPPROTO_QUIC);
> > /* match all during handshake with wildcard ALPN */
> > setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "*");
> > bind(listenfd)
> > listen(listenfd)
> >
> > while (1) {
> >    sockfd = accept(listenfd);
> >    /* the alpn from client will be set to sockfd during handshake */
> >    quic_server_handshake(sockfd, cert);
> >
> >    getsockopt(sockfd, QUIC_SOCKOPT_ALPN, alpn);
>
> Would quic_server_handshake() call setsockopt()?
Yes, I just made a small change in the userspace libquic:

  https://github.com/lxin/quic/commit/9c75bd42769a8cbc1652e2f4c8d77780f23afde6

So you can set up multiple ALPNs on the listen sock:

  setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "smbd, h3, ksmbd");

Then during the handshake, the matched ALPN from the client will be set on
the accept socket, and users can get it later, after the handshake.
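
As a rough illustration of the matching step (this is NOT the libquic code
from the commit above), checking a client's ALPN against a comma-separated
list such as "smbd, h3, ksmbd" could look like the sketch below. The name
alpn_list_match() is hypothetical, and the "*" wildcard handling is an
assumption based on the wildcard example earlier in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libquic: return true if `alpn` equals
 * one entry of the comma-separated `list` (e.g. "smbd, h3, ksmbd").
 * Spaces around entries are ignored. An entry "*" matches any ALPN
 * (assumed wildcard semantics).
 */
bool alpn_list_match(const char *list, const char *alpn)
{
	size_t want = strlen(alpn);
	const char *p = list;

	while (*p) {
		const char *end;
		size_t len;

		/* skip separators and leading spaces */
		while (*p == ' ' || *p == ',')
			p++;
		/* find the end of this entry */
		end = p;
		while (*end && *end != ',')
			end++;
		/* trim trailing spaces */
		len = (size_t)(end - p);
		while (len > 0 && p[len - 1] == ' ')
			len--;

		if (len == 1 && p[0] == '*')
			return true;
		if (len > 0 && len == want && memcmp(p, alpn, len) == 0)
			return true;
		p = end;
	}
	return false;
}
```

After the handshake, an application would read the selected ALPN back with
getsockopt(sockfd, QUIC_SOCKOPT_ALPN, ...) and compare it with strcmp() to
pick a handler, since C cannot switch on strings.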

Note that userspace libquic is a very light lib (a couple of hundred lines
of code), so you can add more TLS-related support without touching kernel
code, including the SNI support you mentioned.

>
> >    switch (alpn) {
> >      case "smbd": smbd_thread(sockfd);
> >      case "h3": h3_thread(sockfd);
> >      case "ksmbd": ksmbd_thread(sockfd);
> >    }
> > }
>
> Ok, but that would mean all application need to be aware of each other,
> but it would be possible and socket fds could be passed to other
> processes.
It doesn't sound common to me, but yes, I think the fd can be passed to
another process over Unix Domain Sockets.
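
For reference, passing a descriptor over a Unix domain socket uses an
SCM_RIGHTS control message. Below is a minimal sketch, assuming a connected
AF_UNIX socket `chan` between the two processes. send_fd() and recv_fd()
are hypothetical helper names, and whether a QUIC socket that has finished
its handshake keeps working as expected after being passed this way has
not been verified here.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send `fd` over the connected Unix domain socket `chan`. Returns 0 or -1. */
int send_fd(int chan, int fd)
{
	char byte = 0;	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `chan`. Returns the new fd or -1. */
int recv_fd(int chan)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(chan, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the dispatcher scenario discussed above, the process that owns listenfd
would call send_fd() with the accepted socket once it knows the ALPN, and
the target server process would call recv_fd() on its end of the channel.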

>
> >>
> >>> So you expect (k)smbd server and web server both to listen on UDP
> >>> port 443 on the same host, and which APP server accepts the request
> >>> from a client depends on ALPN, right?
> >>
> >> yes.
> > Got you. This can be done by also moving TLS 1.3 message exchange to
> > kernel where we can get the ALPN before looking up the listening socket.
> > However, In-kernel TLS 1.3 Handshake had been NACKed by both kernel
> > netdev maintainers and userland ssl lib developers with good reasons.
> >
> >>
> >>> Currently, in Kernel, this implementation doesn't process any raw TLS
> >>> MSG/EXTs but deliver them to userspace after decryption, and the accept
> >>> socket is created before processing handshake.
> >>>
> >>> I'm actually curious how userland QUIC handles this, considering
> >>> that the UDP sockets('listening' on the same IP:PORT) are used in
> >>> two different servers' processes. I think socket lookup with ALPN
> >>> has to be done in Kernel Space. Do you know any userland QUIC
> >>> implementation for this?
> >>
> >> I don't know, but I guess QUIC is only used for http so
> >> far and maybe dns, but that seems to use port 853.
> >>
> >> So there's no strict need for it and the web server
> >> would handle all relevant ALPNs.
> > Honestly, I don't think any userland QUIC can use ALPN to lookup for
> > different sockets used by different servers/processes. As such thing
> > can be only done in Kernel Space.
> >
> >>
> >>>>
> >>>> So the server application should have a way to specify the desired
> >>>> ALPN before or during the bind() call. I'm not sure if the
> >>>> ALPN is available in cleartext before any crypto is needed,
> >>>> so if the ALPN is encrypted it might be needed to also register
> >>>> a server certificate and key together with the ALPN.
> >>>> Because multiple application may not want to share the same key.
> >>> On send side, ALPN extension is in raw TLS messages created in userspace
> >>> and passed into the kernel and encoded into QUIC crypto frame and then
> >>> *encrypted* before sending out.
> >>
> >> Ok.
> >>
> >>> On recv side, after decryption, the raw TLS messages are decoded from
> >>> the QUIC crypto frame and then delivered to userspace, so in userspace
> >>> it processes certificate validation and also see cleartext ALPN.
> >>>
> >>> Let me know if I don't make it clear.
> >>
> >> But the first "new" QUIC pdu from will trigger the accept() to
> >> return and userspace (or the kernel helper function) will to
> >> all crypto? Or does the first decryption happen in kernel (before accept returns)?
> > Good question!
> >
> > The first "new" QUIC pdu will cause to create a 'request sock' (contains
> > 4-tuple and connection IDs only) and queue up to reqsk list of the listen
> > sock (if validate_peer_address param is not set), and this pdu is enqueued
> > in the inq->backlog_list of the listen sock.
> >
> > When accept() is called, in Kernel, it dequeues the "request sock" from the
> > reqsk list of the listen sock, and creates the accept socket based on this
> > reqsk. Then it processes the pdu for this new accept socket from the
> > inq->backlog_list of the listen sock, including *decrypting* QUIC packet
> > and decoding CRYPTO frame, then deliver the raw/cleartext TLS message to
> > the Userspace libquic.
>
> Ok, when the kernel already decrypts it could already
> find the ALPN. It doesn't mean it should do the full
> handshake, but parse enough to find the ALPN.
Correct, in-kernel QUIC should only do the QUIC-related things,
and all TLS handshake msgs must be handled in userspace.
This won't cause a "layering violation", as Nick Banks said.

>
> But I don't yet understand how the kernel gets the key to
> do the initial decryption, I'd assume some call before listen()
> need to tell the kernel about the keys.
For the initial decryption, the keys can be derived from the initial packet.
Basically, it only needs the dst_connection_id from the client's initial
packet. See:

  https://datatracker.ietf.org/doc/html/rfc9001#name-initial-secrets

So we don't need to set up anything in the kernel for the initial keys.
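
To make this concrete: per RFC 9001 Section 5.2, initial_secret is
HKDF-Extract(initial_salt, client_dst_connection_id), and the client secret
is HKDF-Expand-Label(initial_secret, "client in", "", 32). Every input is
either a fixed constant or carried in the packet itself. The sketch below
only shows the v1 salt and how the HkdfLabel input (RFC 8446 Section 7.1)
is laid out. The HMAC-SHA256 steps are omitted, and the names
quic_v1_initial_salt and hkdf_label_build() are hypothetical, not taken
from the patch series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* QUIC version 1 initial salt (RFC 9001, Section 5.2). */
const uint8_t quic_v1_initial_salt[20] = {
	0x38, 0x76, 0x2c, 0xf7, 0xf5, 0x59, 0x34, 0xb3, 0x4d, 0x17,
	0x9a, 0xe6, 0xa4, 0xc8, 0x0c, 0xad, 0xcc, 0xbb, 0x7f, 0x0a,
};

/*
 * Build the HkdfLabel "info" input of TLS 1.3 HKDF-Expand-Label with an
 * empty context:
 *
 *   uint16 length | uint8 label_len | "tls13 " + label | uint8 0
 *
 * Returns the number of bytes written to `out`, or 0 if `out` is too
 * small or the label is too long.
 */
size_t hkdf_label_build(uint8_t *out, size_t cap, const char *label,
			uint16_t out_len)
{
	static const char prefix[] = "tls13 ";
	size_t plen = sizeof(prefix) - 1;
	size_t llen = strlen(label);
	size_t total = 2 + 1 + plen + llen + 1;

	if (plen + llen > 255 || total > cap)
		return 0;

	out[0] = (uint8_t)(out_len >> 8);	/* requested output length */
	out[1] = (uint8_t)(out_len & 0xff);
	out[2] = (uint8_t)(plen + llen);	/* label length */
	memcpy(out + 3, prefix, plen);
	memcpy(out + 3 + plen, label, llen);
	out[3 + plen + llen] = 0;		/* empty context */
	return total;
}
```

For the label "client in" and a 32-byte output, this yields the 19-byte
info value 00200f746c73313320636c69656e7420696e00 listed in RFC 9001
Appendix A.1.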

But the handshake, application, and early_data keys will be set up in the
kernel during the handshake via:

  setsockopt(QUIC_SOCKOPT_CRYPTO_SECRET)

Thanks.
>
> > Then in Userspace libquic, it handles the received TLS message and creates
> > a new raw/cleartext TLS message for response via libgnutls, and delivers to
> > kernel. In kernel, it will encode this message to a CRYPTO frame in a QUIC
> > packet and then *encrypt* this QUIC packet and send it out.
> >
> > So as you can see, there's no en/decryption happening in Userspace. In
> > Userspace libquic, it only does raw/cleartext TLS message exchange. ALL
> > en/decryption happens in Kernel Space, as these en/decryption are done
> > against QUIC packets, not directly against the TLS messages.
> >
> >>
> >> Maybe it would be possible to optionally have socket option to
> >> register ALPNs with certificates so that tls_server_hello_x509()
> >> could be called automatically before accept returns (even for
> >> userspace consumers).
> >>
> >> It may mean the tlshd protocol needs to be extended...
> >>
> > so that userspace consumers don't need quic_client/server_handshake(), and
> > accept() returns a socket that already has the handshake done, right?
> >
> > We didn't do that, as:
> >
> > 1. It's not a good idea for Userspace consumers' applications to reply on
> >     a daemon like tlshd, not convenient for users, also a bit weird for
> >     userspace app to ask another userspace app to help do the handshake.
> > 2. It's too complex to implement, especially if we also want to call
> >     tls_client_hello_x509() before connect() returns on client side.
> > 3. For Kernel usage, I prefer leaving this to the kernel consumers for
> >     more flexibility for handshake requests.
> >
> > As for the ALPNs with certificates, not sure if I understand correctly.
> > But if you want the server to select certificates according to the ALPN
> > received from the client during handshake. I think it could be done in
> > userspace libquic. But yes, tlshd service may also need to extend.
>
> I was just brainstorming for ideas...
>
> metze
Jason Baron April 8, 2024, 2:07 p.m. UTC | #7
Hi,

This series looks very interesting. I was just wondering if you had done
any performance testing of this vs. a userspace QUIC implementation?

Thanks,

-Jason

On 3/11/24 12:10 PM, Xin Long wrote:
> Introduction
> ============
> 
> This is an implementation of the QUIC protocol as defined in RFC9000. QUIC
> is an UDP-Based Multiplexed and Secure Transport protocol, and it provides
> applications with flow-controlled streams for structured communication,
> low-latency connection establishment, and network path migration. QUIC
> includes security measures that ensure confidentiality, integrity, and
> availability in a range of deployment circumstances.
> 
> This implementation of QUIC in the kernel space enables users to utilize
> the QUIC protocol through common socket APIs in user space. Additionally,
> kernel subsystems like SMB and NFS can seamlessly operate over the QUIC
> protocol after handshake using net/handshake APIs.
> 
> Note that In-Kernel QUIC implementation does NOT target Crypto Offload
> support for existing Userland QUICs, and Crypto Offload intended for
> Userland QUICs can NOT be utilized for Kernel consumers, such as SMB.
> Therefore, there is no conflict between In-Kernel QUIC and Crypto
> Offload for Userland QUICs.
> 
> This implementation offers fundamental support for the following RFCs:
> 
> - RFC9000 - QUIC: A UDP-Based Multiplexed and Secure Transport
> - RFC9001 - Using TLS to Secure QUIC
> - RFC9002 - QUIC Loss Detection and Congestion Control
> - RFC9221 - An Unreliable Datagram Extension to QUIC
> - RFC9287 - Greasing the QUIC Bit
> - RFC9368 - Compatible Version Negotiation for QUIC
> - RFC9369 - QUIC Version 2
> - Handshake APIs for tlshd Use - SMB/NFS over QUIC
> 
> Implementation
> ==============
> 
> The central idea is to implement QUIC within the kernel, incorporating an
> userspace handshake approach.
> 
> Only the processing and creation of raw TLS Handshake Messages, facilitated
> by a tls library like gnutls, take place in userspace. These messages are
> exchanged through sendmsg/recvmsg() mechanisms, with cryptographic details
> carried in the control message (cmsg).
> 
> The entirety of QUIC protocol, excluding TLS Handshake Messages processing
> and creation, resides in the kernel. Instead of utilizing a User Level
> Protocol (ULP) layer, it establishes a socket of IPPROTO_QUIC type (similar
> to IPPROTO_MPTCP) operating over UDP tunnels.
> 
> Kernel consumers can initiate a handshake request from kernel to userspace
> via the existing net/handshake netlink. The userspace component, tlshd from
> ktls-utils, manages the QUIC handshake request processing.
> 
> - Handshake Architecture:
> 
>        +------+  +------+
>        | APP1 |  | APP2 | ...
>        +------+  +------+
>        +-------------------------------------------------+
>        |                libquic (ktls-utils)             |<--------------+
>        |      {quic_handshake_server/client/param()}     |               |
>        +-------------------------------------------------+      +---------------------+
>          {send/recvmsg()}         {set/getsockopt()}            | tlshd (ktls-utils)  |
>          [CMSG handshake_info]    [SOCKOPT_CRYPTO_SECRET]       +---------------------+
>                                   [SOCKOPT_TRANSPORT_PARAM_EXT]
>                | ^                            | ^                        | ^
>    Userspace   | |                            | |                        | |
>    ------------|-|----------------------------|-|------------------------|-|--------------
>    Kernel      | |                            | |                        | |
>                v |                            v |                        v |
>        +--------------------------------------------------+         +-------------+
>        |  socket (IPPROTO_QUIC)   |       protocol        |<----+   | handshake   |
>        +--------------------------------------------------+     |   | netlink APIs|
>        | inqueue | outqueue | cong | path | connection_id |     |   +-------------+
>        +--------------------------------------------------+     |      |      |
>        |   packet   |   frame   |   crypto   |   pnmap    |     |   +-----+ +-----+
>        +--------------------------------------------------+     |   |     | |     |
>        |         input           |       output           |     |---| SMB | | NFS | ...
>        +--------------------------------------------------+     |   |     | |     |
>        |                   UDP tunnels                    |     |   +-----+ +--+--+
>        +--------------------------------------------------+     +--------------|
> 
> - Post Handshake Architecture:
> 
>        +------+  +------+
>        | APP1 |  | APP2 | ...
>        +------+  +------+
>          {send/recvmsg()}         {set/getsockopt()}
>          [CMSG stream_info]       [SOCKOPT_KEY_UPDATE]
>                                   [SOCKOPT_CONNECTION_MIGRATION]
>                                   [SOCKOPT_STREAM_OPEN/RESET/STOP_SENDING]
>                                   [...]
>                | ^                            | ^
>    Userspace   | |                            | |
>    ------------|-|----------------------------|-|----------------
>    Kernel      | |                            | |
>                v |                            v |
>        +--------------------------------------------------+
>        |  socket (IPPROTO_QUIC)   |       protocol        |<----+ {kernel_send/recvmsg()}
>        +--------------------------------------------------+     | {kernel_set/getsockopt()}
>        | inqueue | outqueue | cong | path | connection_id |     |
>        +--------------------------------------------------+     |
>        |   packet   |   frame   |   crypto   |   pnmap    |     |   +-----+ +-----+
>        +--------------------------------------------------+     |   |     | |     |
>        |         input           |       output           |     |---| SMB | | NFS | ...
>        +--------------------------------------------------+     |   |     | |     |
>        |                   UDP tunnels                    |     |   +-----+ +--+--+
>        +--------------------------------------------------+     +--------------|
> 
> Usage
> =====
> 
> This implementation supports a mapping of QUIC into sockets APIs. Similar
> to TCP and SCTP, a typical Server and Client use the following system call
> sequence to communicate:
> 
>         Client                    Server
>      ------------------------------------------------------------------
>      sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
>      bind(sockfd)                       bind(listenfd)
>                                         listen(listenfd)
>      connect(sockfd)
>      quic_client_handshake(sockfd)
>                                         sockfd = accept(listenfd)
>                                         quic_server_handshake(sockfd, cert)
> 
>      sendmsg(sockfd)                    recvmsg(sockfd)
>      close(sockfd)                      close(sockfd)
>                                         close(listenfd)
> 
> Please note that quic_client_handshake() and quic_server_handshake() functions
> are currently sourced from libquic in the github lxin/quic repository, and might
> be integrated into ktls-utils in the future. These functions are responsible for
> receiving and processing the raw TLS handshake messages until the completion of
> the handshake process.
> 
> For utilization by kernel consumers, it is essential to have the tlshd service
> (from ktls-utils) installed and running in userspace. This service receives
> and manages kernel handshake requests for kernel sockets. In kernel, the APIs
> closely resemble those used in userspace:
> 
>         Client                    Server
>      ------------------------------------------------------------------------
>      __sock_create(IPPROTO_QUIC, &sock)  __sock_create(IPPROTO_QUIC, &sock)
>      kernel_bind(sock)                   kernel_bind(sock)
>                                          kernel_listen(sock)
>      kernel_connect(sock)
>      tls_client_hello_x509(args:{sock})
>                                          kernel_accept(sock, &newsock)
>                                          tls_server_hello_x509(args:{newsock})
> 
>      kernel_sendmsg(sock)                kernel_recvmsg(newsock)
>      sock_release(sock)                  sock_release(newsock)
>                                          sock_release(sock)
> 
> Please be aware that tls_client_hello_x509() and tls_server_hello_x509() are
> APIs from net/handshake/. They are employed to dispatch the handshake request
> to the userspace tlshd service and subsequently block until the handshake
> process is completed.
> 
> For advanced usage,
> see man doc: https://github.com/lxin/quic/wiki/man
> and examples: https://github.com/lxin/quic/tree/main/tests
> 
> The QUIC module is currently labeled as "EXPERIMENTAL".
> 
> Xin Long (5):
>    net: define IPPROTO_QUIC and SOL_QUIC constants for QUIC protocol
>    net: include quic.h in include/uapi/linux for QUIC protocol
>    net: implement QUIC protocol code in net/quic directory
>    net: integrate QUIC build configuration into Kconfig and Makefile
>    Documentation: introduce quic.rst to provide description of QUIC
>      protocol
> 
>   Documentation/networking/quic.rst |  160 +++
>   include/linux/socket.h            |    1 +
>   include/uapi/linux/in.h           |    2 +
>   include/uapi/linux/quic.h         |  189 +++
>   net/Kconfig                       |    1 +
>   net/Makefile                      |    1 +
>   net/quic/Kconfig                  |   34 +
>   net/quic/Makefile                 |   20 +
>   net/quic/cong.c                   |  229 ++++
>   net/quic/cong.h                   |   84 ++
>   net/quic/connection.c             |  172 +++
>   net/quic/connection.h             |  117 ++
>   net/quic/crypto.c                 |  979 ++++++++++++++++
>   net/quic/crypto.h                 |  140 +++
>   net/quic/frame.c                  | 1803 ++++++++++++++++++++++++++++
>   net/quic/frame.h                  |  162 +++
>   net/quic/hashtable.h              |  125 ++
>   net/quic/input.c                  |  693 +++++++++++
>   net/quic/input.h                  |  169 +++
>   net/quic/number.h                 |  174 +++
>   net/quic/output.c                 |  638 ++++++++++
>   net/quic/output.h                 |  194 +++
>   net/quic/packet.c                 | 1179 +++++++++++++++++++
>   net/quic/packet.h                 |   99 ++
>   net/quic/path.c                   |  434 +++++++
>   net/quic/path.h                   |  131 +++
>   net/quic/pnmap.c                  |  217 ++++
>   net/quic/pnmap.h                  |  134 +++
>   net/quic/protocol.c               |  711 +++++++++++
>   net/quic/protocol.h               |   56 +
>   net/quic/sample_test.c            |  339 ++++++
>   net/quic/socket.c                 | 1823 +++++++++++++++++++++++++++++
>   net/quic/socket.h                 |  293 +++++
>   net/quic/stream.c                 |  248 ++++
>   net/quic/stream.h                 |  147 +++
>   net/quic/timer.c                  |  241 ++++
>   net/quic/timer.h                  |   29 +
>   net/quic/unit_test.c              | 1024 ++++++++++++++++
>   38 files changed, 13192 insertions(+)
>   create mode 100644 Documentation/networking/quic.rst
>   create mode 100644 include/uapi/linux/quic.h
>   create mode 100644 net/quic/Kconfig
>   create mode 100644 net/quic/Makefile
>   create mode 100644 net/quic/cong.c
>   create mode 100644 net/quic/cong.h
>   create mode 100644 net/quic/connection.c
>   create mode 100644 net/quic/connection.h
>   create mode 100644 net/quic/crypto.c
>   create mode 100644 net/quic/crypto.h
>   create mode 100644 net/quic/frame.c
>   create mode 100644 net/quic/frame.h
>   create mode 100644 net/quic/hashtable.h
>   create mode 100644 net/quic/input.c
>   create mode 100644 net/quic/input.h
>   create mode 100644 net/quic/number.h
>   create mode 100644 net/quic/output.c
>   create mode 100644 net/quic/output.h
>   create mode 100644 net/quic/packet.c
>   create mode 100644 net/quic/packet.h
>   create mode 100644 net/quic/path.c
>   create mode 100644 net/quic/path.h
>   create mode 100644 net/quic/pnmap.c
>   create mode 100644 net/quic/pnmap.h
>   create mode 100644 net/quic/protocol.c
>   create mode 100644 net/quic/protocol.h
>   create mode 100644 net/quic/sample_test.c
>   create mode 100644 net/quic/socket.c
>   create mode 100644 net/quic/socket.h
>   create mode 100644 net/quic/stream.c
>   create mode 100644 net/quic/stream.h
>   create mode 100644 net/quic/timer.c
>   create mode 100644 net/quic/timer.h
>   create mode 100644 net/quic/unit_test.c
>
Xin Long April 8, 2024, 3:05 p.m. UTC | #8
On Mon, Apr 8, 2024 at 10:07 AM Jason Baron <jbaron@akamai.com> wrote:
>
> Hi,
>
> This series looks very interesting- I was just wondering if you had done
> any performance testing of this vs. a userspace QUIC implementation?
Hi, Jason,

I have only tested against kTLS, using an adapted iperf:

https://github.com/lxin/quic?tab=readme-ov-file#build-and-install-iperf-for-performance-tests

Userspace QUIC implementations don't use the common socket APIs, so I
couldn't find a tool to run performance tests against them.

Thanks.
>
> Thanks,
>
> -Jason
>
Stefan Metzmacher April 19, 2024, 2:07 p.m. UTC | #9
Hi Xin Long,

>>>>>> first many thanks for working on this topic!
>>>>>>
>>>>> Hi, Stefan
>>>>>
>>>>> Thanks for the comment!
>>>>>
>>>>>>> Usage
>>>>>>> =====
>>>>>>>
>>>>>>> This implementation supports a mapping of QUIC into sockets APIs. Similar
>>>>>>> to TCP and SCTP, a typical Server and Client use the following system call
>>>>>>> sequence to communicate:
>>>>>>>
>>>>>>>            Client                    Server
>>>>>>>         ------------------------------------------------------------------
>>>>>>>         sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
>>>>>>>         bind(sockfd)                       bind(listenfd)
>>>>>>>                                            listen(listenfd)
>>>>>>>         connect(sockfd)
>>>>>>>         quic_client_handshake(sockfd)
>>>>>>>                                            sockfd = accept(listenfd)
>>>>>>>                                            quic_server_handshake(sockfd, cert)
>>>>>>>
>>>>>>>         sendmsg(sockfd)                    recvmsg(sockfd)
>>>>>>>         close(sockfd)                      close(sockfd)
>>>>>>>                                            close(listenfd)
>>>>>>>
>>>>>>> Please note that quic_client_handshake() and quic_server_handshake() functions
>>>>>>> are currently sourced from libquic in the github lxin/quic repository, and might
>>>>>>> be integrated into ktls-utils in the future. These functions are responsible for
>>>>>>> receiving and processing the raw TLS handshake messages until the completion of
>>>>>>> the handshake process.
>>>>>>
>>>>>> I see a problem with this design for the server, as one reason to
>>>>>> have SMB over QUIC is to use udp port 443 in order to get through
>>>>>> firewalls. As QUIC has the concept of ALPN it should be possible to
>>>>>> let a consumer listen only on a specific ALPN, so that the smb server
>>>>>> and web server on "h3" could both accept connections.
>>>>> We do provide a sockopt to set ALPN before bind or handshaking:
>>>>>
>>>>>      https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn
>>>>>
>>>>> But it's used more to verify that the ALPN set on the server
>>>>> matches the one received from the client than to find
>>>>> the correct server.
>>>>
>>>> Ah, ok.
>>> Just note that, with a small change in the current libquic, it still
>>> allows users to use ALPN to find the correct function or thread in
>>> the *same* process. Usage would look like:
>>>
>>> listenfd = socket(IPPROTO_QUIC);
>>> /* match all during handshake with wildcard ALPN */
>>> setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "*");
>>> bind(listenfd)
>>> listen(listenfd)
>>>
>>> while (1) {
>>>     sockfd = accept(listenfd);
>>>     /* the alpn from client will be set to sockfd during handshake */
>>>     quic_server_handshake(sockfd, cert);
>>>
>>>     getsockopt(sockfd, QUIC_SOCKOPT_ALPN, alpn);
>>
>> Would quic_server_handshake() call setsockopt()?
> Yes, I just made a small change in the userspace libquic:
> 
>    https://github.com/lxin/quic/commit/9c75bd42769a8cbc1652e2f4c8d77780f23afde6
> 
> So you can set up multiple ALPNs on the listen sock:
> 
>    setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "smbd, h3, ksmbd");
> 
> Then during handshake, the matched ALPN from client will be set into
> the accept socket, then users can get it later after handshake.
> 
> Note that userspace libquic is a very light lib (a couple of hundred lines
> of code), you can add more TLS related support without touching Kernel code,
> including the SNI support you mentioned.
> 
>>
>>>     /* C cannot switch on strings, so compare explicitly */
>>>     if (!strcmp(alpn, "smbd"))
>>>         smbd_thread(sockfd);
>>>     else if (!strcmp(alpn, "h3"))
>>>         h3_thread(sockfd);
>>>     else if (!strcmp(alpn, "ksmbd"))
>>>         ksmbd_thread(sockfd);
>>> }
>>
>> Ok, but that would mean all applications need to be aware of each other,
>> but it would be possible and socket fds could be passed to other
>> processes.
> It doesn't sound common to me, but yes, I think Unix Domain Sockets
> can pass it to another process.

I think it will be extremely common to have multiple services
based on udp port 443.

People will expect to find web services, smb and maybe more
behind the same dns hostname. And multiple dns hostnames pointing
to the same ip address is also very likely.

With plain tcp/udp it's also possible to have independent sockets
per port. There's no single userspace daemon that listens on
'tcp' and dispatches to different processes based on the port.

And with QUIC the port space is the ALPN and/or SNI
combination.

And I think this should be addressed before this becomes an
unchangeable kernel ABI, written in stone.

>>>>> So you expect (k)smbd server and web server both to listen on UDP
>>>>> port 443 on the same host, and which APP server accepts the request
>>>>> from a client depends on ALPN, right?
>>>>
>>>> yes.
>>> Got you. This can be done by also moving TLS 1.3 message exchange to
>>> kernel where we can get the ALPN before looking up the listening socket.
>>> However, In-kernel TLS 1.3 Handshake had been NACKed by both kernel
>>> netdev maintainers and userland ssl lib developers with good reasons.
>>>
>>>>
>>>>> Currently, in Kernel, this implementation doesn't process any raw TLS
>>>>> MSG/EXTs but delivers them to userspace after decryption, and the accept
>>>>> socket is created before processing the handshake.
>>>>>
>>>>> I'm actually curious how userland QUIC handles this, considering
>>>>> that the UDP sockets('listening' on the same IP:PORT) are used in
>>>>> two different servers' processes. I think socket lookup with ALPN
>>>>> has to be done in Kernel Space. Do you know any userland QUIC
>>>>> implementation for this?
>>>>
>>>> I don't know, but I guess QUIC is only used for http so
>>>> far and maybe dns, but that seems to use port 853.
>>>>
>>>> So there's no strict need for it and the web server
>>>> would handle all relevant ALPNs.
>>> Honestly, I don't think any userland QUIC can use ALPN to look up
>>> different sockets used by different servers/processes, as such a thing
>>> can only be done in Kernel Space.
>>>
>>>>
>>>>>>
>>>>>> So the server application should have a way to specify the desired
>>>>>> ALPN before or during the bind() call. I'm not sure if the
>>>>>> ALPN is available in cleartext before any crypto is needed,
>>>>>> so if the ALPN is encrypted it might be necessary to also register
>>>>>> a server certificate and key together with the ALPN,
>>>>>> because multiple applications may not want to share the same key.
>>>>> On send side, ALPN extension is in raw TLS messages created in userspace
>>>>> and passed into the kernel and encoded into QUIC crypto frame and then
>>>>> *encrypted* before sending out.
>>>>
>>>> Ok.
>>>>
>>>>> On recv side, after decryption, the raw TLS messages are decoded from
>>>>> the QUIC crypto frame and then delivered to userspace, so userspace
>>>>> handles certificate validation and also sees the cleartext ALPN.
>>>>>
>>>>> Let me know if that isn't clear.
>>>>
>>>> But will the first "new" QUIC pdu trigger accept() to
>>>> return, with userspace (or the kernel helper function) doing
>>>> all the crypto? Or does the first decryption happen in kernel (before accept returns)?
>>> Good question!
>>>
>>> The first "new" QUIC pdu causes a 'request sock' (containing the
>>> 4-tuple and connection IDs only) to be created and queued on the reqsk list
>>> of the listen sock (if the validate_peer_address param is not set), and this
>>> pdu is enqueued in the inq->backlog_list of the listen sock.
>>>
>>> When accept() is called, the kernel dequeues the "request sock" from the
>>> reqsk list of the listen sock and creates the accept socket based on this
>>> reqsk. Then it processes the pdu for this new accept socket from the
>>> inq->backlog_list of the listen sock, including *decrypting* the QUIC packet
>>> and decoding the CRYPTO frame, then delivers the raw/cleartext TLS message to
>>> the userspace libquic.
>>
>> Ok, when the kernel already decrypts, it could already
>> find the ALPN. It doesn't mean it should do the full
>> handshake, but parse enough to find the ALPN.
> Correct, in-kernel QUIC should only do the QUIC related things,
> and all TLS handshake msgs must be handled in Userspace.
> This won't cause "layering violation", as Nick Banks said.

But I think it's unavoidable for the ALPN and SNI fields on
the server side, as every service tries to use udp port 443
and somehow that needs to be shared if multiple services want to
use it.

I guess on the acceptor side we would need to somehow detach the low level
udp struct sock from the logical listen struct sock.

And quic_do_listen_rcv() would need to find the correct logical listening
socket and call quic_request_sock_enqueue() on the logical socket,
not the low level udp socket. The same goes for everything happening after
quic_request_sock_enqueue() at the end of quic_do_listen_rcv().

>> But I don't yet understand how the kernel gets the key to
>> do the initial decryption; I'd assume some call before listen()
>> needs to tell the kernel about the keys.
> For initial decryption, the keys can be derived from the initial packet.
> Basically, it only needs the dst_connection_id from the client initial
> packet. See:
> 
>    https://datatracker.ietf.org/doc/html/rfc9001#name-initial-secrets
> 
> so we don't need to pass anything to the kernel for the initial keys.

I got it thanks!

metze
Xin Long April 19, 2024, 6:09 p.m. UTC | #10
On Fri, Apr 19, 2024 at 10:07 AM Stefan Metzmacher <metze@samba.org> wrote:
>
> Hi Xin Long,
>
> >>>>>> first many thanks for working on this topic!
> >>>>>>
> >>>>> Hi, Stefan
> >>>>>
> >>>>> Thanks for the comment!
> >>>>>
> >>>>>>> Usage
> >>>>>>> =====
> >>>>>>>
> >>>>>>> This implementation supports a mapping of QUIC into sockets APIs. Similar
> >>>>>>> to TCP and SCTP, a typical Server and Client use the following system call
> >>>>>>> sequence to communicate:
> >>>>>>>
> >>>>>>>            Client                    Server
> >>>>>>>         ------------------------------------------------------------------
> >>>>>>>         sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
> >>>>>>>         bind(sockfd)                       bind(listenfd)
> >>>>>>>                                            listen(listenfd)
> >>>>>>>         connect(sockfd)
> >>>>>>>         quic_client_handshake(sockfd)
> >>>>>>>                                            sockfd = accecpt(listenfd)
> >>>>>>>                                            quic_server_handshake(sockfd, cert)
> >>>>>>>
> >>>>>>>         sendmsg(sockfd)                    recvmsg(sockfd)
> >>>>>>>         close(sockfd)                      close(sockfd)
> >>>>>>>                                            close(listenfd)
> >>>>>>>
> >>>>>>> Please note that quic_client_handshake() and quic_server_handshake() functions
> >>>>>>> are currently sourced from libquic in the github lxin/quic repository, and might
> >>>>>>> be integrated into ktls-utils in the future. These functions are responsible for
> >>>>>>> receiving and processing the raw TLS handshake messages until the completion of
> >>>>>>> the handshake process.
> >>>>>>
> >>>>>> I see a problem with this design for the server, as one reason to
> >>>>>> have SMB over QUIC is to use udp port 443 in order to get through
> >>>>>> firewalls. As QUIC has the concept of ALPN it should be possible
> >>>>>> let a conumer only listen on a specif ALPN, so that the smb server
> >>>>>> and web server on "h3" could both accept connections.
> >>>>> We do provide a sockopt to set ALPN before bind or handshaking:
> >>>>>
> >>>>>      https://github.com/lxin/quic/wiki/man#quic_sockopt_alpn
> >>>>>
> >>>>> But it's used more like to verify if the ALPN set on the server
> >>>>> matches the one received from the client, instead of to find
> >>>>> the correct server.
> >>>>
> >>>> Ah, ok.
> >>> Just note that, with a bit change in the current libquic, it still
> >>> allows users to use ALPN to find the correct function or thread in
> >>> the *same* process, usage be like:
> >>>
> >>> listenfd = socket(IPPROTO_QUIC);
> >>> /* match all during handshake with wildcard ALPN */
> >>> setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "*");
> >>> bind(listenfd)
> >>> listen(listenfd)
> >>>
> >>> while (1) {
> >>>     sockfd = accept(listenfd);
> >>>     /* the alpn from client will be set to sockfd during handshake */
> >>>     quic_server_handshake(sockfd, cert);
> >>>
> >>>     getsockopt(sockfd, QUIC_SOCKOPT_ALPN, alpn);
> >>
> >> Would quic_server_handshake() call setsockopt()?
> > Yes, I just made a bit change in the userspace libquic:
> >
> >    https://github.com/lxin/quic/commit/9c75bd42769a8cbc1652e2f4c8d77780f23afde6
> >
> > So you can set up multple ALPNs on listen sock:
> >
> >    setsockopt(listenfd, QUIC_SOCKOPT_ALPN, "smbd, h3, ksmbd");
> >
> > Then during handshake, the matched ALPN from client will be set into
> > the accept socket, then users can get it later after handshake.
> >
> > Note that userspace libquic is a very light lib (a couple of hundred lines
> > of code), you can add more TLS related support without touching Kernel code,
> > including the SNI support you mentioned.
> >
> >>
> >>>     switch (alpn) {
> >>>       case "smbd": smbd_thread(sockfd);
> >>>       case "h3": h3_thread(sockfd);
> >>>       case "ksmbd": ksmbd_thread(sockfd);
> >>>     }
> >>> }
> >>
> >> Ok, but that would mean all application need to be aware of each other,
> >> but it would be possible and socket fds could be passed to other
> >> processes.
> > It doesn't sound common to me, but yes, I think Unix Domain Sockets
> > can pass it to another process.
>
> I think it will be extremely common to have multiple services
> based on udp port 443.
>
> People will expect to find web services, smb and maybe more
> behind the same dnshost name. And multiple dnshostnames pointing
> to the same ip address is also very likely.
>
> With plain tcp/udp it's also possible to independent sockets
> per port. There's no single userspace daemon that listens on
> 'tcp' and will dispatch into different process base on the port.
>
> And with QUIC the port space is the ALPN and/or SNI
> combination.
>
> And I think this should be addressed before this becomes an
> unchangeable kernel ABI, written is stone.
>
> >>>>> So you expect (k)smbd server and web server both to listen on UDP
> >>>>> port 443 on the same host, and which APP server accepts the request
> >>>>> from a client depends on ALPN, right?
> >>>>
> >>>> yes.
> >>> Got you. This can be done by also moving TLS 1.3 message exchange to
> >>> kernel where we can get the ALPN before looking up the listening socket.
> >>> However, In-kernel TLS 1.3 Handshake had been NACKed by both kernel
> >>> netdev maintainers and userland ssl lib developers with good reasons.
> >>>
> >>>>
> >>>>> Currently, in Kernel, this implementation doesn't process any raw TLS
> >>>>> MSG/EXTs but deliver them to userspace after decryption, and the accept
> >>>>> socket is created before processing handshake.
> >>>>>
> >>>>> I'm actually curious how userland QUIC handles this, considering
> >>>>> that the UDP sockets('listening' on the same IP:PORT) are used in
> >>>>> two different servers' processes. I think socket lookup with ALPN
> >>>>> has to be done in Kernel Space. Do you know any userland QUIC
> >>>>> implementation for this?
> >>>>
> >>>> I don't now, but I guess QUIC is only used for http so
> >>>> far and maybe dns, but that seems to use port 853.
> >>>>
> >>>> So there's no strict need for it and the web server
> >>>> would handle all relevant ALPNs.
> >>> Honestly, I don't think any userland QUIC can use ALPN to lookup for
> >>> different sockets used by different servers/processes. As such thing
> >>> can be only done in Kernel Space.
> >>>
> >>>>
> >>>>>>
> >>>>>> So the server application should have a way to specify the desired
> >>>>>> ALPN before or during the bind() call. I'm not sure if the
> >>>>>> ALPN is available in cleartext before any crypto is needed,
> >>>>>> so if the ALPN is encrypted it might be needed to also register
> >>>>>> a server certificate and key together with the ALPN.
> >>>>>> Because multiple application may not want to share the same key.
> >>>>> On send side, ALPN extension is in raw TLS messages created in userspace
> >>>>> and passed into the kernel and encoded into QUIC crypto frame and then
> >>>>> *encrypted* before sending out.
> >>>>
> >>>> Ok.
> >>>>
> >>>>> On recv side, after decryption, the raw TLS messages are decoded from
> >>>>> the QUIC crypto frame and then delivered to userspace, so in userspace
> >>>>> it processes certificate validation and also see cleartext ALPN.
> >>>>>
> >>>>> Let me know if I don't make it clear.
> >>>>
> >>>> But the first "new" QUIC pdu from will trigger the accept() to
> >>>> return and userspace (or the kernel helper function) will to
> >>>> all crypto? Or does the first decryption happen in kernel (before accept returns)?
> >>> Good question!
> >>>
> >>> The first "new" QUIC pdu will cause to create a 'request sock' (contains
> >>> 4-tuple and connection IDs only) and queue up to reqsk list of the listen
> >>> sock (if validate_peer_address param is not set), and this pdu is enqueued
> >>> in the inq->backlog_list of the listen sock.
> >>>
> >>> When accept() is called, in Kernel, it dequeues the "request sock" from the
> >>> reqsk list of the listen sock, and creates the accept socket based on this
> >>> reqsk. Then it processes the pdu for this new accept socket from the
> >>> inq->backlog_list of the listen sock, including *decrypting* QUIC packet
> >>> and decoding CRYPTO frame, then deliver the raw/cleartext TLS message to
> >>> the Userspace libquic.
> >>
> >> Ok, when the kernel already decrypts, it could already
> >> find the ALPN. It doesn't mean it should do the full
> >> handshake, but parse enough to find the ALPN.
> > Correct, in-kernel QUIC should only do the QUIC related things,
> > and all TLS handshake msgs must be handled in Userspace.
> > This won't cause "layering violation", as Nick Banks said.
>
> But I think it's unavoidable for the ALPN and SNI fields on
> the server side. As every service tries to use udp port 443
> and somehow that needs to be shared if multiple services want to
> use it.
>
> I guess on the acceptor side we would need to somehow detach low level
> udp struct sock from the logical listen struct sock.
>
> And quic_do_listen_rcv() would need to find the correct logical listening
> socket and call quic_request_sock_enqueue() on the logical socket
> not the low-level udp socket. The same for all stuff happening after
> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
>
The implementation allows one low-level UDP sock to serve multiple
QUIC socks.

Currently, if your 3 quic applications listen on the same address:port
with the SO_REUSEPORT socket option set, an incoming connection is handed
to one of your applications pseudo-randomly, by hash(client_addr+port) via
reuseport_select_sock() in quic_sock_lookup().
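
(For illustration only, a minimal userspace sketch of that kind of selection.
It assumes the hash is scaled onto the number of listeners the way the
kernel's reciprocal_scale() helper does; the FNV-1a hash and the names
fnv1a(), scale() and pick_listener() are hypothetical stand-ins, not code
from the QUIC module.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, used here only as a stand-in for the kernel's own hash. */
static uint32_t fnv1a(const void *data, size_t len, uint32_t h)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Map a 32-bit hash onto [0, n) with a multiply and a shift. */
static uint32_t scale(uint32_t hash, uint32_t n)
{
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Pick one of num_listeners sockets from the client's address and port. */
uint32_t pick_listener(uint32_t client_addr, uint16_t client_port,
		       uint32_t num_listeners)
{
	uint32_t h;

	assert(num_listeners > 0);
	h = fnv1a(&client_addr, sizeof(client_addr), 2166136261u);
	h = fnv1a(&client_port, sizeof(client_port), h);
	return scale(h, num_listeners);
}
```

The same client address and port always map to the same listener, but
nothing in the choice depends on what the client asked for, which is why
an extra ALPN match is needed.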

It should be easy to do a further match on ALPN between these 3 quic
socks that listen on the same address:port to get the right quic sock,
instead of choosing one randomly.

The problem is that this means parsing the TLS Client_Hello message to get
the ALPN in quic_sock_lookup(), which is not a proper thing to do in the
kernel and might be rejected by networking maintainers. I need to check
with them.

Would you be able to work around this by using Unix Domain Sockets to pass
the sockfd to another process?

(Note that we're assuming all your 3 applications are using in-kernel QUIC)

> >> But I don't yet understand how the kernel gets the key to
> >> do the initial decryption, I'd assume some call before listen()
> >> needs to tell the kernel about the keys.
> > For initial decryption, the keys can be derived from the initial packet.
> > basically, it only needs the dst_connection_id from the client initial
> > packet. see:
> >
> >    https://datatracker.ietf.org/doc/html/rfc9001#name-initial-secrets
> >
> > so we don't need to set up anything to kernel for initial's keys.
>
> I got it thanks!
>
> metze
>
Stefan Metzmacher April 19, 2024, 6:51 p.m. UTC | #11
Hi Xin Long,

>> But I think its unavoidable for the ALPN and SNI fields on
>> the server side. As every service tries to use udp port 443
>> and somehow that needs to be shared if multiple services want to
>> use it.
>>
>> I guess on the acceptor side we would need to somehow detach low level
>> udp struct sock from the logical listen struct sock.
>>
>> And quic_do_listen_rcv() would need to find the correct logical listening
>> socket and call quic_request_sock_enqueue() on the logical socket
>> not the lowlevel udo socket. The same for all stuff happening after
>> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
>>
> The implementation allows one low level UDP sock to serve for multiple
> QUIC socks.
> 
> Currently, if your 3 quic applications listen to the same address:port
> with SO_REUSEPORT socket option set, the incoming connection will choose
> one of your applications randomly with hash(client_addr+port) via
> reuseport_select_sock() in quic_sock_lookup().
> 
> It should be easy to do a further match with ALPN between these 3 quic
> socks that listens to the same address:port to get the right quic sock,
> instead of that randomly choosing.

Ah, that sounds good.

> The problem is to parse the TLS Client_Hello message to get the ALPN in
> quic_sock_lookup(), which is not a proper thing to do in kernel, and
> might be rejected by networking maintainers, I need to check with them.

Is the reassembling of CRYPTO frames done in the kernel or
userspace? Can you point me to the place in the code?

If it's really impossible to do in C code, maybe
registering a bpf function that allows a listener
to check the initial quic packet and decide if it wants to serve
that connection would be possible as a last resort?

> Will you be able to work around this by using Unix Domain Sockets pass
> the sockfd to another process?

Not really, as that would require strict coordination between a lot of
independent projects.

> (Note that we're assuming all your 3 applications are using in-kernel QUIC)

Sure, but I guess for servers using port 443 that's the only long-term option,
and I don't think it will be less performant than a userspace implementation.

Thanks!
metze
Xin Long April 19, 2024, 7:19 p.m. UTC | #12
On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher <metze@samba.org> wrote:
>
> Hi Xin Long,
>
> >> But I think its unavoidable for the ALPN and SNI fields on
> >> the server side. As every service tries to use udp port 443
> >> and somehow that needs to be shared if multiple services want to
> >> use it.
> >>
> >> I guess on the acceptor side we would need to somehow detach low level
> >> udp struct sock from the logical listen struct sock.
> >>
> >> And quic_do_listen_rcv() would need to find the correct logical listening
> >> socket and call quic_request_sock_enqueue() on the logical socket
> >> not the lowlevel udo socket. The same for all stuff happening after
> >> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
> >>
> > The implementation allows one low level UDP sock to serve for multiple
> > QUIC socks.
> >
> > Currently, if your 3 quic applications listen to the same address:port
> > with SO_REUSEPORT socket option set, the incoming connection will choose
> > one of your applications randomly with hash(client_addr+port) via
> > reuseport_select_sock() in quic_sock_lookup().
> >
> > It should be easy to do a further match with ALPN between these 3 quic
> > socks that listens to the same address:port to get the right quic sock,
> > instead of that randomly choosing.
>
> Ah, that sounds good.
>
> > The problem is to parse the TLS Client_Hello message to get the ALPN in
> > quic_sock_lookup(), which is not a proper thing to do in kernel, and
> > might be rejected by networking maintainers, I need to check with them.
>
> Is the reassembling of CRYPTO frames done in the kernel or
> userspace? Can you point me to the place in the code?
In the kernel, in quic_inq_handshake_tail(). The Client Initial packet
is processed when accept() is called, and this is the path:

quic_accept()-> quic_accept_sock_init() -> quic_packet_process() ->
quic_packet_handshake_process() -> quic_frame_process() ->
quic_frame_crypto_process() -> quic_inq_handshake_tail().

Note that it's with the accept sock, not the listen sock.

>
> If it's really impossible to do in C code maybe
> registering a bpf function in order to allow a listener
> to check the intial quic packet and decide if it wants to serve
> that connection would be possible as last resort?
That's a smart idea, man!
I think the bpf hook in reuseport_select_sock() is meant for this kind of
selection.

For the Client initial packet (the only packet you need to handle),
I doubt you will need to do the reassembling, as the Client Hello TLS
message is always less than 400 bytes in my env.

But I think you need to decrypt the Client initial packet before
decoding it and then parsing the TLS message from its crypto frame.

BTW, for the TLS message parse, I have some prototype code for
TLS Handshake:
https://github.com/lxin/tls_hs/blob/master/crypto/tls_hs.c#L2084

The path to get ALPN:
tls_msg_handle() -> tls_msg_ch_handle() -> tls_ext_handle()
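
(As a rough idea of how small that parse is, below is a self-contained
userspace sketch that walks a cleartext ClientHello, laid out as in RFC 8446,
and copies out the first protocol name of the ALPN extension, type 16 from
RFC 7301. It is not taken from tls_hs.c or from the QUIC module; the names
struct cursor, take(), take_len(), take_vec() and client_hello_first_alpn()
are hypothetical, and it assumes the whole message sits in one contiguous,
already decrypted buffer.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct cursor {
	const uint8_t *p;
	size_t left;
};

/* Consume n bytes, failing if fewer remain. */
static int take(struct cursor *c, size_t n, const uint8_t **out)
{
	if (c->left < n)
		return -1;
	*out = c->p;
	c->p += n;
	c->left -= n;
	return 0;
}

/* Read a big-endian length field of the given width in bytes. */
static int take_len(struct cursor *c, size_t width, size_t *len)
{
	const uint8_t *b;
	size_t i;

	if (take(c, width, &b))
		return -1;
	for (*len = 0, i = 0; i < width; i++)
		*len = (*len << 8) | b[i];
	return 0;
}

/* Consume a length-prefixed vector and return it as a sub-cursor. */
static int take_vec(struct cursor *c, size_t width, struct cursor *sub)
{
	size_t len;

	if (take_len(c, width, &len) || take(c, len, &sub->p))
		return -1;
	sub->left = len;
	return 0;
}

/* Returns the length of the first ALPN name copied to out, or -1. */
int client_hello_first_alpn(const uint8_t *msg, size_t msg_len,
			    char *out, size_t out_len)
{
	struct cursor c = { msg, msg_len }, body, skip, exts, ext, list, name;
	const uint8_t *b;
	size_t type;

	assert(msg && out);
	if (take(&c, 1, &b) || b[0] != 1)	/* handshake type: client_hello */
		return -1;
	if (take_vec(&c, 3, &body) ||
	    take(&body, 2 + 32, &b) ||		/* legacy_version + random */
	    take_vec(&body, 1, &skip) ||	/* legacy_session_id */
	    take_vec(&body, 2, &skip) ||	/* cipher_suites */
	    take_vec(&body, 1, &skip) ||	/* legacy_compression_methods */
	    take_vec(&body, 2, &exts))		/* extensions */
		return -1;
	while (exts.left) {
		if (take_len(&exts, 2, &type) || take_vec(&exts, 2, &ext))
			return -1;
		if (type != 16)			/* not the ALPN extension */
			continue;
		if (take_vec(&ext, 2, &list) || take_vec(&list, 1, &name))
			return -1;
		if (name.left == 0 || name.left >= out_len)
			return -1;
		memcpy(out, name.p, name.left);
		out[name.left] = '\0';
		return (int)name.left;
	}
	return -1;
}
```

Every read is bounds-checked against the enclosing vector, which is the
part that matters most if something like this runs on untrusted input.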

Hope it may be helpful to you.

>
> > Will you be able to work around this by using Unix Domain Sockets pass
> > the sockfd to another process?
>
> Not really. As that would strict coordination between a lot of
> independent projects.
>
> > (Note that we're assuming all your 3 applications are using in-kernel QUIC)
>
> Sure, but I guess for servers using port 443 that the only long term option.
> and I don't think it will be less performant than a userspace implementation.
Cool.
Xin Long April 20, 2024, 7:32 p.m. UTC | #13
On Fri, Apr 19, 2024 at 3:19 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher <metze@samba.org> wrote:
> >
> > Hi Xin Long,
> >
> > >> But I think its unavoidable for the ALPN and SNI fields on
> > >> the server side. As every service tries to use udp port 443
> > >> and somehow that needs to be shared if multiple services want to
> > >> use it.
> > >>
> > >> I guess on the acceptor side we would need to somehow detach low level
> > >> udp struct sock from the logical listen struct sock.
> > >>
> > >> And quic_do_listen_rcv() would need to find the correct logical listening
> > >> socket and call quic_request_sock_enqueue() on the logical socket
> > >> not the lowlevel udo socket. The same for all stuff happening after
> > >> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
> > >>
> > > The implementation allows one low level UDP sock to serve for multiple
> > > QUIC socks.
> > >
> > > Currently, if your 3 quic applications listen to the same address:port
> > > with SO_REUSEPORT socket option set, the incoming connection will choose
> > > one of your applications randomly with hash(client_addr+port) via
> > > reuseport_select_sock() in quic_sock_lookup().
> > >
> > > It should be easy to do a further match with ALPN between these 3 quic
> > > socks that listens to the same address:port to get the right quic sock,
> > > instead of that randomly choosing.
> >
> > Ah, that sounds good.
> >
> > > The problem is to parse the TLS Client_Hello message to get the ALPN in
> > > quic_sock_lookup(), which is not a proper thing to do in kernel, and
> > > might be rejected by networking maintainers, I need to check with them.
> >
> > Is the reassembling of CRYPTO frames done in the kernel or
> > userspace? Can you point me to the place in the code?
> In quic_inq_handshake_tail() in kernel, for Client Initial packet
> is processed when calling accept(), this is the path:
>
> quic_accept()-> quic_accept_sock_init() -> quic_packet_process() ->
> quic_packet_handshake_process() -> quic_frame_process() ->
> quic_frame_crypto_process() -> quic_inq_handshake_tail().
>
> Note that it's with the accept sock, not the listen sock.
>
> >
> > If it's really impossible to do in C code maybe
> > registering a bpf function in order to allow a listener
> > to check the intial quic packet and decide if it wants to serve
> > that connection would be possible as last resort?
> That's a smart idea! man.
> I think the bpf hook in reuseport_select_sock() is meant to do such
> selection.
>
> For the Client initial packet (the only packet you need to handle),
> I double you will need to do the reassembling, as Client Hello TLS message
> is always less than 400 byte in my env.
>
> But I think you need to do the decryption for the Client initial packet
> before decoding it then parsing the TLS message from its crypto frame.
I created this patch:

https://github.com/lxin/quic/commit/aee0b7c77df3f39941f98bb901c73fdc560befb8

to do this decryption in quic_sock_lookup() before calling
reuseport_select_sock(), so that it provides the bpf selector with
a plain-text QUIC initial packet:

https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.2

If it's complex for you to do the decryption for the initial packet in
the bpf selector, I will apply this patch. Please let me know.

Thanks.

>
> BTW, for the TLS message parse, I have some prototype code for
> TLS Handshake:
> https://github.com/lxin/tls_hs/blob/master/crypto/tls_hs.c#L2084
>
> The path to get ALPN:
> tls_msg_handle() -> tls_msg_ch_handle() -> tls_ext_handle()
>
> Hope it may be helpful to you.
>
> >
> > > Will you be able to work around this by using Unix Domain Sockets pass
> > > the sockfd to another process?
> >
> > Not really. As that would strict coordination between a lot of
> > independent projects.
> >
> > > (Note that we're assuming all your 3 applications are using in-kernel QUIC)
> >
> > Sure, but I guess for servers using port 443 that the only long term option.
> > and I don't think it will be less performant than a userspace implementation.
> Cool.
Stefan Metzmacher April 21, 2024, 7:27 p.m. UTC | #14
Am 20.04.24 um 21:32 schrieb Xin Long:
> On Fri, Apr 19, 2024 at 3:19 PM Xin Long <lucien.xin@gmail.com> wrote:
>>
>> On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher <metze@samba.org> wrote:
>>>
>>> Hi Xin Long,
>>>
>>>>> But I think its unavoidable for the ALPN and SNI fields on
>>>>> the server side. As every service tries to use udp port 443
>>>>> and somehow that needs to be shared if multiple services want to
>>>>> use it.
>>>>>
>>>>> I guess on the acceptor side we would need to somehow detach low level
>>>>> udp struct sock from the logical listen struct sock.
>>>>>
>>>>> And quic_do_listen_rcv() would need to find the correct logical listening
>>>>> socket and call quic_request_sock_enqueue() on the logical socket
>>>>> not the lowlevel udo socket. The same for all stuff happening after
>>>>> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
>>>>>
>>>> The implementation allows one low level UDP sock to serve for multiple
>>>> QUIC socks.
>>>>
>>>> Currently, if your 3 quic applications listen to the same address:port
>>>> with SO_REUSEPORT socket option set, the incoming connection will choose
>>>> one of your applications randomly with hash(client_addr+port) via
>>>> reuseport_select_sock() in quic_sock_lookup().
>>>>
>>>> It should be easy to do a further match with ALPN between these 3 quic
>>>> socks that listens to the same address:port to get the right quic sock,
>>>> instead of that randomly choosing.
>>>
>>> Ah, that sounds good.
>>>
>>>> The problem is to parse the TLS Client_Hello message to get the ALPN in
>>>> quic_sock_lookup(), which is not a proper thing to do in kernel, and
>>>> might be rejected by networking maintainers, I need to check with them.
>>>
>>> Is the reassembling of CRYPTO frames done in the kernel or
>>> userspace? Can you point me to the place in the code?
>> In quic_inq_handshake_tail() in kernel, for Client Initial packet
>> is processed when calling accept(), this is the path:
>>
>> quic_accept()-> quic_accept_sock_init() -> quic_packet_process() ->
>> quic_packet_handshake_process() -> quic_frame_process() ->
>> quic_frame_crypto_process() -> quic_inq_handshake_tail().
>>
>> Note that it's with the accept sock, not the listen sock.
>>
>>>
>>> If it's really impossible to do in C code maybe
>>> registering a bpf function in order to allow a listener
>>> to check the intial quic packet and decide if it wants to serve
>>> that connection would be possible as last resort?
>> That's a smart idea! man.
>> I think the bpf hook in reuseport_select_sock() is meant to do such
>> selection.
>>
>> For the Client initial packet (the only packet you need to handle),
>> I double you will need to do the reassembling, as Client Hello TLS message
>> is always less than 400 byte in my env.
>>
>> But I think you need to do the decryption for the Client initial packet
>> before decoding it then parsing the TLS message from its crypto frame.
> I created this patch:
> 
> https://github.com/lxin/quic/commit/aee0b7c77df3f39941f98bb901c73fdc560befb8
> 
> to do this decryption in quic_sock_look() before calling
> reuseport_select_sock(), so that it provides the bpf selector with
> a plain-text QUIC initial packet:
> 
> https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.2
> 
> If it's complex for you to do the decryption for the initial packet in
> the bpf selector, I will apply this patch. Please let me know.

I guess in addition to quic_server_handshake(), which is called
after accept(), there should be quic_server_prepare_listen()
(and something similar for in-kernel servers) that sets up the reuseport
magic for the socket, so that it's not needed in every application.

It seems there is only a single eBPF program possible per
reuseport group, so there has to be just a single one.

But is it possible for in-kernel servers to also register an eBPF program?

metze
Xin Long April 22, 2024, 8:58 p.m. UTC | #15
On Sun, Apr 21, 2024 at 3:27 PM Stefan Metzmacher <metze@samba.org> wrote:
>
> Am 20.04.24 um 21:32 schrieb Xin Long:
> > On Fri, Apr 19, 2024 at 3:19 PM Xin Long <lucien.xin@gmail.com> wrote:
> >>
> >> On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher <metze@samba.org> wrote:
> >>>
> >>> Hi Xin Long,
> >>>
> >>>>> But I think its unavoidable for the ALPN and SNI fields on
> >>>>> the server side. As every service tries to use udp port 443
> >>>>> and somehow that needs to be shared if multiple services want to
> >>>>> use it.
> >>>>>
> >>>>> I guess on the acceptor side we would need to somehow detach low level
> >>>>> udp struct sock from the logical listen struct sock.
> >>>>>
> >>>>> And quic_do_listen_rcv() would need to find the correct logical listening
> >>>>> socket and call quic_request_sock_enqueue() on the logical socket
> >>>>> not the lowlevel udo socket. The same for all stuff happening after
> >>>>> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
> >>>>>
> >>>> The implementation allows one low level UDP sock to serve for multiple
> >>>> QUIC socks.
> >>>>
> >>>> Currently, if your 3 quic applications listen to the same address:port
> >>>> with SO_REUSEPORT socket option set, the incoming connection will choose
> >>>> one of your applications randomly with hash(client_addr+port) via
> >>>> reuseport_select_sock() in quic_sock_lookup().
> >>>>
> >>>> It should be easy to do a further match with ALPN between these 3 quic
> >>>> socks that listens to the same address:port to get the right quic sock,
> >>>> instead of that randomly choosing.
> >>>
> >>> Ah, that sounds good.
> >>>
> >>>> The problem is to parse the TLS Client_Hello message to get the ALPN in
> >>>> quic_sock_lookup(), which is not a proper thing to do in kernel, and
> >>>> might be rejected by networking maintainers, I need to check with them.
> >>>
> >>> Is the reassembling of CRYPTO frames done in the kernel or
> >>> userspace? Can you point me to the place in the code?
> >> In quic_inq_handshake_tail() in kernel, for Client Initial packet
> >> is processed when calling accept(), this is the path:
> >>
> >> quic_accept()-> quic_accept_sock_init() -> quic_packet_process() ->
> >> quic_packet_handshake_process() -> quic_frame_process() ->
> >> quic_frame_crypto_process() -> quic_inq_handshake_tail().
> >>
> >> Note that it's with the accept sock, not the listen sock.
> >>
> >>>
> >>> If it's really impossible to do in C code maybe
> >>> registering a bpf function in order to allow a listener
> >>> to check the intial quic packet and decide if it wants to serve
> >>> that connection would be possible as last resort?
> >> That's a smart idea! man.
> >> I think the bpf hook in reuseport_select_sock() is meant to do such
> >> selection.
> >>
> >> For the Client initial packet (the only packet you need to handle),
> >> I double you will need to do the reassembling, as Client Hello TLS message
> >> is always less than 400 byte in my env.
> >>
> >> But I think you need to do the decryption for the Client initial packet
> >> before decoding it then parsing the TLS message from its crypto frame.
> > I created this patch:
> >
> > https://github.com/lxin/quic/commit/aee0b7c77df3f39941f98bb901c73fdc560befb8
> >
> > to do this decryption in quic_sock_look() before calling
> > reuseport_select_sock(), so that it provides the bpf selector with
> > a plain-text QUIC initial packet:
> >
> > https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.2
> >
> > If it's complex for you to do the decryption for the initial packet in
> > the bpf selector, I will apply this patch. Please let me know.
>
> I guess in addition to quic_server_handshake(), which is called
> after accept(), there should be quic_server_prepare_listen()
> (and something similar for in kernel servers) that setup the reuseport
> magic for the socket, so that it's not needed in every application.
It's done when calling listen(), see quic_inet_listen()->quic_hash(),
where only listening sockets with sk_reuseport set are
added into the reuseport group.

It means the SO_REUSEPORT sockopt must be set on every socket
before calling listen().

>
> It seems there is only a single ebpf program possible per
> reuseport group, so there has to be just a single one.
Yes, a single eBPF program per reuseport group should work.
See prepare_sk_fds() in the kernel selftests for select_reuseport bpf.

>
> But is it possible for in kernel servers to also register an epbf program?
Good question. TBH, I don't really know much about eBPF programming.
I guess the real problem is how you pass the .o file to kernel space?

Another question is, in the selftests:
tools/testing/selftests/bpf/prog_tests/select_reuseport.c
tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c

they create a global reuseport_array and then add the sockets
to this array for the later lookup, but these sockets are all created
in the same process.

In your case, though, the sockets are created in different processes.
I'm not sure if it's possible to add sockets from different processes
into the same reuseport_array?

Added Martin, who introduced BPF_PROG_TYPE_SK_REUSEPORT.
I guess he may know the answers.

Thanks.
Xin Long April 25, 2024, 6:06 p.m. UTC | #16
On Sun, Apr 21, 2024 at 3:27 PM Stefan Metzmacher <metze@samba.org> wrote:
>
> Am 20.04.24 um 21:32 schrieb Xin Long:
> > On Fri, Apr 19, 2024 at 3:19 PM Xin Long <lucien.xin@gmail.com> wrote:
> >>
> >> On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher <metze@samba.org> wrote:
> >>>
> >>> Hi Xin Long,
> >>>
> >>>>> But I think its unavoidable for the ALPN and SNI fields on
> >>>>> the server side. As every service tries to use udp port 443
> >>>>> and somehow that needs to be shared if multiple services want to
> >>>>> use it.
> >>>>>
> >>>>> I guess on the acceptor side we would need to somehow detach low level
> >>>>> udp struct sock from the logical listen struct sock.
> >>>>>
> >>>>> And quic_do_listen_rcv() would need to find the correct logical listening
> >>>>> socket and call quic_request_sock_enqueue() on the logical socket
> >>>>> not the lowlevel udo socket. The same for all stuff happening after
> >>>>> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
> >>>>>
> >>>> The implementation allows one low level UDP sock to serve for multiple
> >>>> QUIC socks.
> >>>>
> >>>> Currently, if your 3 quic applications listen to the same address:port
> >>>> with SO_REUSEPORT socket option set, the incoming connection will choose
> >>>> one of your applications randomly with hash(client_addr+port) via
> >>>> reuseport_select_sock() in quic_sock_lookup().
> >>>>
> >>>> It should be easy to do a further match with ALPN between these 3 quic
> >>>> socks that listens to the same address:port to get the right quic sock,
> >>>> instead of that randomly choosing.
> >>>
> >>> Ah, that sounds good.
> >>>
> >>>> The problem is to parse the TLS Client_Hello message to get the ALPN in
> >>>> quic_sock_lookup(), which is not a proper thing to do in kernel, and
> >>>> might be rejected by networking maintainers, I need to check with them.
> >>>
> >>> Is the reassembling of CRYPTO frames done in the kernel or
> >>> userspace? Can you point me to the place in the code?
> >> In quic_inq_handshake_tail() in kernel, for Client Initial packet
> >> is processed when calling accept(), this is the path:
> >>
> >> quic_accept()-> quic_accept_sock_init() -> quic_packet_process() ->
> >> quic_packet_handshake_process() -> quic_frame_process() ->
> >> quic_frame_crypto_process() -> quic_inq_handshake_tail().
> >>
> >> Note that it's with the accept sock, not the listen sock.
> >>
> >>>
> >>> If it's really impossible to do in C code maybe
> >>> registering a bpf function in order to allow a listener
> >>> to check the intial quic packet and decide if it wants to serve
> >>> that connection would be possible as last resort?
> >> That's a smart idea! man.
> >> I think the bpf hook in reuseport_select_sock() is meant to do such
> >> selection.
> >>
> >> For the Client initial packet (the only packet you need to handle),
> >> I double you will need to do the reassembling, as Client Hello TLS message
> >> is always less than 400 byte in my env.
> >>
> >> But I think you need to do the decryption for the Client initial packet
> >> before decoding it then parsing the TLS message from its crypto frame.
> > I created this patch:
> >
> > https://github.com/lxin/quic/commit/aee0b7c77df3f39941f98bb901c73fdc560befb8
> >
> > to do this decryption in quic_sock_look() before calling
> > reuseport_select_sock(), so that it provides the bpf selector with
> > a plain-text QUIC initial packet:
> >
> > https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.2
> >
> > If it's complex for you to do the decryption for the initial packet in
> > the bpf selector, I will apply this patch. Please let me know.
>
> I guess in addition to quic_server_handshake(), which is called
> after accept(), there should be quic_server_prepare_listen()
> (and something similar for in kernel servers) that setup the reuseport
> magic for the socket, so that it's not needed in every application.
>
> It seems there is only a single ebpf program possible per
> reuseport group, so there has to be just a single one.
>
> But is it possible for in kernel servers to also register an epbf program?
>
Just confirmed with other eBPF experts: there are no in-kernel interfaces
for loading and interacting with BPF maps/programs (other than from BPF itself).

It seems that we have to do this match in the QUIC stack. In the latest QUIC
code, I added quic_packet_get_alpn(), a 59-line function, to parse ALPNs,
and quic_sock_lookup() then searches for the listen sock with these ALPNs.

I introduced an 'alpn_match' module param, which can be enabled when loading
the quic module with:

  # modprobe quic alpn_match=1

You can test it by tests/sample_test in the latest code:

  Start 3 servers:

    # ./sample_test server 0.0.0.0 1234 \
        ./keys/server-key.pem ./keys/server-cert.pem smbd
    # ./sample_test server 0.0.0.0 1234 \
        ./keys/server-key.pem ./keys/server-cert.pem h3
    # ./sample_test server 0.0.0.0 1234 \
        ./keys/server-key.pem ./keys/server-cert.pem ksmbd

  Try to connect on clients with:

    # ./sample_test client 127.0.0.1 1234 ksmbd
    # ./sample_test client 127.0.0.1 1234 smbd
    # ./sample_test client 127.0.0.1 1234 h3

  to see if the corresponding server responds.

There might be some concerns, but it's also a useful feature that cannot
be implemented in userland QUICs. The commit is here:

https://github.com/lxin/quic/commit/de82f8135f4e9196b503b4ab5b359d88f2b2097f

Please check if this is enough for SMB applications.

Note that, as a listen socket is now identified by [address + port + ALPN]
when alpn_match=1, this feature does NOT require the SO_REUSEPORT socket
option to be set, unless one wants multiple sockets to listen on
the same [address + port + ALPN].
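
(To make the lookup rule concrete, here is a small userspace sketch of
matching an offered ALPN list against listeners keyed by port and ALPN. It
is not the code from the commit above: struct listener and match_listener()
are hypothetical, the address part of the key is left out for brevity, and
trying the client's ALPNs in the order offered is an assumption.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct listener {
	unsigned short port;
	const char *alpn;	/* the one ALPN this listener registered */
};

/*
 * Return the index of the listener on the given port whose ALPN equals one
 * of the ALPNs offered by the client, or -1 if nothing matches.
 */
int match_listener(const struct listener *ls, size_t n_ls,
		   unsigned short port,
		   const char *const *offered, size_t n_offered)
{
	size_t i, j;

	assert(ls || n_ls == 0);
	for (j = 0; j < n_offered; j++) {
		for (i = 0; i < n_ls; i++) {
			/* Whole-string compare: "smbd" must not match "ksmbd". */
			if (ls[i].port == port &&
			    strcmp(ls[i].alpn, offered[j]) == 0)
				return (int)i;
		}
	}
	return -1;
}
```

With the three sample_test servers above registered as "smbd", "h3" and
"ksmbd" on port 1234, a client offering "ksmbd" would land on the third one
and a client offering an unknown ALPN on none of them.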

Thanks.
Martin KaFai Lau April 26, 2024, 4:58 a.m. UTC | #17
On 4/22/24 1:58 PM, Xin Long wrote:
> On Sun, Apr 21, 2024 at 3:27 PM Stefan Metzmacher <metze@samba.org> wrote:
>>
>> Am 20.04.24 um 21:32 schrieb Xin Long:
>>> On Fri, Apr 19, 2024 at 3:19 PM Xin Long <lucien.xin@gmail.com> wrote:
>>>>
>>>> On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher <metze@samba.org> wrote:
>>>>>
>>>>> Hi Xin Long,
>>>>>
>>>>>>> But I think its unavoidable for the ALPN and SNI fields on
>>>>>>> the server side. As every service tries to use udp port 443
>>>>>>> and somehow that needs to be shared if multiple services want to
>>>>>>> use it.
>>>>>>>
>>>>>>> I guess on the acceptor side we would need to somehow detach low level
>>>>>>> udp struct sock from the logical listen struct sock.
>>>>>>>
>>>>>>> And quic_do_listen_rcv() would need to find the correct logical listening
>>>>>>> socket and call quic_request_sock_enqueue() on the logical socket
>>>>>>> not the lowlevel udo socket. The same for all stuff happening after
>>>>>>> quic_request_sock_enqueue() at the end of quic_do_listen_rcv.
>>>>>>>
>>>>>> The implementation allows one low level UDP sock to serve for multiple
>>>>>> QUIC socks.
>>>>>>
>>>>>> Currently, if your 3 quic applications listen to the same address:port
>>>>>> with SO_REUSEPORT socket option set, the incoming connection will choose
>>>>>> one of your applications randomly with hash(client_addr+port) via
>>>>>> reuseport_select_sock() in quic_sock_lookup().
>>>>>>
>>>>>> It should be easy to do a further match with ALPN between these 3 quic
>>>>>> socks that listens to the same address:port to get the right quic sock,
>>>>>> instead of that randomly choosing.
>>>>>
>>>>> Ah, that sounds good.
>>>>>
>>>>>> The problem is to parse the TLS Client_Hello message to get the ALPN in
>>>>>> quic_sock_lookup(), which is not a proper thing to do in kernel, and
>>>>>> might be rejected by networking maintainers, I need to check with them.
>>>>>
>>>>> Is the reassembling of CRYPTO frames done in the kernel or
>>>>> userspace? Can you point me to the place in the code?
>>>> In quic_inq_handshake_tail() in kernel, for Client Initial packet
>>>> is processed when calling accept(), this is the path:
>>>>
>>>> quic_accept()-> quic_accept_sock_init() -> quic_packet_process() ->
>>>> quic_packet_handshake_process() -> quic_frame_process() ->
>>>> quic_frame_crypto_process() -> quic_inq_handshake_tail().
>>>>
>>>> Note that it's with the accept sock, not the listen sock.
>>>>
>>>>>
>>>>> If it's really impossible to do in C code maybe
>>>>> registering a bpf function in order to allow a listener
>>>>> to check the intial quic packet and decide if it wants to serve
>>>>> that connection would be possible as last resort?
>>>> That's a smart idea! man.
>>>> I think the bpf hook in reuseport_select_sock() is meant to do such
>>>> selection.
>>>>
>>>> For the Client initial packet (the only packet you need to handle),
>>>> I double you will need to do the reassembling, as Client Hello TLS message
>>>> is always less than 400 byte in my env.
>>>>
>>>> But I think you need to do the decryption for the Client initial packet
>>>> before decoding it then parsing the TLS message from its crypto frame.
>>> I created this patch:
>>>
>>> https://github.com/lxin/quic/commit/aee0b7c77df3f39941f98bb901c73fdc560befb8
>>>
>>> to do this decryption in quic_sock_look() before calling
>>> reuseport_select_sock(), so that it provides the bpf selector with
>>> a plain-text QUIC initial packet:
>>>
>>> https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.2
>>>
>>> If it's complex for you to do the decryption for the initial packet in
>>> the bpf selector, I will apply this patch. Please let me know.
>>
>> I guess in addition to quic_server_handshake(), which is called
>> after accept(), there should be quic_server_prepare_listen()
>> (and something similar for in kernel servers) that setup the reuseport
>> magic for the socket, so that it's not needed in every application.
> It's done when calling listen(), see quic_inet_listen()->quic_hash()
> where only listening sockets with its sk_reuseport set will be
> added into the reuseport group.
> 
> It means SO_REUSEPORT sockopt must be set for every socket
> before calling listen().
> 
>>
>> It seems there is only a single ebpf program possible per
>> reuseport group, so there has to be just a single one.
> Yes, a single ebpf program per reuseport group should work.
> see prepare_sk_fds() in kernel selftests for select_reuseport bfp.
> 
>>
>> But is it possible for in kernel servers to also register an ebpf program?
> Good question. TBH, I don't really know much about ebpf programming.
> I guess the real problem is how you pass the .o file to kernel space?
> 
> Another question is, in the selftests:
> tools/testing/selftests/bpf/prog_tests/s
> tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c
> 
> it created a global reuseport_array, and then added these sockets
> into this array for the later lookup, but these sockets are all created
> in the same process.
> 
> But your case is that the sockets are created in different processes.
> I'm not sure if it's possible to add sockets from different processes
> into the same reuseport_array?
> 
> Added Martin who introduced BPF_PROG_TYPE_SK_REUSEPORT,
> I guess he may know the answers.

I didn't read the patchset, so I don't know what is wanted here.

From capturing the questions in this and the next email:

the reuseport_array is a bpf map. Like any bpf map, it can be shared across
different processes. Meaning different processes can add sk to the map.

The bpf prog that selects a sk from the reuseport_array is set by the userspace 
through setsockopt(SO_ATTACH_REUSEPORT_EBPF). It is the only way right now, iirc.

If you can summarize what is to be done, it could help to see if there
are ways that work for the use case.


> 
> Thanks.
>
Stefan Metzmacher April 29, 2024, 3:20 p.m. UTC | #18
Hi Xin Long,

>>
> Just confirmed from other ebpf experts, there are no in-kernel interfaces
> for loading and interacting with BPF maps/programs(other than from BPF itself).
> 
> It seems that we have to do this match in QUIC stack. In the latest QUIC
> code, I added quic_packet_get_alpn(), a 59-line function, to parse ALPNs
> and then it will search for the listen sock with these ALPNs in
> quic_sock_lookup().
> 
> I introduced 'alpn_match' module param, and it can be enabled when loading
> the module QUIC by:
> 
>    # modprobe quic alpn_match=1
> 
> You can test it by tests/sample_test in the latest code:
> 
>    Start 3 servers:
> 
>      # ./sample_test server 0.0.0.0 1234 \
>          ./keys/server-key.pem ./keys/server-cert.pem smbd
>      # ./sample_test server 0.0.0.0 1234 \
>          ./keys/server-key.pem ./keys/server-cert.pem h3
>      # ./sample_test server 0.0.0.0 1234 \
>          ./keys/server-key.pem ./keys/server-cert.pem ksmbd
> 
>    Try to connect on clients with:
> 
>      # ./sample_test client 127.0.0.1 1234 ksmbd
>      # ./sample_test client 127.0.0.1 1234 smbd
>      # ./sample_test client 127.0.0.1 1234 h3
> 
>    to see if the corresponding server responds.
> 
> There might be some concerns but it's also a useful feature that can not
> be implemented in userland QUICs. The commit is here:
> 
> https://github.com/lxin/quic/commit/de82f8135f4e9196b503b4ab5b359d88f2b2097f
> 
> Please check if this is enough for SMB applications.

It looks great, thanks!

> Note as a listen socket is now identified by [address + port + ALPN] when
> alpn_match=1, this feature does NOT require SO_REUSEPORT socket option to
> be set, unless one wants multiple sockets to listen to
> the same [address + port + ALPN].

I'd argue that this should be the default and be required before listen()
or maybe before bind(), so that it can return EADDRINUSE. As EADDRINUSE should only
happen for servers it might be useful to have a QUIC_SOCKOPT_LISTEN_ALPN instead of
QUIC_SOCKOPT_ALPN. As QUIC_SOCKOPT_ALPN on a client socket should not let
bind() care about the alpn value at all.

For listens on tcp you also need to specify an explicit port (at least in order
to be useful).

And it would mean that all applications would use it and not block other applications
from using an explicit alpn.

Also a module parameter for this means the administrator would have to take care
of it, which means it might be unusable if loaded with it.

I hope to find some time in the next weeks to play with this.
Should be relatively trivial to create a prototype for samba's smbd.

Thanks!
metze
Xin Long May 2, 2024, 6:08 p.m. UTC | #19
On Mon, Apr 29, 2024 at 11:20 AM Stefan Metzmacher <metze@samba.org> wrote:
>
> Hi Xin Long,
>
> >>
> > Just confirmed from other ebpf experts, there are no in-kernel interfaces
> > for loading and interacting with BPF maps/programs(other than from BPF itself).
> >
> > It seems that we have to do this match in QUIC stack. In the latest QUIC
> > code, I added quic_packet_get_alpn(), a 59-line function, to parse ALPNs
> > and then it will search for the listen sock with these ALPNs in
> > quic_sock_lookup().
> >
> > I introduced 'alpn_match' module param, and it can be enabled when loading
> > the module QUIC by:
> >
> >    # modprobe quic alpn_match=1
> >
> > You can test it by tests/sample_test in the latest code:
> >
> >    Start 3 servers:
> >
> >      # ./sample_test server 0.0.0.0 1234 \
> >          ./keys/server-key.pem ./keys/server-cert.pem smbd
> >      # ./sample_test server 0.0.0.0 1234 \
> >          ./keys/server-key.pem ./keys/server-cert.pem h3
> >      # ./sample_test server 0.0.0.0 1234 \
> >          ./keys/server-key.pem ./keys/server-cert.pem ksmbd
> >
> >    Try to connect on clients with:
> >
> >      # ./sample_test client 127.0.0.1 1234 ksmbd
> >      # ./sample_test client 127.0.0.1 1234 smbd
> >      # ./sample_test client 127.0.0.1 1234 h3
> >
> >    to see if the corresponding server responds.
> >
> > There might be some concerns but it's also a useful feature that can not
> > be implemented in userland QUICs. The commit is here:
> >
> > https://github.com/lxin/quic/commit/de82f8135f4e9196b503b4ab5b359d88f2b2097f
> >
> > Please check if this is enough for SMB applications.
>
> It looks great, thanks!
>
> > Note as a listen socket is now identified by [address + port + ALPN] when
> > alpn_match=1, this feature does NOT require SO_REUSEPORT socket option to
> > be set, unless one wants multiple sockets to listen to
> > the same [address + port + ALPN].
>
> I'd argue that this should be the default and be required before listen()
> or maybe before bind(), so that it can return EADDRINUSE. As EADDRINUSE should only
> happen for servers it might be useful to have a QUIC_SOCKOPT_LISTEN_ALPN instead of
> QUIC_SOCKOPT_ALPN. As QUIC_SOCKOPT_ALPN on a client socket should not let
> bind() care about the alpn value at all.
The latest patches make the kernel always do alpn_match, and also support
multiple ALPNs (split by ',' when setting them via sockopt) on both the
server and client side. Feel free to check.

Note that:
1. As you expected, setsockopt(QUIC_SOCKOPT_ALPN) must be called before
   listen(), and it will return EADDRINUSE if there's a socket already
   listening to the same IP + PORT + ALPN.

2. ALPN bind/match is a *listening* socket thing, so it checks ALPN only
   when adding listening sockets in quic_hash(), and it matches ALPN only
   when looking up listening sockets in quic_sock_lookup().

   Setting ALPNs on client sockets will ONLY pack these ALPNs into the
   Client Initial Packet when connecting starts; there is no bind/match
   for these regular sockets, as they can be found by 4-tuple or a
   source_connection_id. bind() doesn't need to care about ALPN for
   client/regular sockets either.

   So it's fine to use the QUIC_SOCKOPT_ALPN sockopt for both listening
   and regular/client sockets, as the kernel acts differently on ALPNs
   for listening and regular sockets. (Sorry for the confusion; I could
   have created another hashtable for listening sockets.)

   In other words, a listen socket is identified by:

        local_ip + local_port + ALPN(s)

   while a regular socket (representing a QUIC connection) is identified by:

       local_ip + local_port + remote_ip + remote_port

   or any of its

       source_connection_ids.

3. SO_REUSEPORT still applies for load balancing between multiple
   processes listening to the same IP + PORT + ALPN, like:

   on server:
   process A: skA = listen(127.0.0.1:1234:smbd)
   process B: skB = listen(127.0.0.1:1234:smbd)
   process C: skC = listen(127.0.0.1:1234:smbd)

   on client:
   connect(127.0.0.1:1234:smbd)
   connect(127.0.0.1:1234:smbd)
   ...

   on server it will select the sk among (skA, skB and skC) based on the
   source address + port in the request from client.

4. Not sure if multiple-ALPN support is useful to you; here are some
   examples of how it works:
   - Without SO_REUSEPORT set:

     On server:
     process A: skA = listen(127.0.0.1:1234:smbd,h3,ksmbd)
     process B: skB = listen(127.0.0.1:1234:smbd,h3,ksmbd)

     listen() in process B fails and returns EADDRINUSE.

   - with SO_REUSEPORT set:

     On server:
     process A: skA = listen(127.0.0.1:1234:smbd,h3,ksmbd)
     process B: skB = listen(127.0.0.1:1234:smbd,h3,ksmbd)

     listen() in process B works.

   - with or without SO_REUSEPORT set:

     On server:
     process A: skA = listen(127.0.0.1:1234:h3,ksmbd)
     process B: skB = listen(127.0.0.1:1234:h3,smbd)
     (the ALPN lists overlap but are not exactly the same)

     listen() in process B fails and returns EADDRINUSE.

   - the match priority for multiple ALPNs is based on the order on the
     client ALPN list:

     On server:
     process A: skA = listen(127.0.0.1:1234:smbd)
     process B: skB = listen(127.0.0.1:1234:h3)
     process C: skC = listen(127.0.0.1:1234:ksmbd)

     On client:
     process X: skX = connect(127.0.0.1:1234:h3,ksmbd,smbd)

     skB will be the one selected to accept the connection, as h3 is the
     1st ALPN on the client ALPN list 'h3,ksmbd,smbd'.
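
The rules in point 4 can be modeled in plain userspace C, as a rough sketch
only. None of this is the kernel code: struct listener, listen_conflict() and
select_listener() are hypothetical names, "the same ALPNs" is assumed to mean
an identical comma-separated string, and SO_REUSEPORT is assumed to be needed
on both sockets.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a listening socket on one IP + PORT. */
struct listener {
	const char *alpns;	/* comma-separated, e.g. "smbd,h3" */
	bool reuseport;		/* SO_REUSEPORT set before listen() */
};

/* True if the len-byte token tok is an entry of the comma-separated list. */
static bool alpn_in_list(const char *list, const char *tok, size_t len)
{
	const char *p = list;

	while (*p) {
		size_t n = strcspn(p, ",");

		if (n == len && memcmp(p, tok, len) == 0)
			return true;
		p += n;
		if (*p == ',')
			p++;
	}
	return false;
}

/* True if lists a and b share at least one ALPN. */
bool alpn_lists_overlap(const char *a, const char *b)
{
	const char *p = a;

	while (*p) {
		size_t n = strcspn(p, ",");

		if (n > 0 && alpn_in_list(b, p, n))
			return true;
		p += n;
		if (*p == ',')
			p++;
	}
	return false;
}

/* 0 if new_sk may listen next to old_sk, -EADDRINUSE otherwise. */
int listen_conflict(const struct listener *old_sk,
		    const struct listener *new_sk)
{
	if (!alpn_lists_overlap(old_sk->alpns, new_sk->alpns))
		return 0;
	/* Assumption: identical lists are shareable only with SO_REUSEPORT. */
	if (strcmp(old_sk->alpns, new_sk->alpns) == 0 &&
	    old_sk->reuseport && new_sk->reuseport)
		return 0;
	return -EADDRINUSE;
}

/* Index of the listener chosen by client ALPN order, or -1 if none match. */
int select_listener(const char *client_alpns,
		    const struct listener *sks, int n)
{
	const char *p = client_alpns;

	while (*p) {
		size_t len = strcspn(p, ",");

		for (int i = 0; len > 0 && i < n; i++)
			if (alpn_in_list(sks[i].alpns, p, len))
				return i;
		p += len;
		if (*p == ',')
			p++;
	}
	return -1;
}
```

With listeners "smbd", "h3" and "ksmbd" in that order, a client list of
"h3,ksmbd,smbd" selects index 1 in this model, which corresponds to skB in
the last example.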

>
> For listens on tcp you also need to specify an explicit port (at least in order
> to be useful).
>
> And it would mean that all applications would use it and not block other applications
> from using an explicit alpn.
>
> Also a module parameter for this means the administrator would have to take care
> of it, which means it might be unusable if loaded with it.
Agree, already dropped this param.

>
> I hope to find some time in the next weeks to play with this.
> Should be relatively trivial to create a prototype for samba's smbd.
Sounds Cool!

Thanks.