
[net-next,01/12] net: homa: define user-visible API for Homa

Message ID: 20241028213541.1529-2-ouster@cs.stanford.edu
State: Changes Requested
Delegated to: Netdev Maintainers
Series: Begin upstreaming Homa transport protocol

Checks

Context                         Check    Description
netdev/series_format            success  Posting correctly formatted
netdev/tree_selection           success  Clearly marked for net-next, async
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 5 this patch: 5
netdev/build_tools              success  Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers           success  CCed 1 of 1 maintainers
netdev/build_clang              success  Errors and warnings before: 4 this patch: 4
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 8 this patch: 8
netdev/checkpatch               warning  WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 82 exceeds 80 columns
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     fail     Errors and warnings before: 0 this patch: 14
netdev/source_inline            success  Was 0 now: 0

Commit Message

John Ousterhout Oct. 28, 2024, 9:35 p.m. UTC
Note: for man pages, see the Homa Wiki at:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 include/uapi/linux/homa.h | 199 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 199 insertions(+)
 create mode 100644 include/uapi/linux/homa.h

Comments

Andrew Lunn Oct. 29, 2024, 9:59 p.m. UTC | #1
> +/**
> + * struct homa_recvmsg_args - Provides information needed by Homa's
> + * recvmsg; passed to recvmsg using the msg_control field.
> + */
> +struct homa_recvmsg_args {
> +	/**
> +	 * @id: (in/out) Initially specifies the id of the desired RPC, or 0
> +	 * if any RPC is OK; returns the actual id received.
> +	 */
> +	uint64_t id;
> +
> +	/**
> +	 * @completion_cookie: (out) If the incoming message is a response,
> +	 * this will return the completion cookie specified when the
> +	 * request was sent. For requests this will always be zero.
> +	 */
> +	uint64_t completion_cookie;
> +
> +	/**
> +	 * @flags: (in) OR-ed combination of bits that control the operation.
> +	 * See below for values.
> +	 */
> +	int flags;

Maybe give this a fixed size, otherwise it gets interesting when you
have a 32 bit userspace running on top of a 64 bit kernel.

> +
> +	/**
> +	 * @error_addr: the address of the peer is stored here when available.
> +	 * This field is different from the msg_name field in struct msghdr
> +	 * in that the msg_name field isn't set after errors. This field will
> +	 * always be set when peer information is available, which includes
> +	 * some error cases.
> +	 */
> +	union sockaddr_in_union peer_addr;
> +
> +	/**
> +	 * @num_bpages: (in/out) Number of valid entries in @bpage_offsets.
> +	 * Passes in bpages from previous messages that can now be
> +	 * recycled; returns bpages from the new message.
> +	 */
> +	uint32_t num_bpages;
> +
> +	uint32_t _pad[1];

If you ever want to be able to use this sometime in the future, it
would be good to document that it should be filled with zero, and test
that it is zero. And if the kernel ever passes this structure back to
userspace it should also fill it with zero.

> +#if !defined(__cplusplus)
> +_Static_assert(sizeof(struct homa_recvmsg_args) >= 120,
> +	       "homa_recvmsg_args shrunk");
> +_Static_assert(sizeof(struct homa_recvmsg_args) <= 120,
> +	       "homa_recvmsg_args grew");

Did you build for 32 bit systems? 

	Andrew
John Ousterhout Oct. 30, 2024, 4:06 a.m. UTC | #2
On Tue, Oct 29, 2024 at 2:59 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > +     int flags;
>
> Maybe give this a fixed size, otherwise it gets interesting when you
> have a 32 bit userspace running on top of a 64 bit kernel.

Good point; will do.

> > +     uint32_t _pad[1];
>
> If you ever want to be able to use this sometime in the future, it
> would be good to document that it should be filled with zero, and test
> that it is zero. And if the kernel ever passes this structure back to
> userspace it should also fill it with zero.

It does have to be filled with zero, and it is checked. I'll document that.

> > +#if !defined(__cplusplus)
> > +_Static_assert(sizeof(struct homa_recvmsg_args) >= 120,
> > +            "homa_recvmsg_args shrunk");
> > +_Static_assert(sizeof(struct homa_recvmsg_args) <= 120,
> > +            "homa_recvmsg_args grew");
>
> Did you build for 32 bit systems?

Sadly no: my development system doesn't currently have any
cross-compiling versions of gcc :-( I think the best thing is to
remove these assertions from the kernel version of Homa. They are
there to make sure that I don't accidentally change the size of the
structure; I will keep them in the GitHub repo for Homa, which should
serve that purpose.

Thanks for the quick comments.

-John-
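
The two points agreed on above (a fixed-width flags field, and padding
that must be zero and is checked) can be sketched roughly as follows.
This is an illustration only, not Homa's actual validation code:
demo_recvmsg_hdr, demo_check_hdr and DEMO_VALID_FLAGS are hypothetical
names, and only the mask value 0x07 (HOMA_RECVMSG_VALID_FLAGS) is taken
from the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Same value as HOMA_RECVMSG_VALID_FLAGS in the posted patch. */
#define DEMO_VALID_FLAGS 0x07u

/* Hypothetical, trimmed-down stand-in for the leading fields of
 * homa_recvmsg_args, with a fixed-width type for flags.
 */
struct demo_recvmsg_hdr {
	uint64_t id;
	uint64_t completion_cookie;
	uint32_t flags;
	uint32_t _pad[1];	/* reserved: must be zero */
};

/* Returns 0 if the header is acceptable, -EINVAL otherwise. Rejecting
 * nonzero padding and unknown flag bits now keeps both available for
 * future extensions.
 */
static int demo_check_hdr(const struct demo_recvmsg_hdr *hdr)
{
	if (hdr->_pad[0] != 0)
		return -EINVAL;
	if (hdr->flags & ~DEMO_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

If old binaries were allowed to pass garbage in _pad, the field could
never be given a meaning later without breaking them, which is why the
check matters as much as the documentation.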
Andrew Lunn Oct. 30, 2024, 12:41 p.m. UTC | #3
> > Did you build for 32 bit systems?
> 
> Sadly no: my development system doesn't currently have any
> cross-compiling versions of gcc :-(

I'm not sure in this case it is actually a cross compile. Your default
amd64 tool chain should also be able to compile for i386.

export ARCH=i386
unset CROSS_COMPILE
make defconfig
make

	Andrew
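
As a rough illustration of what a 32-bit build can expose, the sketch
below compares a structure that uses the same member types as
homa_set_buf_args in the patch (void * and size_t) with a hypothetical
fixed-width variant. The demo_* names are made up for this example and
are not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same member types as homa_set_buf_args in the posted patch: the size
 * of both members follows the ABI (typically 4 bytes each on i386 and
 * 8 bytes each on x86-64).
 */
struct demo_native_args {
	void *start;
	size_t length;
};

/* Hypothetical fixed-width alternative: same layout on every ABI. */
struct demo_fixed_args {
	uint64_t start;		/* user address carried as an integer */
	uint64_t length;
};

/* Holds for 32-bit and 64-bit builds alike. */
_Static_assert(sizeof(struct demo_fixed_args) == 16,
	       "fixed-width layout must not depend on the ABI");
```

With a multilib toolchain installed, compiling this with and without
-m32 and printing sizeof(struct demo_native_args) should show the
native variant changing size while the fixed-width one stays at 16
bytes.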
John Ousterhout Nov. 1, 2024, 5:47 p.m. UTC | #4
On Wed, Oct 30, 2024 at 5:41 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > Did you build for 32 bit systems?
> >
> > Sadly no: my development system doesn't currently have any
> > cross-compiling versions of gcc :-(
>
> I'm not sure in this case it is actually a cross compile. Your default
> amd64 tool chain should also be able to compile for i386.
>
> export ARCH=i386
> unset CROSS_COMPILE
> make defconfig
> make

Thanks for this additional information. I have now compiled Homa
(along with the rest of the kernel) for ARCH=i386; in the process I
learned about uintptr_t and do_div.

Question: is the distinction between the types u64 and __u64
significant? If so, is there someplace where it is explained when I
should use each? So far I have been using __u64 (almost) everywhere.

-John-
Andrew Lunn Nov. 1, 2024, 6:01 p.m. UTC | #5
On Fri, Nov 01, 2024 at 10:47:20AM -0700, John Ousterhout wrote:
> On Wed, Oct 30, 2024 at 5:41 AM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > > Did you build for 32 bit systems?
> > >
> > > Sadly no: my development system doesn't currently have any
> > > cross-compiling versions of gcc :-(
> >
> > I'm not sure in this case it is actually a cross compile. Your default
> > amd64 tool chain should also be able to compile for i386.
> >
> > export ARCH=i386
> > unset CROSS_COMPILE
> > make defconfig
> > make
> 
> Thanks for this additional information. I have now compiled Homa
> (along with the rest of the kernel) for ARCH=i386; in the process I
> learned about uintptr_t and do_div.
> 

> Question: is the distinction between the types u64 and __u64
> significant? If so, is there someplace where it is explained when I
> should use each? So far I have been using __u64 (almost) everywhere.

/include/uapi/asm-generic/int-ll64.h says:

/*
 * __xx is ok: it doesn't pollute the POSIX namespace. Use these in the
 * header files exported to user space
 */

So for files you export to userspace, anything in include/uapi, you
should be using __u64. In the kernel, I think it does not matter, and
I did find:

typedef __u64 u64;

so they probably end up identical. u64 seems more popular in net/ than
__u64, probably because it is shorter.

	Andrew
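
A minimal sketch of that advice, assuming a Linux system with the
kernel's user-space headers installed. demo_sendmsg_args is a
hypothetical structure whose fields mirror homa_sendmsg_args from the
patch, with __u64 in place of uint64_t.

```c
#include <linux/types.h>	/* __u32, __u64: usable from user space */

/* Hypothetical uAPI-style structure. The field names follow
 * homa_sendmsg_args in the patch; only the types differ.
 */
struct demo_sendmsg_args {
	__u64 id;
	__u64 completion_cookie;
};

_Static_assert(sizeof(__u64) == 8, "__u64 is a 64-bit type");
_Static_assert(sizeof(struct demo_sendmsg_args) == 16,
	       "two 64-bit fields with no padding");
```

Because the __-prefixed names come from <linux/types.h>, a header
written this way does not need <stdint.h> and does not add names such
as u64 to a user program's namespace.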
Edward Cree Nov. 7, 2024, 9:58 p.m. UTC | #6
On 28/10/2024 21:35, John Ousterhout wrote:
> Note: for man pages, see the Homa Wiki at:
> https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
> 
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
...
> +/**
> + * Holds either an IPv4 or IPv6 address (smaller and easier to use than
> + * sockaddr_storage).
> + */
> +union sockaddr_in_union {
> +	struct sockaddr sa;
> +	struct sockaddr_in in4;
> +	struct sockaddr_in6 in6;
> +};

Are there fundamental reasons why Homa can only run over IP and not
 other L3 networks?  Or performance measurements showing that the
 cost of using sockaddr_storage is excessive?
Otherwise, baking this into the uAPI seems unwise.

> +	/**
> +	 * @error_addr: the address of the peer is stored here when available.
> +	 * This field is different from the msg_name field in struct msghdr
> +	 * in that the msg_name field isn't set after errors. This field will
> +	 * always be set when peer information is available, which includes
> +	 * some error cases.
> +	 */
> +	union sockaddr_in_union peer_addr;

Member name (peer_addr) doesn't match the kerneldoc (@error_addr).

> +int     homa_send(int sockfd, const void *message_buf,
> +		  size_t length, const union sockaddr_in_union *dest_addr,
> +		  uint64_t *id, uint64_t completion_cookie);
> +int     homa_sendv(int sockfd, const struct iovec *iov,
> +		   int iovcnt, const union sockaddr_in_union *dest_addr,
> +		   uint64_t *id, uint64_t completion_cookie);
> +ssize_t homa_reply(int sockfd, const void *message_buf,
> +		   size_t length, const union sockaddr_in_union *dest_addr,
> +		   uint64_t id);
> +ssize_t homa_replyv(int sockfd, const struct iovec *iov,
> +		    int iovcnt, const union sockaddr_in_union *dest_addr,
> +		    uint64_t id);

I don't think these belong in here.  They seem to be userland
 library functions which wrap the sendmsg syscall, and as far as
 I can tell the definitions corresponding to these prototypes do
 not appear in the patch series.
John Ousterhout Nov. 8, 2024, 5:55 p.m. UTC | #7
On Thu, Nov 7, 2024 at 1:58 PM Edward Cree <ecree.xilinx@gmail.com> wrote:
>
> On 28/10/2024 21:35, John Ousterhout wrote:
> > Note: for man pages, see the Homa Wiki at:
> > https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
> >
> > Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> ...
> > +/**
> > + * Holds either an IPv4 or IPv6 address (smaller and easier to use than
> > + * sockaddr_storage).
> > + */
> > +union sockaddr_in_union {
> > +     struct sockaddr sa;
> > +     struct sockaddr_in in4;
> > +     struct sockaddr_in6 in6;
> > +};
>
> Are there fundamental reasons why Homa can only run over IP and not
>  other L3 networks?  Or performance measurements showing that the
>  cost of using sockaddr_storage is excessive?
> Otherwise, baking this into the uAPI seems unwise.

This structure made it easier to write code that runs over both IPv4
and IPv6. But, I see your point about the limitations it creates
(there is no fundamental reason Homa couldn't run over other datagram
protocols). In looking over the code, I don't think this structure is
used anymore in the kernel code or the kernel-user interface (it
appears in one structure, but I believe that field is now obsolete and
can be eliminated); its remaining uses are in user-level code. I will
remove sockaddr_in_union from this file.

> > +     /**
> > +      * @error_addr: the address of the peer is stored here when available.
> > +      * This field is different from the msg_name field in struct msghdr
> > +      * in that the msg_name field isn't set after errors. This field will
> > +      * always be set when peer information is available, which includes
> > +      * some error cases.
> > +      */
> > +     union sockaddr_in_union peer_addr;
>
> Member name (peer_addr) doesn't match the kerneldoc (@error_addr).

I will fix.

> > +int     homa_send(int sockfd, const void *message_buf,
> > +               size_t length, const union sockaddr_in_union *dest_addr,
> > +               uint64_t *id, uint64_t completion_cookie);
> > +int     homa_sendv(int sockfd, const struct iovec *iov,
> > +                int iovcnt, const union sockaddr_in_union *dest_addr,
> > +                uint64_t *id, uint64_t completion_cookie);
> > +ssize_t homa_reply(int sockfd, const void *message_buf,
> > +                size_t length, const union sockaddr_in_union *dest_addr,
> > +                uint64_t id);
> > +ssize_t homa_replyv(int sockfd, const struct iovec *iov,
> > +                 int iovcnt, const union sockaddr_in_union *dest_addr,
> > +                 uint64_t id);
>
> I don't think these belong in here.  They seem to be userland
>  library functions which wrap the sendmsg syscall, and as far as
>  I can tell the definitions corresponding to these prototypes do
>  not appear in the patch series.

I'll remove for now. This leaves open the question of where these
declarations should go once the userland library is upstreamed. Those
library methods are low-level wrappers that make it easier to use the
sendmsg kernel call for Homa; users will probably think of them as if
they were system calls. It feels awkward to require people to #include
2 different header files in order to use Homa kernel calls; is it
considered bad form to mix declarations for very low-level methods
like these ("not much more than kernel calls") with those for "real"
kernel calls? Do you know of other low-level kernel-call wrappers in
Linux that are analogous to these? If so, how are they handled?

Thanks for your comments.

-John-
Edward Cree Nov. 8, 2024, 10:02 p.m. UTC | #8
On 08/11/2024 17:55, John Ousterhout wrote:
> This leaves open the question of where these
> declarations should go once the userland library is upstreamed. Those
> library methods are low-level wrappers that make it easier to use the
> sendmsg kernel call for Homa; users will probably think of them as if
> they were system calls. It feels awkward to require people to #include
> 2 different header files in order to use Homa kernel calls; is it
> considered bad form to mix declarations for very low-level methods
> like these ("not much more than kernel calls") with those for "real"
> kernel calls?

include/uapi/ does sometimes contain 'static inline' wrappers.  But
 declarations for actual functions that need linkage are avoided AFAICT.
The expectation normally is that userland application code will #include
 a library header, which takes care of #including any necessary kernel
 uAPI headers, ideally packaged separately from the kernel rather than
 just taking the include/uapi/ directory of whatever kernel is currently
 running.  (Back in the day there were some classic Linus rants[1]
 warning against the latter.)
Then both the helper functions and their declarations live in the
 library, where they can be linked into the application, and not mixed
 in with the kernel headers.

> Do you know of other low-level kernel-call wrappers in
> Linux that are analogous to these? If so, how are they handled?

The closest analogy that comes to mind is the bpf system call and libbpf.
libbpf lives in the tools/lib/bpf/ directory of the kernel tree, but is
 often packaged and distributed independently[2] of the kernel package.
If there is a reason to tie the maintenance of your wrappers to the
 kernel project/git repo then this can be suitable.

But I'm not an expert on this, so I hope someone with more experience
 around uAPI stuff will chime in.  Might be worth CCing linux-api[3] on
 the next version of this patch.

HTH,
-ed

[1]: https://yarchive.net/comp/linux/kernel_headers.html#23
[2]: https://github.com/libbpf/libbpf
[3]: https://www.kernel.org/doc/man-pages/linux-api-ml.html
Stephen Hemminger Nov. 8, 2024, 10:32 p.m. UTC | #9
On Fri, 8 Nov 2024 22:02:27 +0000
Edward Cree <ecree.xilinx@gmail.com> wrote:

> > Do you know of other low-level kernel-call wrappers in
> > Linux that are analogous to these? If so, how are they handled?  
> 
> The closest analogy that comes to mind is the bpf system call and libbpf.
> libbpf lives in the tools/lib/bpf/ directory of the kernel tree, but is
>  often packaged and distributed independently[2] of the kernel package.
> If there is a reason to tie the maintenance of your wrappers to the
>  kernel project/git repo then this can be suitable.

liburing for io_uring calls is a better example.
There are lots of versioning issues in any API. It took several years
for BPF to get to run-anywhere status. Hopefully, you can learn from
those problems.
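
For readers following the wrapper discussion, here is a rough sketch of
the kind of thin helper a separate user-space library could provide
around sendmsg. It is not the homa_send from the patch:
demo_homa_prep_request is a hypothetical name, the structure is a local
copy of homa_sendmsg_args, and setting msg_controllen to the structure
size is an assumption (the patch only says the arguments are passed
through the msg_control field).

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Local copy of the structure defined in the posted patch. */
struct homa_sendmsg_args {
	uint64_t id;
	uint64_t completion_cookie;
};

/* Hypothetical helper: fills in a msghdr for a new request. The caller
 * owns every object passed in and then calls sendmsg() on a Homa socket.
 */
static inline void demo_homa_prep_request(struct msghdr *hdr,
					  struct homa_sendmsg_args *args,
					  struct iovec *iov,
					  const struct sockaddr *dest,
					  socklen_t dest_len,
					  uint64_t completion_cookie)
{
	args->id = 0;	/* 0 means "new request" per the kerneldoc */
	args->completion_cookie = completion_cookie;

	memset(hdr, 0, sizeof(*hdr));
	hdr->msg_name = (void *)dest;
	hdr->msg_namelen = dest_len;
	hdr->msg_iov = iov;
	hdr->msg_iovlen = 1;
	hdr->msg_control = args;
	/* Assumption: the patch does not say how this should be set. */
	hdr->msg_controllen = sizeof(*args);
}
```

A helper like this needs no linkage, so it could live as a static
inline in a library header; anything larger would be built into the
library itself and declared in that library's own header.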

Patch

diff --git a/include/uapi/linux/homa.h b/include/uapi/linux/homa.h
new file mode 100644
index 000000000000..306d272e4b63
--- /dev/null
+++ b/include/uapi/linux/homa.h
@@ -0,0 +1,199 @@ 
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the kernel call interface for the Homa
+ * transport protocol.
+ */
+
+#ifndef _UAPI_LINUX_HOMA_H
+#define _UAPI_LINUX_HOMA_H
+
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <netinet/in.h>
+#include <sys/socket.h>
+#endif
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif
+
+/* IANA-assigned Internet Protocol number for Homa. */
+#define IPPROTO_HOMA 146
+
+/**
+ * define HOMA_MAX_MESSAGE_LENGTH - Maximum bytes of payload in a Homa
+ * request or response message.
+ */
+#define HOMA_MAX_MESSAGE_LENGTH 1000000
+
+/**
+ * define HOMA_BPAGE_SIZE - Number of bytes in pages used for receive
+ * buffers. Must be power of two.
+ */
+#define HOMA_BPAGE_SHIFT 16
+#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MAX_BPAGES: The largest number of bpages that will be required
+ * to store an incoming message.
+ */
+#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
+		>> HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MIN_DEFAULT_PORT - The 16-bit port space is divided into
+ * two nonoverlapping regions. Ports 1-32767 are reserved exclusively
+ * for well-defined server ports. The remaining ports are used for client
+ * ports; these are allocated automatically by Homa. Port 0 is reserved.
+ */
+#define HOMA_MIN_DEFAULT_PORT 0x8000
+
+/**
+ * Holds either an IPv4 or IPv6 address (smaller and easier to use than
+ * sockaddr_storage).
+ */
+union sockaddr_in_union {
+	struct sockaddr sa;
+	struct sockaddr_in in4;
+	struct sockaddr_in6 in6;
+};
+
+/**
+ * struct homa_sendmsg_args - Provides information needed by Homa's
+ * sendmsg; passed to sendmsg using the msg_control field.
+ */
+struct homa_sendmsg_args {
+	/**
+	 * @id: (in/out) An initial value of 0 means a new request is
+	 * being sent; nonzero means the message is a reply to the given
+	 * id. If the message is a request, then the value is modified to
+	 * hold the id of the new RPC.
+	 */
+	uint64_t id;
+
+	/**
+	 * @completion_cookie: (in) Used only for request messages; will be
+	 * returned by recvmsg when the RPC completes. Typically used to
+	 * locate app-specific info about the RPC.
+	 */
+	uint64_t completion_cookie;
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_sendmsg_args) >= 16,
+	       "homa_sendmsg_args shrunk");
+_Static_assert(sizeof(struct homa_sendmsg_args) <= 16,
+	       "homa_sendmsg_args grew");
+#endif
+
+/**
+ * struct homa_recvmsg_args - Provides information needed by Homa's
+ * recvmsg; passed to recvmsg using the msg_control field.
+ */
+struct homa_recvmsg_args {
+	/**
+	 * @id: (in/out) Initially specifies the id of the desired RPC, or 0
+	 * if any RPC is OK; returns the actual id received.
+	 */
+	uint64_t id;
+
+	/**
+	 * @completion_cookie: (out) If the incoming message is a response,
+	 * this will return the completion cookie specified when the
+	 * request was sent. For requests this will always be zero.
+	 */
+	uint64_t completion_cookie;
+
+	/**
+	 * @flags: (in) OR-ed combination of bits that control the operation.
+	 * See below for values.
+	 */
+	int flags;
+
+	/**
+	 * @error_addr: the address of the peer is stored here when available.
+	 * This field is different from the msg_name field in struct msghdr
+	 * in that the msg_name field isn't set after errors. This field will
+	 * always be set when peer information is available, which includes
+	 * some error cases.
+	 */
+	union sockaddr_in_union peer_addr;
+
+	/**
+	 * @num_bpages: (in/out) Number of valid entries in @bpage_offsets.
+	 * Passes in bpages from previous messages that can now be
+	 * recycled; returns bpages from the new message.
+	 */
+	uint32_t num_bpages;
+
+	uint32_t _pad[1];
+
+	/**
+	 * @bpage_offsets: (in/out) Each entry is an offset into the buffer
+	 * region for the socket pool. When returned from recvmsg, the
+	 * offsets indicate where fragments of the new message are stored. All
+	 * entries but the last refer to full buffer pages (HOMA_BPAGE_SIZE bytes)
+	 * and are bpage-aligned. The last entry may refer to a bpage fragment and
+	 * is not necessarily aligned. The application now owns these bpages and
+	 * must eventually return them to Homa, using bpage_offsets in a future
+	 * recvmsg invocation.
+	 */
+	uint32_t bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_recvmsg_args) >= 120,
+	       "homa_recvmsg_args shrunk");
+_Static_assert(sizeof(struct homa_recvmsg_args) <= 120,
+	       "homa_recvmsg_args grew");
+#endif
+
+/* Flag bits for homa_recvmsg_args.flags (see man page for documentation):
+ */
+#define HOMA_RECVMSG_REQUEST       0x01
+#define HOMA_RECVMSG_RESPONSE      0x02
+#define HOMA_RECVMSG_NONBLOCKING   0x04
+#define HOMA_RECVMSG_VALID_FLAGS   0x07
+
+/** define SO_HOMA_SET_BUF: setsockopt option for specifying buffer region. */
+#define SO_HOMA_SET_BUF 10
+
+/** struct homa_set_buf - setsockopt argument for SO_HOMA_SET_BUF. */
+struct homa_set_buf_args {
+	/** @start: First byte of buffer region. */
+	void *start;
+
+	/** @length: Total number of bytes available at @start. */
+	size_t length;
+};
+
+/**
+ * Meanings of the bits in Homa's flag word, which can be set using
+ * "sysctl /net/homa/flags".
+ */
+
+/**
+ * Disable the output throttling mechanism: always send all packets
+ * immediately.
+ */
+#define HOMA_FLAG_DONT_THROTTLE   2
+
+int     homa_send(int sockfd, const void *message_buf,
+		  size_t length, const union sockaddr_in_union *dest_addr,
+		  uint64_t *id, uint64_t completion_cookie);
+int     homa_sendv(int sockfd, const struct iovec *iov,
+		   int iovcnt, const union sockaddr_in_union *dest_addr,
+		   uint64_t *id, uint64_t completion_cookie);
+ssize_t homa_reply(int sockfd, const void *message_buf,
+		   size_t length, const union sockaddr_in_union *dest_addr,
+		   uint64_t id);
+ssize_t homa_replyv(int sockfd, const struct iovec *iov,
+		    int iovcnt, const union sockaddr_in_union *dest_addr,
+		    uint64_t id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _UAPI_LINUX_HOMA_H */
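
As a worked check of the arithmetic in the header above, the sketch
below recomputes HOMA_MAX_BPAGES and the size that the _Static_assert
lines expect. The constants and structure are copied from the patch;
demo_bpages_needed is a hypothetical helper added for illustration, and
the 28-byte size of the address union is an assumption that holds for
struct sockaddr_in6 on Linux with glibc.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* Constants copied from the patch. */
#define HOMA_MAX_MESSAGE_LENGTH 1000000
#define HOMA_BPAGE_SHIFT 16
#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)
#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
		>> HOMA_BPAGE_SHIFT)

/* (1000000 + 65535) >> 16 = 1065535 >> 16 = 16 */
_Static_assert(HOMA_MAX_BPAGES == 16, "a maximum-size message needs 16 bpages");

/* Hypothetical helper: bpages needed for a message of the given length,
 * using the same round-up-and-shift as HOMA_MAX_BPAGES.
 */
static inline uint32_t demo_bpages_needed(uint32_t length)
{
	return (length + HOMA_BPAGE_SIZE - 1) >> HOMA_BPAGE_SHIFT;
}

/* Local copies of the patch's definitions, used only to check layout. */
union sockaddr_in_union {
	struct sockaddr sa;
	struct sockaddr_in in4;
	struct sockaddr_in6 in6;
};

struct homa_recvmsg_args {
	uint64_t id;				/* offset 0 */
	uint64_t completion_cookie;		/* offset 8 */
	int flags;				/* offset 16 */
	union sockaddr_in_union peer_addr;	/* offset 20, 28 bytes */
	uint32_t num_bpages;			/* offset 48 */
	uint32_t _pad[1];			/* offset 52 */
	uint32_t bpage_offsets[HOMA_MAX_BPAGES];	/* offset 56, 64 bytes */
};
```

Under those assumptions the fields add up to 8 + 8 + 4 + 28 + 4 + 4 +
64 = 120 bytes, which matches the value used in the assertions in the
patch.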