From patchwork Mon Oct 28 21:35:28 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854184
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id C3BC61E22EF
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:37:35 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151458; cv=none;
 b=Ae98JQQSppyh9Qq/Qwel9t/jy0kUhi/epjtEnbQ3eMr3dicmnmNiMOVreBiB8pQSDw3s/jD0DM+6933WACsgQsNA9iFThPq/YshcwzWgW3FVwAf79v16Cyl35ZnzJF3qR6eaDW6C3qABkqOVxB/Q9hRyLMXEvuHIuBY9nfh9GTE=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151458; c=relaxed/simple;
	bh=rkABdtpdZQgMw9KpwwCPQnqjc5Zh1orI2MZQCPdhFI4=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=EPxcsFtU6gGYzdQIjx6ves9LqAWvCNDrp8C99D7BRWE5cQgG9CUQduoyLeZcee8RImBUD9jSqfp1/nZ04yZiC2tC7MHz5qQlsn95LyojUgwSjEkQyFbRuO6ICDCZ5Jk4GtTHB7pRCmOWsVqbgQhIh3Of28dKNbyr1XtmBA3TZDQ=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=eRajWTD0; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="eRajWTD0"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=k/waYF7swEutjGVEqGtZvwtJ1dlrOc5cg0ch+n/cOZg=; t=1730151455; x=1731015455;
	b=eRajWTD08Ytgqau7you1otCZVRq6S0NKsWKFk55KsMtxTPVH8IEJvc4haj0HoLs69rMdjTihCAf
	IL+8IFOful12zezrRGg+7MxgxGmDq4Ogmf7WLUikkRlQSFiKZSwuZ4coeDpNf7RlLNubOmmgTzvPV
	xejhJaaVjK2YoK4VtxrcYgaPdSkYLYs20UYLxMlFfaPC2+3Uq7pxlSNzKAO96WBaD4o1/fvQQOJi+
	ETADx9jhP1oQA0EEiSt5BNP3rZXqwX7hHXPYApGl2kp8MXqe/kA2x7fdLSoTW2aDgTXAeDOEJlfAc
	DdLLnLDYWBPjSEG+9qrDmRCfjo6xchOsAylQ==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPJ-0005xj-69; Mon, 28 Oct 2024 14:36:13 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 01/12] net: homa: define user-visible API for Homa
Date: Mon, 28 Oct 2024 14:35:28 -0700
Message-ID: <20241028213541.1529-2-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 21932f0bb9e8b158673ea01bbd84b966
X-Patchwork-Delegate: kuba@kernel.org

Note: for man pages, see the Homa Wiki at:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 include/uapi/linux/homa.h | 199 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 199 insertions(+)
 create mode 100644 include/uapi/linux/homa.h

diff --git a/include/uapi/linux/homa.h b/include/uapi/linux/homa.h
new file mode 100644
index 000000000000..306d272e4b63
--- /dev/null
+++ b/include/uapi/linux/homa.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the kernel call interface for the Homa
+ * transport protocol.
+ */
+
+#ifndef _UAPI_LINUX_HOMA_H
+#define _UAPI_LINUX_HOMA_H
+
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <netinet/in.h>
+#include <sys/socket.h>
+#endif
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif
+
+/* IANA-assigned Internet Protocol number for Homa. */
+#define IPPROTO_HOMA 146
+
+/**
+ * define HOMA_MAX_MESSAGE_LENGTH - Maximum bytes of payload in a Homa
+ * request or response message.
+ */
+#define HOMA_MAX_MESSAGE_LENGTH 1000000
+
+/**
+ * define HOMA_BPAGE_SIZE - Number of bytes in pages used for receive
+ * buffers. Must be power of two.
+ */
+#define HOMA_BPAGE_SHIFT 16
+#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MAX_BPAGES: The largest number of bpages that will be required
+ * to store an incoming message.
+ */
+#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
+		>> HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MIN_DEFAULT_PORT - The 16-bit port space is divided into
+ * two nonoverlapping regions. Ports 1-32767 are reserved exclusively
+ * for well-defined server ports. The remaining ports are used for client
+ * ports; these are allocated automatically by Homa. Port 0 is reserved.
+ */
+#define HOMA_MIN_DEFAULT_PORT 0x8000
+
+/**
+ * Holds either an IPv4 or IPv6 address (smaller and easier to use than
+ * sockaddr_storage).
+ */
+union sockaddr_in_union {
+	struct sockaddr sa;
+	struct sockaddr_in in4;
+	struct sockaddr_in6 in6;
+};
+
+/**
+ * struct homa_sendmsg_args - Provides information needed by Homa's
+ * sendmsg; passed to sendmsg using the msg_control field.
+ */
+struct homa_sendmsg_args {
+	/**
+	 * @id: (in/out) An initial value of 0 means a new request is
+	 * being sent; nonzero means the message is a reply to the given
+	 * id. If the message is a request, then the value is modified to
+	 * hold the id of the new RPC.
+	 */
+	uint64_t id;
+
+	/**
+	 * @completion_cookie: (in) Used only for request messages; will be
+	 * returned by recvmsg when the RPC completes. Typically used to
+	 * locate app-specific info about the RPC.
+	 */
+	uint64_t completion_cookie;
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_sendmsg_args) >= 16,
+	       "homa_sendmsg_args shrunk");
+_Static_assert(sizeof(struct homa_sendmsg_args) <= 16,
+	       "homa_sendmsg_args grew");
+#endif
+
+/**
+ * struct homa_recvmsg_args - Provides information needed by Homa's
+ * recvmsg; passed to recvmsg using the msg_control field.
+ */
+struct homa_recvmsg_args {
+	/**
+	 * @id: (in/out) Initially specifies the id of the desired RPC, or 0
+	 * if any RPC is OK; returns the actual id received.
+	 */
+	uint64_t id;
+
+	/**
+	 * @completion_cookie: (out) If the incoming message is a response,
+	 * this will return the completion cookie specified when the
+	 * request was sent. For requests this will always be zero.
+	 */
+	uint64_t completion_cookie;
+
+	/**
+	 * @flags: (in) OR-ed combination of bits that control the operation.
+	 * See below for values.
+	 */
+	int flags;
+
+	/**
+	 * @error_addr: the address of the peer is stored here when available.
+	 * This field is different from the msg_name field in struct msghdr
+	 * in that the msg_name field isn't set after errors. This field will
+	 * always be set when peer information is available, which includes
+	 * some error cases.
+	 */
+	union sockaddr_in_union peer_addr;
+
+	/**
+	 * @num_bpages: (in/out) Number of valid entries in @bpage_offsets.
+	 * Passes in bpages from previous messages that can now be
+	 * recycled; returns bpages from the new message.
+	 */
+	uint32_t num_bpages;
+
+	uint32_t _pad[1];
+
+	/**
+	 * @bpage_offsets: (in/out) Each entry is an offset into the buffer
+	 * region for the socket pool. When returned from recvmsg, the
+	 * offsets indicate where fragments of the new message are stored. All
+	 * entries but the last refer to full buffer pages (HOMA_BPAGE_SIZE bytes)
+	 * and are bpage-aligned. The last entry may refer to a bpage fragment and
+	 * is not necessarily aligned. The application now owns these bpages and
+	 * must eventually return them to Homa, using bpage_offsets in a future
+	 * recvmsg invocation.
+	 */
+	uint32_t bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_recvmsg_args) >= 120,
+	       "homa_recvmsg_args shrunk");
+_Static_assert(sizeof(struct homa_recvmsg_args) <= 120,
+	       "homa_recvmsg_args grew");
+#endif
+
+/* Flag bits for homa_recvmsg_args.flags (see man page for documentation):
+ */
+#define HOMA_RECVMSG_REQUEST       0x01
+#define HOMA_RECVMSG_RESPONSE      0x02
+#define HOMA_RECVMSG_NONBLOCKING   0x04
+#define HOMA_RECVMSG_VALID_FLAGS   0x07
+
+/** define SO_HOMA_SET_BUF: setsockopt option for specifying buffer region. */
+#define SO_HOMA_SET_BUF 10
+
+/** struct homa_set_buf - setsockopt argument for SO_HOMA_SET_BUF. */
+struct homa_set_buf_args {
+	/** @start: First byte of buffer region. */
+	void *start;
+
+	/** @length: Total number of bytes available at @start. */
+	size_t length;
+};
+
+/**
+ * Meanings of the bits in Homa's flag word, which can be set using
+ * "sysctl /net/homa/flags".
+ */
+
+/**
+ * Disable the output throttling mechanism: always send all packets
+ * immediately.
+ */
+#define HOMA_FLAG_DONT_THROTTLE   2
+
+int     homa_send(int sockfd, const void *message_buf,
+		  size_t length, const union sockaddr_in_union *dest_addr,
+		  uint64_t *id, uint64_t completion_cookie);
+int     homa_sendv(int sockfd, const struct iovec *iov,
+		   int iovcnt, const union sockaddr_in_union *dest_addr,
+		   uint64_t *id, uint64_t completion_cookie);
+ssize_t homa_reply(int sockfd, const void *message_buf,
+		   size_t length, const union sockaddr_in_union *dest_addr,
+		   uint64_t id);
+ssize_t homa_replyv(int sockfd, const struct iovec *iov,
+		    int iovcnt, const union sockaddr_in_union *dest_addr,
+		    uint64_t id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _UAPI_LINUX_HOMA_H */

From patchwork Mon Oct 28 21:35:29 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854189
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 101591EE009
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:37:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151463; cv=none;
 b=liGZF4+kG52DCVa7VVCKOomi7+Hx9W5QXVY7NKbbUeX3hRmAJ0JzqBkj2q3H4FuX1bNilElnrFBeGo2sXND/5K//wFKB+QnzX/1EKI1cyhgYisVG7fEdsCdiqQbkyyuf3JD5bpJAzq/M97TddSmxthkBUp8pF7JpmV/LHq3I/rg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151463; c=relaxed/simple;
	bh=zlnLUOim/QaHKJc9UofPvMKXSv0FNklj9ig6zWE++P8=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=ZnSl1XmTRS+1bOMeYLMgqbpVRM7SgYU89DIj2pVokbp1SzATDXD8j60eLMWhjnx0Df82OOrlKhmTh/wT7fm7kMQHbipkmv61SlDkZyRIkDVRJAz0uBI9cN1lIYhtJoqmxwC83TFUKyz0fGSLrGEExbpJhvox/fDdF51pYH/kH9k=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=XbF5FJAX; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="XbF5FJAX"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=i1+hFTKpDJbk7fQaYfHZF1/cNOl8KtR9xjMKGxdgykY=; t=1730151461; x=1731015461;
	b=XbF5FJAX2d2qiS8WhCCFFP09/fagY0M2KMipOz0KcmAkxoKGsts9RvvEw4wrO3qgO2+Y3IJKJtR
	3+zUPKiVrIvLiFC6a6V7XWHNliDK+AB1B4/L93IjUlgg30MSgoHRg7LuZfJmx0VdhdBvhGLV0/LjN
	cbl//pf8zUwVlNR6MEZ9ok3GNnyZgyhNOKASf9gV83BoTUeumg+bnXpY3UyTQYj1Pqk4V6VYkIrfd
	RLMW82KSH6fvgN3cqbJ17zCOeYU7jOzpLTOmQOlfufImYStdc1NSPKdGgKJVV0tpLJ1C1UJ//qPVs
	w27qnkZ6o/t9Lme5S6Mwau1QT8iU83dWsEHg==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPJ-0005xj-WF; Mon, 28 Oct 2024 14:36:14 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 02/12] net: homa: define Homa packet formats
Date: Mon, 28 Oct 2024 14:35:29 -0700
Message-ID: <20241028213541.1529-3-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: a4197552717df0bd80eeda27cbeae713
X-Patchwork-Delegate: kuba@kernel.org

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_wire.h | 378 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 378 insertions(+)
 create mode 100644 net/homa/homa_wire.h

diff --git a/net/homa/homa_wire.h b/net/homa/homa_wire.h
new file mode 100644
index 000000000000..d7b87dc88cc1
--- /dev/null
+++ b/net/homa/homa_wire.h
@@ -0,0 +1,378 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the on-the-wire format of Homa packets. */
+
+#ifndef _HOMA_WIRE_H
+#define _HOMA_WIRE_H
+
+#include <linux/skbuff.h>
+
+/**
+ * enum homa_packet_type - Defines the possible types of Homa packets.
+ *
+ * See the xxx_header structs below for more information about each type.
+ */
+enum homa_packet_type {
+	DATA               = 0x10,
+	RESEND             = 0x12,
+	UNKNOWN            = 0x13,
+	BUSY               = 0x14,
+	NEED_ACK           = 0x17,
+	ACK                = 0x18,
+	BOGUS              = 0x19,      /* Used only in unit tests. */
+	/* If you add a new type here, you must also do the following:
+	 * 1. Change BOGUS so it is the highest opcode
+	 * 2. Add support for the new opcode in homa_print_packet,
+	 *    homa_print_packet_short, homa_symbol_for_type, and mock_skb_new.
+	 * 3. Add the header length to header_lengths in homa_plumbing.c.
+	 */
+};
+
+/** define HOMA_IPV6_HEADER_LENGTH - Size of IP header (V6). */
+#define HOMA_IPV6_HEADER_LENGTH 40
+
+/** define HOMA_IPV4_HEADER_LENGTH - Size of IP header (V4). */
+#define HOMA_IPV4_HEADER_LENGTH 20
+
+/**
+ * define HOMA_SKB_EXTRA - How many bytes of additional space to allow at the
+ * beginning of each sk_buff, before the IP header. This includes room for a
+ * VLAN header and also includes some extra space, "just to be safe" (not
+ * really sure if this is needed).
+ */
+#define HOMA_SKB_EXTRA 40
+
+/**
+ * define HOMA_ETH_OVERHEAD - Number of bytes per Ethernet packet for Ethernet
+ * header, CRC, preamble, and inter-packet gap.
+ */
+#define HOMA_ETH_OVERHEAD 42
+
+/**
+ * define HOMA_MIN_PKT_LENGTH - Every Homa packet must be padded to at least
+ * this length to meet Ethernet frame size limitations. This number includes
+ * Homa headers and data, but not IP or Ethernet headers.
+ */
+#define HOMA_MIN_PKT_LENGTH 26
+
+/**
+ * define HOMA_MAX_HEADER - Number of bytes in the largest Homa header.
+ */
+#define HOMA_MAX_HEADER 90
+
+/**
+ * define ETHERNET_MAX_PAYLOAD - Maximum length of an Ethernet packet,
+ * excluding preamble, frame delimeter, VLAN header, CRC, and interpacket gap;
+ * i.e. all of this space is available for Homa.
+ */
+#define ETHERNET_MAX_PAYLOAD 1500
+
+/**
+ * define HOMA_MAX_PRIORITIES - The maximum number of priority levels that
+ * Homa can use (the actual number can be restricted to less than this at
+ * runtime). Changing this value will affect packet formats.
+ */
+#define HOMA_MAX_PRIORITIES 8
+
+/**
+ * struct common_header - Wire format for the first bytes in every Homa
+ * packet. This must (mostly) match the format of a TCP header to enable
+ * Homa packets to actually be transmitted as TCP packets (and thereby
+ * take advantage of TSO and other features).
+ */
+struct common_header {
+	/**
+	 * @sport: Port on source machine from which packet was sent.
+	 * Must be in the same position as in a TCP header.
+	 */
+	__be16 sport;
+
+	/**
+	 * @dport: Port on destination that is to receive packet. Must be
+	 * in the same position as in a TCP header.
+	 */
+	__be16 dport;
+
+	/**
+	 * @sequence: corresponds to the sequence number field in TCP headers;
+	 * used in DATA packets to hold the offset in the message of the first
+	 * byte of data. This value will only be correct in the first segment
+	 * of a GSO packet.
+	 */
+	__be32 sequence;
+
+	/**
+	 * The fields below correspond to the acknowledgment field in TCP
+	 * headers; not used by Homa, except for the low-order 8 bits, which
+	 * specify the Homa packet type (one of the values in the
+	 * homa_packet_type enum).
+	 */
+	__be16 ack1;
+	__u8 ack2;
+	__u8 type;
+
+	/**
+	 * @doff: High order 4 bits holds the number of 4-byte chunks in a
+	 * data_header (low-order bits unused). Used only for DATA packets;
+	 * must be in the same position as the data offset in a TCP header.
+	 * Used by TSO to determine where the replicated header portion ends.
+	 */
+	__u8 doff;
+
+	__u8 dummy1;
+
+	/**
+	 * @window: Corresponds to the window field in TCP headers. Not used
+	 * by HOMA.
+	 */
+	__be16 window;
+
+	/**
+	 * @checksum: not used by Homa, but must occupy the same bytes as
+	 * the checksum in a TCP header (TSO may modify this?).
+	 */
+	__be16 checksum;
+
+	__be16 dummy2;
+
+	/**
+	 * @sender_id: the identifier of this RPC as used on the sender (i.e.,
+	 * if the low-order bit is set, then the sender is the server for
+	 * this RPC).
+	 */
+	__be64 sender_id;
+} __packed;
+
+/**
+ * struct homa_ack - Identifies an RPC that can be safely deleted by its
+ * server. After sending the response for an RPC, the server must retain its
+ * state for the RPC until it knows that the client has successfully
+ * received the entire response. An ack indicates this. Clients will
+ * piggyback acks on future data packets, but if a client doesn't send
+ * any data to the server, the server will eventually request an ack
+ * explicitly with a NEED_ACK packet, in which case the client will
+ * return an explicit ACK.
+ */
+struct homa_ack {
+	/**
+	 * @id: The client's identifier for the RPC. 0 means this ack
+	 * is invalid.
+	 */
+	__be64 client_id;
+
+	/** @client_port: The client-side port for the RPC. */
+	__be16 client_port;
+
+	/** @server_port: The server-side port for the RPC. */
+	__be16 server_port;
+} __packed;
+
+/* struct data_header - Contains data for part or all of a Homa message.
+ * An incoming packet consists of a data_header followed by message data.
+ * An outgoing packet can have this simple format as well, or it can be
+ * structured as a GSO packet. GSO packets look like this:
+ *
+ *    No hijacking:
+ *
+ *    |-----------------------|
+ *    |                       |
+ *    |     data_header       |
+ *    |                       |
+ *    |---------------------- |
+ *    |                       |
+ *    |                       |
+ *    |     segment data      |
+ *    |                       |
+ *    |                       |
+ *    |-----------------------|
+ *    |      seg_header       |
+ *    |-----------------------|
+ *    |                       |
+ *    |                       |
+ *    |     segment data      |
+ *    |                       |
+ *    |                       |
+ *    |-----------------------|
+ *    |      seg_header       |
+ *    |-----------------------|
+ *    |                       |
+ *    |                       |
+ *    |     segment data      |
+ *    |                       |
+ *    |                       |
+ *    |-----------------------|
+ *
+ * TSO will not adjust @common.sequence in the segments, so Homa sprinkles
+ * correct offsets (in seg_headers) throughout the segment data; TSO/GSO
+ * will include a different seg_header in each generated packet.
+ */
+
+struct seg_header {
+	/**
+	 * @offset: Offset within message of the first byte of data in
+	 * this segment.
+	 */
+	__be32 offset;
+} __packed;
+
+struct data_header {
+	struct common_header common;
+
+	/** @message_length: Total #bytes in the message. */
+	__be32 message_length;
+
+	/**
+	 * @incoming: The receiver can expect the sender to send all of the
+	 * bytes in the message up to at least this offset (exclusive),
+	 * even without additional grants. This includes unscheduled
+	 * bytes, granted bytes, plus any additional bytes the sender
+	 * transmits unilaterally (e.g., to round up to a full GSO batch).
+	 */
+	__be32 incoming;
+
+	/** @ack: If the @client_id field of this is nonzero, provides info
+	 * about an RPC that the recipient can now safely free. Note: in
+	 * TSO packets this will get duplicated in each of the segments;
+	 * in order to avoid repeated attempts to ack the same RPC,
+	 * homa_gro_receive will clear this field in all segments but the
+	 * first.
+	 */
+	struct homa_ack ack;
+
+	__be16 dummy;
+
+	/**
+	 * @retransmit: 1 means this packet was sent in response to a RESEND
+	 * (it has already been sent previously).
+	 */
+	__u8 retransmit;
+
+	__u8 pad;
+
+	/** @seg: First of possibly many segments. */
+	struct seg_header seg;
+} __packed;
+_Static_assert(sizeof(struct data_header) <= HOMA_MAX_HEADER,
+	       "data_header too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+_Static_assert(sizeof(struct data_header) >= HOMA_MIN_PKT_LENGTH,
+	       "data_header too small: Homa doesn't currently have codeto pad data packets");
+_Static_assert(((sizeof(struct data_header) - sizeof(struct seg_header)) & 0x3) == 0,
+	       " data_header length not a multiple of 4 bytes (required for TCP/TSO compatibility");
+
+/**
+ * homa_data_len() - Returns the total number of bytes in a DATA packet
+ * after the data_header. Note: if the packet is a GSO packet, the result
+ * may include metadata as well as packet data.
+ */
+static inline int homa_data_len(struct sk_buff *skb)
+{
+	return skb->len - skb_transport_offset(skb) - sizeof(struct data_header);
+}
+
+/**
+ * struct resend_header - Wire format for RESEND packets.
+ *
+ * A RESEND is sent by the receiver when it believes that message data may
+ * have been lost in transmission (or if it is concerned that the sender may
+ * have crashed). The receiver should resend the specified portion of the
+ * message, even if it already sent it previously.
+ */
+struct resend_header {
+	/** @common: Fields common to all packet types. */
+	struct common_header common;
+
+	/**
+	 * @offset: Offset within the message of the first byte of data that
+	 * should be retransmitted.
+	 */
+	__be32 offset;
+
+	/**
+	 * @length: Number of bytes of data to retransmit; this could specify
+	 * a range longer than the total message size. Zero is a special case
+	 * used by servers; in this case, there is no need to actually resend
+	 * anything; the purpose of this packet is to trigger an UNKNOWN
+	 * response if the client no longer cares about this RPC.
+	 */
+	__be32 length;
+} __packed;
+_Static_assert(sizeof(struct resend_header) <= HOMA_MAX_HEADER,
+	       "resend_header too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct unknown_header - Wire format for UNKNOWN packets.
+ *
+ * An UNKNOWN packet is sent by either server or client when it receives a
+ * packet for an RPC that is unknown to it. When a client receives an
+ * UNKNOWN packet it will typically restart the RPC from the beginning;
+ * when a server receives an UNKNOWN packet it will typically discard its
+ * state for the RPC.
+ */
+struct unknown_header {
+	/** @common: Fields common to all packet types. */
+	struct common_header common;
+} __packed;
+_Static_assert(sizeof(struct unknown_header) <= HOMA_MAX_HEADER,
+	       "unknown_header too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct busy_header - Wire format for BUSY packets.
+ *
+ * These packets tell the recipient that the sender is still alive (even if
+ * it isn't sending data expected by the recipient).
+ */
+struct busy_header {
+	/** @common: Fields common to all packet types. */
+	struct common_header common;
+} __packed;
+_Static_assert(sizeof(struct busy_header) <= HOMA_MAX_HEADER,
+	       "busy_header too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct need_ack_header - Wire format for NEED_ACK packets.
+ *
+ * These packets ask the recipient (a client) to return an ACK message if
+ * the packet's RPC is no longer active.
+ */
+struct need_ack_header {
+	/** @common: Fields common to all packet types. */
+	struct common_header common;
+} __packed;
+_Static_assert(sizeof(struct need_ack_header) <= HOMA_MAX_HEADER,
+	       "need_ack_header too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct ack_header - Wire format for ACK packets.
+ *
+ * These packets are sent from a client to a server to indicate that
+ * a set of RPCs is no longer active on the client, so the server can
+ * free any state it may have for them.
+ */
+struct ack_header {
+	/** @common: Fields common to all packet types. */
+	struct common_header common;
+
+	/** @num_acks: number of (leading) elements in @acks that are valid. */
+	__be16 num_acks;
+
+#define HOMA_MAX_ACKS_PER_PKT 5
+	struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+} __packed;
+_Static_assert(sizeof(struct ack_header) <= HOMA_MAX_HEADER,
+	       "ack_header too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * homa_local_id(): given an RPC identifier from an input packet (which
+ * is network-encoded), return the decoded id we should use for that
+ * RPC on this machine.
+ * @sender_id:  RPC id from an incoming packet, such as h->common.sender_id
+ */
+static inline __u64 homa_local_id(__be64 sender_id)
+{
+	/* If the client bit was set on the sender side, it needs to be
+	 * removed here, and conversely.
+	 */
+	return be64_to_cpu(sender_id) ^ 1;
+}
+
+#endif /* _HOMA_WIRE_H */

From patchwork Mon Oct 28 21:35:30 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854190
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 639871EE01E
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:37:44 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151468; cv=none;
 b=RVmzmicH9lLZFE4RFd7s9lJgH8xjaNwa40e8WOHdFxMgI5mrg4H523BPRGSPfEhZQaEdvK8OzhWryb6SuM1OBiDsTsDw0D0gQsmi7N3OXgPqQhIoE/aYHHz7IPcjcnU5eWOGlOfNiE8fOi4BFko31fHN5Qnw9+8KLn4MTXX2OUw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151468; c=relaxed/simple;
	bh=OcMowhOyQt7JlKediIU8DY4J52dv0GEXJ/jhxK9Ga/Q=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=XHQ/UbkUWwRqZvbA0rQPnfKQt9VRrPCsyWW+p+05844Wgvchm1SRnKDUDDNY4MJnPDGev3fHYtAES13jQUH0Y4aXBeB6aC0GO11TLkt1VFNNrhjlaI5/GlC0MNOtJ78xttjBwKMVZ166fhzUX8/CDQOKc2fFZGSsH2Tv2313WP4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=PyclYmAg; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="PyclYmAg"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=wxTN3YwJ0oEW+7cJBBBlVXfsAWBU/ChLDi0s/AlWrMw=; t=1730151464; x=1731015464;
	b=PyclYmAgoFzg6Tvf9NbD4PNVMoYOmfYeTNVmaNQW2ihmHkz7TnG3TT/SWFgiPbcEsOULTgfXYgw
	BF3xParEmD1CO2OTeavjsC5CAxPtNHvAn9LO+o61fFqJLCxBCH0ZE5r23vhZz0WakgzcfpuMyBZ6w
	ur31Cgo4q7uoyV28iFPKxz7D67sSoBaJQYUhw4xb6HVUO5hUkJpsIFllmMDjWRGtohCwfn1dtRrV/
	k3Oqx2UmmITMQJSF28FtwLnhZZm6okgiaOMV+cTt2LTIzqVu8xDZprYsiRjkVo1liueuULdSQ+vhh
	WOkbHBVUxiuhLZwNqfP31Cz7j3NIsytjCd3g==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPK-0005xj-P7; Mon, 28 Oct 2024 14:36:16 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 03/12] net: homa: create shared Homa header files
Date: Mon, 28 Oct 2024 14:35:30 -0700
Message-ID: <20241028213541.1529-4-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 9eb37da18be5e863a6692c7c13a337d1
X-Patchwork-Delegate: kuba@kernel.org

homa_impl.h defines "struct homa", which contains overall information
about the Homa transport, plus various odds and ends that are used
throughout the Homa implementation.

homa_stub.h is a temporary header file that provides stubs for
facilities that have omitted for this first patch series. This file
will go away once Home is fully upstreamed.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_impl.h | 801 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_stub.h |  80 +++++
 2 files changed, 881 insertions(+)
 create mode 100644 net/homa/homa_impl.h
 create mode 100644 net/homa/homa_stub.h

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
new file mode 100644
index 000000000000..0c8676726ce7
--- /dev/null
+++ b/net/homa/homa_impl.h
@@ -0,0 +1,801 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions that are shared across the files
+ * that implement Homa for Linux.
+ */
+
+#ifndef _HOMA_IMPL_H
+#define _HOMA_IMPL_H
+
+#include <linux/bug.h>
+
+#include <linux/audit.h>
+#include <linux/icmp.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/completion.h>
+#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/version.h>
+#include <linux/vmalloc.h>
+#include <net/icmp.h>
+#include <net/ip.h>
+#include <net/protocol.h>
+#include <net/inet_common.h>
+#include <net/gro.h>
+#include <net/rps.h>
+#include <uapi/linux/homa.h>
+#include "homa_wire.h"
+
+/* Forward declarations. */
+struct homa_peer;
+struct homa_sock;
+struct homa;
+
+/* Declarations used in this file, so they can't be made at the end. */
+void     homa_throttle_lock_slow(struct homa *homa);
+
+#define sizeof32(type) ((int)(sizeof(type)))
+
+/**
+ * struct homa_interest - Contains various information used while waiting
+ * for incoming messages (indicates what kinds of messages a particular
+ * thread is interested in receiving).
+ */
+struct homa_interest {
+	/**
+	 * @thread: Thread that would like to receive a message. Will get
+	 * woken up when a suitable message becomes available.
+	 */
+	struct task_struct *thread;
+
+	/**
+	 * @ready_rpc: This is actually a (struct homa_rpc *) identifying the
+	 * RPC that was found; NULL if no RPC has been found yet. This
+	 * variable is used for synchronization to handoff the RPC, and
+	 * must be set only after @locked is set.
+	 */
+	atomic_long_t ready_rpc;
+
+	/**
+	 * @locked: Nonzero means that @ready_rpc is locked; only valid
+	 * if @ready_rpc is non-NULL.
+	 */
+	int locked;
+
+	/**
+	 * @core: Core on which @thread was executing when it registered
+	 * its interest.  Used for load balancing (see balance.txt).
+	 */
+	int core;
+
+	/**
+	 * @reg_rpc: RPC whose @interest field points here, or
+	 * NULL if none.
+	 */
+	struct homa_rpc *reg_rpc;
+
+	/**
+	 * @request_links: For linking this object into
+	 * &homa_sock.request_interests. The interest must not be linked
+	 * on either this list or @response_links if @id is nonzero.
+	 */
+	struct list_head request_links;
+
+	/**
+	 * @response_links: For linking this object into
+	 * &homa_sock.request_interests.
+	 */
+	struct list_head response_links;
+};
+
+/**
+ * homa_interest_init() - Fill in default values for all of the fields
+ * of a struct homa_interest.
+ * @interest:   Struct to initialize.
+ */
+static inline void homa_interest_init(struct homa_interest *interest)
+{
+	interest->thread = current;
+	atomic_long_set(&interest->ready_rpc, 0);
+	interest->locked = 0;
+	interest->core = raw_smp_processor_id();
+	interest->reg_rpc = NULL;
+	interest->request_links.next = LIST_POISON1;
+	interest->response_links.next = LIST_POISON1;
+}
+
+/**
+ * struct homa - Overall information about the Homa protocol implementation.
+ *
+ * There will typically only exist one of these at a time, except during
+ * unit tests.
+ */
+struct homa {
+	/**
+	 * @next_outgoing_id: Id to use for next outgoing RPC request.
+	 * This is always even: it's used only to generate client-side ids.
+	 * Accessed without locks.
+	 */
+	atomic64_t next_outgoing_id;
+
+	/**
+	 * @link_idle_time: The time, measured by get_cycles() at which we
+	 * estimate that all of the packets we have passed to Linux for
+	 * transmission will have been transmitted. May be in the past.
+	 * This estimate assumes that only Homa is transmitting data, so
+	 * it could be a severe underestimate if there is competing traffic
+	 * from, say, TCP. Access only with atomic ops.
+	 */
+	atomic64_t link_idle_time __aligned(L1_CACHE_BYTES);
+
+	/**
+	 * @pacer_mutex: Ensures that only one instance of homa_pacer_xmit
+	 * runs at a time. Only used in "try" mode: never block on this.
+	 */
+	spinlock_t pacer_mutex __aligned(L1_CACHE_BYTES);
+
+	/**
+	 * @pacer_fifo_fraction: The fraction of time (in thousandths) when
+	 * the pacer should transmit next from the oldest message, rather
+	 * than the highest-priority message. Set externally via sysctl.
+	 */
+	int pacer_fifo_fraction;
+
+	/**
+	 * @pacer_fifo_count: When this becomes <= zero, it's time for the
+	 * pacer to allow the oldest RPC to transmit.
+	 */
+	int pacer_fifo_count;
+
+	/**
+	 * @pacer_start: get_cycles() time when the pacer last woke up
+	 * (if the pacer is running) or 0 if the pacer is sleeping.
+	 */
+	__u64 pacer_wake_time;
+
+	/**
+	 * @throttle_lock: Used to synchronize access to @throttled_rpcs. To
+	 * insert or remove an RPC from throttled_rpcs, must first acquire
+	 * the RPC's socket lock, then this lock.
+	 */
+	spinlock_t throttle_lock;
+
+	/**
+	 * @throttled_rpcs: Contains all homa_rpcs that have bytes ready
+	 * for transmission, but which couldn't be sent without exceeding
+	 * the queue limits for transmission. Manipulate only with "_rcu"
+	 * functions.
+	 */
+	struct list_head throttled_rpcs;
+
+	/**
+	 * @throttle_add: The get_cycles() time when the most recent RPC
+	 * was added to @throttled_rpcs.
+	 */
+	__u64 throttle_add;
+
+	/**
+	 * @throttle_min_bytes: If a packet has fewer bytes than this, then it
+	 * bypasses the throttle mechanism and is transmitted immediately.
+	 * We have this limit because for very small packets we can't keep
+	 * up with the NIC (we're limited by CPU overheads); there's no
+	 * need for throttling and going through the throttle mechanism
+	 * adds overhead, which slows things down. At least, that's the
+	 * hypothesis (needs to be verified experimentally!). Set externally
+	 * via sysctl.
+	 */
+	int throttle_min_bytes;
+
+	/**
+	 * @next_client_port: A client port number to consider for the
+	 * next Homa socket; increments monotonically. Current value may
+	 * be in the range allocated for servers; must check before using.
+	 * This port may also be in use already; must check.
+	 */
+	__u16 next_client_port __aligned(L1_CACHE_BYTES);
+
+	/**
+	 * @port_map: Information about all open sockets. Dynamically
+	 * allocated; must be kfreed.
+	 */
+	struct homa_socktab *port_map __aligned(L1_CACHE_BYTES);
+
+	/**
+	 * @peertab: Info about all the other hosts we have communicated with.
+	 * Dynamically allocated; must be kfreed.
+	 */
+	struct homa_peertab *peers;
+
+	/** @max_numa: Highest NUMA node id in use by any core. */
+	int max_numa;
+
+	/**
+	 * @unsched_bytes: The number of bytes that may be sent in a
+	 * new message without receiving any grants. There used to be a
+	 * variable rtt_bytes that served this purpose, and was also used
+	 * for window.  Historically, rtt_bytes was intended to be the amount
+	 * of data that can be transmitted over the wire in the time it
+	 * takes to send a full-size data packet and receive back a grant.
+	 * But, for fast networks that value could result in too much
+	 * buffer utilization (and, we wanted to have separate values for
+	 * @unsched_bytes and @window). Set externally via sysctl.
+	 */
+	int unsched_bytes;
+
+	/**
+	 * @window_param: Set externally via sysctl to select a policy for
+	 * computing homa-grant_window. If 0 then homa->grant_window is
+	 * computed dynamically based on the number of RPCs we're currently
+	 * granting to. If nonzero then homa->grant_window will always be the
+	 * same as @window_param.
+	 */
+	int window_param;
+
+	/**
+	 * @link_bandwidth: The raw bandwidth of the network uplink, in
+	 * units of 1e06 bits per second.  Set externally via sysctl.
+	 */
+	int link_mbps;
+
+	/**
+	 * @poll_cycles: The value of @poll_usecs in the units returned
+	 * by get_cycles().
+	 */
+	int poll_cycles;
+
+	/**
+	 * @fifo_grant_increment: how many additional bytes to grant in
+	 * a "pity" grant sent to the oldest outstanding message. Set
+	 * externally via sysctl.
+	 */
+	int fifo_grant_increment;
+
+	/**
+	 * @grant_fifo_fraction: The fraction (in thousandths) of granted
+	 * bytes that should go to the *oldest* incoming message, rather
+	 * than the highest priority ones. Set externally via sysctl.
+	 */
+	int grant_fifo_fraction;
+
+	/**
+	 * @max_overcommit: The maximum number of messages to which Homa will
+	 * send grants at any given point in time.  Set externally via sysctl.
+	 */
+	int max_overcommit;
+
+	/**
+	 * @max_incoming: Homa will try to ensure that the total number of
+	 * bytes senders have permission to send to this host (either
+	 * unscheduled bytes or granted bytes) does not exceeds this value.
+	 * Set externally via sysctl.
+	 */
+	int max_incoming;
+
+	/**
+	 * @max_rpcs_per_peer: If there are multiple incoming messages from
+	 * the same peer, Homa will only issue grants to this many of them
+	 * at a time.  Set externally via sysctl.
+	 */
+	int max_rpcs_per_peer;
+
+	/**
+	 * @resend_ticks: When an RPC's @silent_ticks reaches this value,
+	 * start sending RESEND requests.
+	 */
+	int resend_ticks;
+
+	/**
+	 * @resend_interval: minimum number of homa timer ticks between
+	 * RESENDs for the same RPC.
+	 */
+	int resend_interval;
+
+	/**
+	 * @timeout_ticks: abort an RPC if its silent_ticks reaches this value.
+	 */
+	int timeout_ticks;
+
+	/**
+	 * @timeout_resends: Assume that a server is dead if it has not
+	 * responded after this many RESENDs have been sent to it.
+	 */
+	int timeout_resends;
+
+	/**
+	 * @request_ack_ticks: How many timer ticks we'll wait for the
+	 * client to ack an RPC before explicitly requesting an ack.
+	 * Set externally via sysctl.
+	 */
+	int request_ack_ticks;
+
+	/**
+	 * @reap_limit: Maximum number of packet buffers to free in a
+	 * single call to home_rpc_reap.
+	 */
+	int reap_limit;
+
+	/**
+	 * @dead_buffs_limit: If the number of packet buffers in dead but
+	 * not yet reaped RPCs is less than this number, then Homa reaps
+	 * RPCs in a way that minimizes impact on performance but may permit
+	 * dead RPCs to accumulate. If the number of dead packet buffers
+	 * exceeds this value, then Homa switches to a more aggressive approach
+	 * to reaping RPCs. Set externally via sysctl.
+	 */
+	int dead_buffs_limit;
+
+	/**
+	 * @max_dead_buffs: The largest aggregate number of packet buffers
+	 * in dead (but not yet reaped) RPCs that has existed so far in a
+	 * single socket.  Readable via sysctl, and may be reset via sysctl
+	 * to begin recalculating.
+	 */
+	int max_dead_buffs;
+
+	/**
+	 * @pacer_kthread: Kernel thread that transmits packets from
+	 * throttled_rpcs in a way that limits queue buildup in the
+	 * NIC.
+	 */
+	struct task_struct *pacer_kthread;
+
+	/**
+	 * @pacer_exit: true means that the pacer thread should exit as
+	 * soon as possible.
+	 */
+	bool pacer_exit;
+
+	/**
+	 * @max_nic_queue_ns: Limits the NIC queue length: we won't queue
+	 * up a packet for transmission if link_idle_time is this many
+	 * nanoseconds in the future (or more). Set externally via sysctl.
+	 */
+	int max_nic_queue_ns;
+
+	/**
+	 * @max_nic_queue_cycles: Same as max_nic_queue_ns, except in units
+	 * of get_cycles().
+	 */
+	int max_nic_queue_cycles;
+
+	/**
+	 * @cycles_per_kbyte: the number of cycles, as measured by get_cycles(),
+	 * that it takes to transmit 1000 bytes on our uplink. This is actually
+	 * a slight overestimate of the value, to ensure that we don't
+	 * underestimate NIC queue length and queue too many packets.
+	 */
+	__u32 cycles_per_kbyte;
+
+	/**
+	 * @max_gso_size: Maximum number of bytes that will be included
+	 * in a single output packet that Homa passes to Linux. Can be set
+	 * externally via sysctl to lower the limit already enforced by Linux.
+	 */
+	int max_gso_size;
+
+	/**
+	 * @gso_force_software: A non-zero value will cause Home to perform
+	 * segmentation in software using GSO; zero means ask the NIC to
+	 * perform TSO. Set externally via sysctl.
+	 */
+	int gso_force_software;
+
+	/**
+	 * @max_gro_skbs: Maximum number of socket buffers that can be
+	 * aggregated by the GRO mechanism.  Set externally via sysctl.
+	 */
+	int max_gro_skbs;
+
+	/**
+	 * @gro_policy: An OR'ed together collection of bits that determine
+	 * how Homa packets should be steered for SoftIRQ handling.  A value
+	 * of zero will eliminate any Homa-specific behaviors, reverting
+	 * to the Linux defaults. Set externally via sysctl (but modifying
+	 * it is almost certainly a bad idea; see below).
+	 */
+	int gro_policy;
+
+	/* Bits that can be specified for gro_policy. These were created for
+	 * testing, in order to evaluate various possible policies; you almost
+	 * certainly should not use any value other than HOMA_GRO_NORMAL.
+	 * HOMA_GRO_SAME_CORE         If isolated packets arrive (not part of
+	 *                            a batch) use the GRO core for SoftIRQ also.
+	 * HOMA_GRO_IDLE              Use old mechanism for selecting an idle
+	 *                            core for SoftIRQ (deprecated).
+	 * HOMA_GRO_NEXT              Always use the next core in circular
+	 *                            order for SoftIRQ (deprecated).
+	 * HOMA_GRO_GEN2              Use the new mechanism for selecting an
+	 *                            idle core for SoftIRQ.
+	 * HOMA_GRO_SHORT_BYPASS      Pass all single-packet messages directly
+	 *                            to homa_softirq during GRO (only if the
+	 *                            core isn't overloaded).
+	 * HOMA_GRO_GEN3              Use the "Gen3" mechanisms for load
+	 *                            balancing.
+	 */
+	#define HOMA_GRO_SAME_CORE         2
+	#define HOMA_GRO_IDLE              4
+	#define HOMA_GRO_NEXT              8
+	#define HOMA_GRO_GEN2           0x10
+	#define HOMA_GRO_SHORT_BYPASS   0x40
+	#define HOMA_GRO_GEN3           0x80
+	#define HOMA_GRO_NORMAL      (HOMA_GRO_SAME_CORE | HOMA_GRO_GEN2 | \
+				      HOMA_GRO_SHORT_BYPASS)
+
+	/*
+	 * @busy_usecs: if there has been activity on a core within the
+	 * last @busy_usecs, it is considered to be busy and Homa will
+	 * try to avoid scheduling other activities on the core. See
+	 * balance.txt for more on load balancing. Set externally via sysctl.
+	 */
+	int busy_usecs;
+
+	/** @busy_cycles: Same as busy_usecs except in get_cycles() units. */
+	int busy_cycles;
+
+	/*
+	 * @gro_busy_usecs: if the gap between the completion of
+	 * homa_gro_receive and the next call to homa_gro_receive on the same
+	 * core is less than this, then GRO on that core is considered to be
+	 * "busy", and optimizations such as HOMA_GRO_SHORT_BYPASS will not be
+	 * done because they risk overloading the core. Set externally via
+	 * sysctl.
+	 */
+	int gro_busy_usecs;
+
+	/** @gro_busy_cycles: Same as busy_usecs except in get_cycles() units. */
+	int gro_busy_cycles;
+
+	/**
+	 * @timer_ticks: number of times that homa_timer has been invoked
+	 * (may wraparound, which is safe).
+	 */
+	__u32 timer_ticks;
+
+	/**
+	 * @flags: a collection of bits that can be set using sysctl
+	 * to trigger various behaviors.
+	 */
+	int flags;
+
+	/**
+	 * @bpage_lease_usecs: how long a core can own a bpage (microseconds)
+	 * before its ownership can be revoked to reclaim the page.
+	 */
+	int bpage_lease_usecs;
+
+	/**
+	 * @bpage_lease_cycles: The value of @bpage_lease_usecs in get_cycles
+	 * units.
+	 */
+	int bpage_lease_cycles;
+
+	/**
+	 * @next_id: Set via sysctl; causes next_outgoing_id to be set to
+	 * this value; always reads as zero. Typically used while debugging to
+	 * ensure that different nodes use different ranges of ids.
+	 */
+	int next_id;
+
+	/**
+	 * @temp: the values in this array can be read and written with sysctl.
+	 * They have no officially defined purpose, and are available for
+	 * short-term use during testing.
+	 */
+	int temp[4];
+};
+
+/**
+ * struct homa_skb_info - Additional information needed by Homa for each
+ * outbound DATA packet. Space is allocated for this at the very end of the
+ * linear part of the skb.
+ */
+struct homa_skb_info {
+	/**
+	 * @next_skb: used to link together all of the skb's for a Homa
+	 * message (in order of offset).
+	 */
+	struct sk_buff *next_skb;
+
+	/**
+	 * @wire_bytes: total number of bytes of network bandwidth that
+	 * will be consumed by this packet. This includes everything,
+	 * including additional headers added by GSO, IP header, Ethernet
+	 * header, CRC, preamble, and inter-packet gap.
+	 */
+	int wire_bytes;
+
+	/**
+	 * @data_bytes: total bytes of message data across all of the
+	 * segments in this packet.
+	 */
+	int data_bytes;
+
+	/** @seg_length: maximum number of data bytes in each GSO segment. */
+	int seg_length;
+
+	/**
+	 * @offset: offset within the message of the first byte of data in
+	 * this packet.
+	 */
+	int offset;
+};
+
+/**
+ * homa_get_skb_info() - Return the address of Homa's private information
+ * for an sk_buff.
+ * @skb:     Socket buffer whose info is needed.
+ */
+static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
+{
+	return (struct homa_skb_info *)(skb_end_pointer(skb)
+			- sizeof(struct homa_skb_info));
+}
+
+/**
+ * homa_next_skb() - Compute address of Homa's private link field in @skb.
+ * @skb:     Socket buffer containing private link field.
+ *
+ * Homa needs to keep a list of buffers in a message, but it can't use the
+ * links built into sk_buffs because Homa wants to retain its list even
+ * after sending the packet, and the built-in links get used during sending.
+ * Thus we allocate extra space at the very end of the packet's data
+ * area to hold a forward pointer for a list.
+ */
+static inline struct sk_buff **homa_next_skb(struct sk_buff *skb)
+{
+	return (struct sk_buff **)(skb_end_pointer(skb) - sizeof(char *));
+}
+
+/**
+ * homa_set_doff() - Fills in the doff TCP header field for a Homa packet.
+ * @h:     Packet header whose doff field is to be set.
+ * @size:  Size of the "header", bytes (must be a multiple of 4). This
+ *         information is used only for TSO; it's the number of bytes
+ *         that should be replicated in each segment. The bytes after
+ *         this will be distributed among segments.
+ */
+static inline void homa_set_doff(struct data_header *h, int size)
+{
+	h->common.doff = size << 2;
+}
+
+/**
+ * homa_throttle_lock() - Acquire the throttle lock. If the lock
+ * isn't immediately available, record stats on the waiting time.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+static inline void homa_throttle_lock(struct homa *homa)
+{
+	if (!spin_trylock_bh(&homa->throttle_lock))
+		homa_throttle_lock_slow(homa);
+}
+
+/**
+ * homa_throttle_unlock() - Release the throttle lock.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+static inline void homa_throttle_unlock(struct homa *homa)
+{
+	spin_unlock_bh(&homa->throttle_lock);
+}
+
+/** skb_is_ipv6() - Return true if the packet is encapsulated with IPv6,
+ *  false otherwise (presumably it's IPv4).
+ */
+static inline bool skb_is_ipv6(const struct sk_buff *skb)
+{
+	return ipv6_hdr(skb)->version == 6;
+}
+
+/**
+ * Given an IPv4 address, return an equivalent IPv6 address (an IPv4-mapped
+ * one)
+ * @ip4: IPv4 address, in network byte order.
+ */
+static inline struct in6_addr ipv4_to_ipv6(__be32 ip4)
+{
+	struct in6_addr ret = {};
+
+	if (ip4 == INADDR_ANY)
+		return in6addr_any;
+	ret.in6_u.u6_addr32[2] = htonl(0xffff);
+	ret.in6_u.u6_addr32[3] = ip4;
+	return ret;
+}
+
+/**
+ * ipv6_to_ipv4() - Given an IPv6 address produced by ipv4_to_ipv6, return
+ * the original IPv4 address (in network byte order).
+ * @ip6:  IPv6 address; assumed to be a mapped IPv4 address.
+ */
+static inline __be32 ipv6_to_ipv4(const struct in6_addr ip6)
+{
+	return ip6.in6_u.u6_addr32[3];
+}
+
+/**
+ * skb_canonical_ipv6_addr() - Convert a socket address to the "standard"
+ * form used in Homa, which is always an IPv6 address; if the original address
+ * was IPv4, convert it to an IPv4-mapped IPv6 address.
+ * @addr:   Address to canonicalize.
+ */
+static inline struct in6_addr canonical_ipv6_addr(const union sockaddr_in_union *addr)
+{
+	if (addr) {
+		return (addr->sa.sa_family == AF_INET6)
+			? addr->in6.sin6_addr
+			: ipv4_to_ipv6(addr->in4.sin_addr.s_addr);
+	} else {
+		return in6addr_any;
+	}
+}
+
+/**
+ * skb_canonical_ipv6_saddr() - Given a packet buffer, return its source
+ * address in the "standard" form used in Homa, which is always an IPv6
+ * address; if the original address was IPv4, convert it to an IPv4-mapped
+ * IPv6 address.
+ * @skb:   The source address will be extracted from this packet buffer.
+ */
+static inline struct in6_addr skb_canonical_ipv6_saddr(struct sk_buff *skb)
+{
+	return skb_is_ipv6(skb) ? ipv6_hdr(skb)->saddr
+				: ipv4_to_ipv6(ip_hdr(skb)->saddr);
+}
+
+/**
+ * is_mapped_ipv4() - Return true if an IPv6 address is actually an
+ * IPv4-mapped address, false otherwise.
+ * @x:  The address to check.
+ */
+static inline bool is_mapped_ipv4(const struct in6_addr x)
+{
+	return ((x.in6_u.u6_addr32[0] == 0) &&
+		(x.in6_u.u6_addr32[1] == 0) &&
+		(x.in6_u.u6_addr32[2] == htonl(0xffff)));
+}
+
+/**
+ * tt_addr() - Given an address, return a 4-byte id that will (hopefully)
+ * provide a unique identifier for the address in a timetrace record.
+ * @x:  Address (either IPv6 or IPv4-mapped IPv6)
+ */
+static inline __be32 tt_addr(const struct in6_addr x)
+{
+	return is_mapped_ipv4(x) ? ntohl(x.in6_u.u6_addr32[3])
+			: (x.in6_u.u6_addr32[3] ? ntohl(x.in6_u.u6_addr32[3])
+			: ntohl(x.in6_u.u6_addr32[1]));
+}
+
+#ifdef __UNIT_TEST__
+void unit_log_printf(const char *separator, const char *format, ...)
+		__printf(2, 3);
+#define UNIT_LOG unit_log_printf
+void unit_hook(char *id);
+#define UNIT_HOOK(msg) unit_hook(msg)
+#else
+#define UNIT_LOG(...)
+#define UNIT_HOOK(...)
+#endif
+
+void     homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+			 int port, int error);
+void     homa_abort_sock_rpcs(struct homa_sock *hsk, int error);
+void     homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+		      struct homa_rpc *rpc);
+void     homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
+void     homa_add_to_throttled(struct homa_rpc *rpc);
+int      homa_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+int      homa_bind(struct socket *sk, struct sockaddr *addr,
+		   int addr_len);
+int      homa_check_nic_queue(struct homa *homa, struct sk_buff *skb,
+			      bool force);
+struct homa_interest *homa_choose_interest(struct homa *homa,
+					   struct list_head *head,
+					   int offset);
+void     homa_close(struct sock *sock, long timeout);
+int      homa_copy_to_user(struct homa_rpc *rpc);
+void     homa_data_from_server(struct sk_buff *skb,
+			       struct homa_rpc *crpc);
+void     homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+void     homa_destroy(struct homa *homa);
+int      homa_disconnect(struct sock *sk, int flags);
+void     homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa);
+int      homa_err_handler_v4(struct sk_buff *skb, u32 info);
+int      homa_err_handler_v6(struct sk_buff *skb,
+			     struct inet6_skb_parm *opt, u8 type,  u8 code,
+			     int offset, __be32 info);
+int      homa_fill_data_interleaved(struct homa_rpc *rpc,
+				    struct sk_buff *skb, struct iov_iter *iter);
+struct homa_gap *homa_gap_new(struct list_head *next, int start, int end);
+void     homa_gap_retry(struct homa_rpc *rpc);
+int      homa_get_port(struct sock *sk, unsigned short snum);
+int      homa_getsockopt(struct sock *sk, int level, int optname,
+			 char __user *optval, int __user *option);
+int      homa_hash(struct sock *sk);
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer);
+int      homa_init(struct homa *homa);
+void     homa_incoming_sysctl_changed(struct homa *homa);
+int      homa_ioctl(struct sock *sk, int cmd, int *karg);
+int      homa_message_in_init(struct homa_rpc *rpc, int length);
+int      homa_message_out_fill(struct homa_rpc *rpc,
+			       struct iov_iter *iter, int xmit);
+void     homa_message_out_init(struct homa_rpc *rpc, int length);
+void     homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+			   struct homa_rpc *rpc);
+struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
+				     struct iov_iter *iter, int offset,
+				     int length, int max_seg_data);
+void     homa_outgoing_sysctl_changed(struct homa *homa);
+int      homa_pacer_main(void *transport);
+void     homa_pacer_stop(struct homa *homa);
+void     homa_pacer_xmit(struct homa *homa);
+__poll_t homa_poll(struct file *file, struct socket *sock,
+		   struct poll_table_struct *wait);
+int      homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+		      int flags, int *addr_len);
+int      homa_register_interests(struct homa_interest *interest,
+				 struct homa_sock *hsk, int flags, __u64 id);
+void     homa_remove_from_throttled(struct homa_rpc *rpc);
+void     homa_resend_data(struct homa_rpc *rpc, int start, int end);
+void     homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+			 struct homa_sock *hsk);
+void     homa_rpc_abort(struct homa_rpc *crpc, int error);
+void     homa_rpc_acked(struct homa_sock *hsk,
+			const struct in6_addr *saddr, struct homa_ack *ack);
+void     homa_rpc_free(struct homa_rpc *rpc);
+void     homa_rpc_handoff(struct homa_rpc *rpc);
+int      homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+int      homa_setsockopt(struct sock *sk, int level, int optname,
+			 sockptr_t __user optval, unsigned int optlen);
+int      homa_shutdown(struct socket *sock, int how);
+int      homa_softirq(struct sk_buff *skb);
+void     homa_spin(int ns);
+char    *homa_symbol_for_type(uint8_t type);
+void     homa_timer(struct homa *homa);
+int      homa_timer_main(void *transport);
+void     homa_unhash(struct sock *sk);
+void     homa_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
+				       __u64 id);
+int      homa_xmit_control(enum homa_packet_type type, void *contents,
+			   size_t length, struct homa_rpc *rpc);
+int      __homa_xmit_control(void *contents, size_t length,
+			     struct homa_peer *peer, struct homa_sock *hsk);
+void     homa_xmit_data(struct homa_rpc *rpc, bool force);
+void     __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc);
+void     homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk);
+
+/**
+ * homa_check_pacer() - This method is invoked at various places in Homa to
+ * see if the pacer needs to transmit more packets and, if so, transmit
+ * them. It's needed because the pacer thread may get descheduled by
+ * Linux, result in output stalls.
+ * @homa:    Overall data about the Homa protocol implementation. No locks
+ *           should be held when this function is invoked.
+ * @softirq: Nonzero means this code is running at softirq (bh) level;
+ *           zero means it's running in process context.
+ */
+static inline void homa_check_pacer(struct homa *homa, int softirq)
+{
+	if (list_empty(&homa->throttled_rpcs))
+		return;
+
+	/* The "/2" in the line below gives homa_pacer_main the first chance
+	 * to queue new packets; if the NIC queue becomes more than half
+	 * empty, then we will help out here.
+	 */
+	if ((get_cycles() + homa->max_nic_queue_cycles / 2) <
+			atomic64_read(&homa->link_idle_time))
+		return;
+	homa_pacer_xmit(homa);
+}
+
+extern struct completion homa_pacer_kthread_done;
+#endif /* _HOMA_IMPL_H */
diff --git a/net/homa/homa_stub.h b/net/homa/homa_stub.h
new file mode 100644
index 000000000000..fb2ba4195dd2
--- /dev/null
+++ b/net/homa/homa_stub.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains stripped-down replacements that have been
+ * temporarily removed from Homa during the Linux upstreaming
+ * process. By the time upstreaming is complete this file will
+ * have gone away.
+ */
+
+#ifndef _HOMA_STUB_H
+#define _HOMA_STUB_H
+
+#include "homa_impl.h"
+
+static inline int homa_skb_append_from_iter(struct homa *homa,
+					    struct sk_buff *skb,
+					    struct iov_iter *iter, int length)
+{
+	char *dst = skb_put(skb, length);
+
+	if (copy_from_iter(dst, length, iter) != length)
+		return -EFAULT;
+	return 0;
+}
+
+static inline int homa_skb_append_to_frag(struct homa *homa, struct sk_buff *skb,
+					  void *buf, int length)
+{
+	char *dst = skb_put(skb, length);
+
+	memcpy(dst, buf, length);
+	return 0;
+}
+
+static inline int  homa_skb_append_from_skb(struct homa *homa,
+					    struct sk_buff *dst_skb,
+					    struct sk_buff *src_skb,
+					    int offset, int length)
+{
+	return homa_skb_append_to_frag(homa, dst_skb,
+			skb_transport_header(src_skb) + offset, length);
+}
+
+static inline void homa_skb_free_tx(struct homa *homa, struct sk_buff *skb)
+{
+	kfree_skb(skb);
+}
+
+static inline void homa_skb_free_many_tx(struct homa *homa,
+					 struct sk_buff **skbs, int count)
+{
+	int i;
+
+	for (i = 0; i < count; i++)
+		kfree_skb(skbs[i]);
+}
+
+static inline void homa_skb_get(struct sk_buff *skb, void *dest, int offset,
+				int length)
+{
+	memcpy(dest, skb_transport_header(skb) + offset, length);
+}
+
+static inline struct sk_buff *homa_skb_new_tx(int length)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(HOMA_SKB_EXTRA + HOMA_IPV6_HEADER_LENGTH +
+			sizeof(struct homa_skb_info) + length,
+			GFP_KERNEL);
+	if (likely(skb)) {
+		skb_reserve(skb, HOMA_SKB_EXTRA + HOMA_IPV6_HEADER_LENGTH);
+		skb_reset_transport_header(skb);
+	}
+	return skb;
+}
+
+static inline void homa_skb_stash_pages(struct homa *homa, int length)
+{}
+
+#endif /* _HOMA_STUB_H */

From patchwork Mon Oct 28 21:35:31 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854183
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id EB0DD1DFE2B
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:37:31 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151455; cv=none;
 b=LsCxpdU51AqGX1RO0zcY9rOKrD1Mx6B6IEw6eAjFd6vqbipclhbfKca4vn99xuGdvML7JRpjZN9qm/3OlYgmLjIXqFGyuGdw2tSeujDI/O5f+Aaw0bwtvfaS+c0L+AQFleSgFeQJ2y9d29NmdPtP+85aWpNAkwhh7nFdMcz76SY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151455; c=relaxed/simple;
	bh=lJ5VuUPzd9+HKsfCrkBhMieTP7K2NOshsGLJ25wClzc=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=Ak8rSmxFlM4I0dexMMsD5NqlNb9E5hS4FbKGZz1CyjmQRBHqA2yiASkmKqc7jQaqcALUwMMX2VESPRteqr7ZokYudi/4P/bELYKhH9u5UJi42ZgZAulYFnRIvlPTSu8QLfDX2Yux52fNlG+9qvh34j0UMDIXWJPQ5Dh8HzCNPyI=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=u80ChAbN; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="u80ChAbN"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=alJ+ZCvuJJPphxIhCC0/BuBbLBiwNwA74E4+ji0bNFs=; t=1730151451; x=1731015451;
	b=u80ChAbNPDFanpTxKj3cJL7lGiHNokUPYxewFY0umcorcCgNJflbdPKR7x2zfEAZdOYdGndV94u
	dYT8hBaculpyYine3mwRFXeFDMUf4IW4Gp+ufSm4sKB8MpSrAP/gPJTnvNUvUVDO4YdY0i5ZKOglp
	zAFxToTK0NuPZHNqV6J3YQ+YfdHTeKAI5V/9s9Wk+FYM0U+uG5/15IftdVyvP9Nw2gW+PsitG52QN
	yA05waK1yi0y1nYxmVjrRi/qJ2T2zq87xz2HK/hiaW1/Z4ULFfP2TcPAjyTPPs/+rE5EKfrt4e/qO
	Rb+Uzj+JO8x4TGxu3UHgHXh+VCsTGyu8CUtA==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPM-0005xj-1o; Mon, 28 Oct 2024 14:36:16 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 04/12] net: homa: create homa_pool.h and homa_pool.c
Date: Mon, 28 Oct 2024 14:35:31 -0700
Message-ID: <20241028213541.1529-5-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 9338af1f8a289cc19771722dfe20e169
X-Patchwork-Delegate: kuba@kernel.org

These files implement Homa's mechanism for managing application-level
buffer space for incoming messages This mechanism is needed to allow
Homa to copy data out to user space in parallel with receiving packets;
it was discussed in a talk at NetDev 0x17.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_pool.c | 434 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_pool.h | 152 +++++++++++++++
 2 files changed, 586 insertions(+)
 create mode 100644 net/homa/homa_pool.c
 create mode 100644 net/homa/homa_pool.h

diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
new file mode 100644
index 000000000000..6c59f659fba1
--- /dev/null
+++ b/net/homa/homa_pool.c
@@ -0,0 +1,434 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+#include "homa_impl.h"
+#include "homa_pool.h"
+
+/* This file contains functions that manage user-space buffer pools. */
+
+/* Pools must always have at least this many bpages (no particular
+ * reasoning behind this value).
+ */
+#define MIN_POOL_SIZE 2
+
+/* Used when determining how many bpages to consider for allocation. */
+#define MIN_EXTRA 4
+
+/* When running unit tests, allow HOMA_BPAGE_SIZE and HOMA_BPAGE_SHIFT
+ * to be overridden.
+ */
+#ifdef __UNIT_TEST__
+#include "mock.h"
+#undef HOMA_BPAGE_SIZE
+#define HOMA_BPAGE_SIZE mock_bpage_size
+#undef HOMA_BPAGE_SHIFT
+#define HOMA_BPAGE_SHIFT mock_bpage_shift
+#endif
+
+/**
+ * set_bpages_needed() - Set the bpages_needed field of @pool based
+ * on the length of the first RPC that's waiting for buffer space.
+ * The caller must own the lock for @pool->hsk.
+ * @pool: Pool to update.
+ */
+static inline void set_bpages_needed(struct homa_pool *pool)
+{
+	struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+			struct homa_rpc, buf_links);
+	pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
+			>> HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_pool_init() - Initialize a homa_pool; any previous contents of the
+ * objects are overwritten.
+ * @hsk:          Socket containing the pool to initialize.
+ * @region:       First byte of the memory region for the pool, allocated
+ *                by the application; must be page-aligned.
+ * @region_size:  Total number of bytes available at @buf_region.
+ * Return: Either zero (for success) or a negative errno for failure.
+ */
+int homa_pool_init(struct homa_sock *hsk, void *region, __u64 region_size)
+{
+	struct homa_pool *pool = hsk->buffer_pool;
+	int i, result;
+
+	if (((__u64)region) & ~PAGE_MASK)
+		return -EINVAL;
+	pool->hsk = hsk;
+	pool->region = (char *)region;
+	pool->num_bpages = region_size >> HOMA_BPAGE_SHIFT;
+	pool->descriptors = NULL;
+	pool->cores = NULL;
+	if (pool->num_bpages < MIN_POOL_SIZE) {
+		result = -EINVAL;
+		goto error;
+	}
+	pool->descriptors = kmalloc_array(pool->num_bpages,
+					  sizeof(struct homa_bpage), GFP_ATOMIC);
+	if (!pool->descriptors) {
+		result = -ENOMEM;
+		goto error;
+	}
+	for (i = 0; i < pool->num_bpages; i++) {
+		struct homa_bpage *bp = &pool->descriptors[i];
+
+		spin_lock_init(&bp->lock);
+		atomic_set(&bp->refs, 0);
+		bp->owner = -1;
+		bp->expiration = 0;
+	}
+	atomic_set(&pool->free_bpages, pool->num_bpages);
+	pool->bpages_needed = INT_MAX;
+
+	/* Allocate and initialize core-specific data. */
+	pool->cores = kmalloc_array(nr_cpu_ids, sizeof(struct homa_pool_core),
+				    GFP_ATOMIC);
+	if (!pool->cores) {
+		result = -ENOMEM;
+		goto error;
+	}
+	pool->num_cores = nr_cpu_ids;
+	for (i = 0; i < pool->num_cores; i++) {
+		pool->cores[i].page_hint = 0;
+		pool->cores[i].allocated = 0;
+		pool->cores[i].next_candidate = 0;
+	}
+	pool->check_waiting_invoked = 0;
+
+	return 0;
+
+error:
+	kfree(pool->descriptors);
+	kfree(pool->cores);
+	pool->region = NULL;
+	return result;
+}
+
+/**
+ * homa_pool_destroy() - Destructor for homa_pool. After this method
+ * returns, the object should not be used unless it has been reinitialized.
+ * @pool: Pool to destroy.
+ */
+void homa_pool_destroy(struct homa_pool *pool)
+{
+	if (!pool->region)
+		return;
+	kfree(pool->descriptors);
+	kfree(pool->cores);
+	pool->region = NULL;
+}
+
+/**
+ * homa_pool_get_pages() - Allocate one or more full pages from the pool.
+ * @pool:         Pool from which to allocate pages
+ * @num_pages:    Number of pages needed
+ * @pages:        The indices of the allocated pages are stored here; caller
+ *                must ensure this array is big enough. Reference counts have
+ *                been set to 1 on all of these pages (or 2 if set_owner
+ *                was specified).
+ * @set_owner:    If nonzero, the current core is marked as owner of all
+ *                of the allocated pages (and the expiration time is also
+ *                set). Otherwise the pages are left unowned.
+ * Return: 0 for success, -1 if there wasn't enough free space in the pool.
+ */
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
+			int set_owner)
+{
+	int core_num = raw_smp_processor_id();
+	struct homa_pool_core *core;
+	__u64 now = get_cycles();
+	int alloced = 0;
+	int limit = 0;
+
+	core = &pool->cores[core_num];
+	if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
+		atomic_add(num_pages, &pool->free_bpages);
+		return -1;
+	}
+
+	/* Once we get to this point we know we will be able to find
+	 * enough free pages; now we just have to find them.
+	 */
+	while (alloced != num_pages) {
+		struct homa_bpage *bpage;
+		int cur, ref_count;
+
+		/* If we don't need to use all of the bpages in the pool,
+		 * then try to use only the ones with low indexes. This
+		 * will reduce the cache footprint for the pool by reusing
+		 * a few bpages over and over. Specifically this code will
+		 * not consider any candidate page whose index is >= limit.
+		 * Limit is chosen to make sure there are a reasonable
+		 * number of free pages in the range, so we won't have to
+		 * check a huge number of pages.
+		 */
+		if (limit == 0) {
+			int extra;
+
+			limit = pool->num_bpages
+					- atomic_read(&pool->free_bpages);
+			extra = limit >> 2;
+			limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
+			if (limit > pool->num_bpages)
+				limit = pool->num_bpages;
+		}
+
+		cur = core->next_candidate;
+		core->next_candidate++;
+		if (cur >= limit) {
+			core->next_candidate = 0;
+
+			/* Must recompute the limit for each new loop through
+			 * the bpage array: we may need to consider a larger
+			 * range of pages because of concurrent allocations.
+			 */
+			limit = 0;
+			continue;
+		}
+		bpage = &pool->descriptors[cur];
+
+		/* Figure out whether this candidate is free (or can be
+		 * stolen). Do a quick check without locking the page, and
+		 * if the page looks promising, then lock it and check again
+		 * (must check again in case someone else snuck in and
+		 * grabbed the page).
+		 */
+		ref_count = atomic_read(&bpage->refs);
+		if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
+							  bpage->expiration > now)))
+			continue;
+		if (!spin_trylock_bh(&bpage->lock))
+			continue;
+		ref_count = atomic_read(&bpage->refs);
+		if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
+							  bpage->expiration > now))) {
+			spin_unlock_bh(&bpage->lock);
+			continue;
+		}
+		if (bpage->owner >= 0)
+			atomic_inc(&pool->free_bpages);
+		if (set_owner) {
+			atomic_set(&bpage->refs, 2);
+			bpage->owner = core_num;
+			bpage->expiration = now
+					+ pool->hsk->homa->bpage_lease_cycles;
+		} else {
+			atomic_set(&bpage->refs, 1);
+			bpage->owner = -1;
+		}
+		spin_unlock_bh(&bpage->lock);
+		pages[alloced] = cur;
+		alloced++;
+	}
+	return 0;
+}
+
+/**
+ * homa_pool_allocate() - Allocate buffer space for an RPC.
+ * @rpc:  RPC that needs space allocated for its incoming message (space must
+ *        not already have been allocated). The fields @msgin->num_buffers
+ *        and @msgin->buffers are filled in. Must be locked by caller.
+ * Return: The return value is normally 0, which means either buffer space
+ * was allocated or the @rpc was queued on @hsk->waiting. If a fatal error
+ * occurred, such as no buffer pool present, then a negative errno is
+ * returned.
+ */
+int homa_pool_allocate(struct homa_rpc *rpc)
+{
+	struct homa_pool *pool = rpc->hsk->buffer_pool;
+	int full_pages, partial, i, core_id;
+	__u32 pages[HOMA_MAX_BPAGES];
+	struct homa_pool_core *core;
+	struct homa_bpage *bpage;
+	__u64 now = get_cycles();
+	struct homa_rpc *other;
+
+	if (!pool->region)
+		return -ENOMEM;
+
+	/* First allocate any full bpages that are needed. */
+	full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
+	if (unlikely(full_pages)) {
+		if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
+			goto out_of_space;
+		for (i = 0; i < full_pages; i++)
+			rpc->msgin.bpage_offsets[i] = pages[i] << HOMA_BPAGE_SHIFT;
+	}
+	rpc->msgin.num_bpages = full_pages;
+
+	/* The last chunk may be less than a full bpage; for this we use
+	 * the bpage that we own (and reuse it for multiple messages).
+	 */
+	partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
+	if (unlikely(partial == 0))
+		goto success;
+	core_id = raw_smp_processor_id();
+	core = &pool->cores[core_id];
+	bpage = &pool->descriptors[core->page_hint];
+	if (!spin_trylock_bh(&bpage->lock))
+		spin_lock_bh(&bpage->lock);
+	if (bpage->owner != core_id) {
+		spin_unlock_bh(&bpage->lock);
+		goto new_page;
+	}
+	if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
+		if (atomic_read(&bpage->refs) == 1) {
+			/* Bpage is totally free, so we can reuse it. */
+			core->allocated = 0;
+		} else {
+			bpage->owner = -1;
+
+			/* We know the reference count can't reach zero here
+			 * because of check above, so we won't have to decrement
+			 * pool->free_bpages.
+			 */
+			atomic_dec_return(&bpage->refs);
+			spin_unlock_bh(&bpage->lock);
+			goto new_page;
+		}
+	}
+	bpage->expiration = now + pool->hsk->homa->bpage_lease_cycles;
+	atomic_inc(&bpage->refs);
+	spin_unlock_bh(&bpage->lock);
+	goto allocate_partial;
+
+	/* Can't use the current page; get another one. */
+new_page:
+	if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
+		homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
+					  rpc->msgin.bpage_offsets);
+		rpc->msgin.num_bpages = 0;
+		goto out_of_space;
+	}
+	core->page_hint = pages[0];
+	core->allocated = 0;
+
+allocate_partial:
+	rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
+			+ (core->page_hint << HOMA_BPAGE_SHIFT);
+	rpc->msgin.num_bpages++;
+	core->allocated += partial;
+
+success:
+	return 0;
+
+	/* We get here if there wasn't enough buffer space for this
+	 * message; add the RPC to hsk->waiting_for_bufs.
+	 */
+out_of_space:
+	homa_sock_lock(pool->hsk, "homa_pool_allocate");
+	list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
+		if (other->msgin.length > rpc->msgin.length) {
+			list_add_tail(&rpc->buf_links, &other->buf_links);
+			goto queued;
+		}
+	}
+	list_add_tail_rcu(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
+
+queued:
+	set_bpages_needed(pool);
+	homa_sock_unlock(pool->hsk);
+	return 0;
+}
+
+/**
+ * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
+ * message data.
+ * @rpc:        RPC for which incoming message data is being processed; its
+ *              msgin must be properly initialized and buffer space must have
+ *              been allocated for the message.
+ * @offset:     Offset within @rpc's incoming message.
+ * @available:  Will be filled in with the number of bytes of space available
+ *              at the returned address.
+ * Return:      The application's virtual address for buffer space corresponding
+ *              to @offset in the incoming message for @rpc.
+ */
+void *homa_pool_get_buffer(struct homa_rpc *rpc, int offset, int *available)
+{
+	int bpage_index, bpage_offset;
+
+	bpage_index = offset >> HOMA_BPAGE_SHIFT;
+	BUG_ON(bpage_index >= rpc->msgin.num_bpages);
+	bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
+	*available = (bpage_index < (rpc->msgin.num_bpages - 1))
+			? HOMA_BPAGE_SIZE - bpage_offset
+			: rpc->msgin.length - offset;
+	return rpc->hsk->buffer_pool->region + rpc->msgin.bpage_offsets[bpage_index]
+			+ bpage_offset;
+}
+
+/**
+ * homa_pool_release_buffers() - Release buffer space so that it can be
+ * reused.
+ * @pool:         Pool that the buffer space belongs to. Doesn't need to
+ *                be locked.
+ * @num_buffers:  How many buffers to release.
+ * @buffers:      Points to @num_buffers values, each of which is an offset
+ *                from the start of the pool to the buffer to be released.
+ */
+void homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
+			       __u32 *buffers)
+{
+	int i;
+
+	if (!pool->region)
+		return;
+	for (i = 0; i < num_buffers; i++) {
+		__u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
+		struct homa_bpage *bpage = &pool->descriptors[bpage_index];
+
+		if (bpage_index < pool->num_bpages)
+			if (atomic_dec_return(&bpage->refs) == 0)
+				atomic_inc(&pool->free_bpages);
+		}
+	}
+
+/**
+ * homa_pool_check_waiting() - Checks to see if there are enough free
+ * bpages to wake up any RPCs that were blocked. Whenever
+ * homa_pool_release_buffers is invoked, this function must be invoked later,
+ * at a point when the caller holds no locks (homa_pool_release_buffers may
+ * be invoked with locks held, so it can't safely invoke this function).
+ * This is regrettably tricky, but I can't think of a better solution.
+ * @pool:         Information about the buffer pool.
+ */
+void homa_pool_check_waiting(struct homa_pool *pool)
+{
+#ifdef __UNIT_TEST__
+	pool->check_waiting_invoked += 1;
+#endif
+	while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
+		struct homa_rpc *rpc;
+
+		homa_sock_lock(pool->hsk, "buffer pool");
+		if (list_empty(&pool->hsk->waiting_for_bufs)) {
+			pool->bpages_needed = INT_MAX;
+			homa_sock_unlock(pool->hsk);
+			break;
+		}
+		rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+				       struct homa_rpc, buf_links);
+		if (!homa_bucket_try_lock(rpc->bucket, rpc->id,
+					  "homa_pool_check_waiting")) {
+			/* Can't just spin on the RPC lock because we're
+			 * holding the socket lock (see sync.txt). Instead,
+			 * release the socket lock and try the entire
+			 * operation again.
+			 */
+			homa_sock_unlock(pool->hsk);
+			UNIT_LOG("; ", "rpc lock unavailable in %s", __func__);
+			continue;
+		}
+		list_del_init(&rpc->buf_links);
+		if (list_empty(&pool->hsk->waiting_for_bufs))
+			pool->bpages_needed = INT_MAX;
+		else
+			set_bpages_needed(pool);
+		homa_sock_unlock(pool->hsk);
+		homa_pool_allocate(rpc);
+		if (rpc->msgin.num_bpages > 0)
+			/* Allocation succeeded; "wake up" the RPC. */
+			rpc->msgin.resend_all = 1;
+		homa_rpc_unlock(rpc);
+	}
+}
diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
new file mode 100644
index 000000000000..e14a71bae9e6
--- /dev/null
+++ b/net/homa/homa_pool.h
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions used to manage user-space buffer pools.
+ */
+
+#ifndef _HOMA_POOL_H
+#define _HOMA_POOL_H
+
+#include "homa_rpc.h"
+
+/**
+ * struct homa_bpage - Contains information about a single page in
+ * a buffer pool.
+ */
+struct homa_bpage {
+	union {
+		/**
+		 * @cache_line: Ensures that each homa_bpage object
+		 * is exactly one cache line long.
+		 */
+		char cache_line[L1_CACHE_BYTES];
+		struct {
+			/** @lock: to synchronize shared access. */
+			spinlock_t lock;
+
+			/**
+			 * @refs: Counts number of distinct uses of this
+			 * bpage (1 tick for each message that is using
+			 * this page, plus an additional tick if the @owner
+			 * field is set).
+			 */
+			atomic_t refs;
+
+			/**
+			 * @owner: kernel core that currently owns this page
+			 * (< 0 if none).
+			 */
+			int owner;
+
+			/**
+			 * @expiration: time (in get_cycles units) after
+			 * which it's OK to steal this page from its current
+			 * owner (if @refs is 1).
+			 */
+			__u64 expiration;
+		};
+	};
+};
+
+/**
+ * struct homa_pool_core - Holds core-specific data for a homa_pool (a bpage
+ * out of which that core is allocating small chunks).
+ */
+struct homa_pool_core {
+	union {
+		/**
+		 * @cache_line: Ensures that each object is exactly one
+		 * cache line long.
+		 */
+		char cache_line[L1_CACHE_BYTES];
+		struct {
+			/**
+			 * @page_hint: Index of bpage in pool->descriptors,
+			 * which may be owned by this core. If so, we'll use it
+			 * for allocating partial pages.
+			 */
+			int page_hint;
+
+			/**
+			 * @allocated: if the page given by @page_hint is
+			 * owned by this core, this variable gives the number of
+			 * (initial) bytes that have already been allocated
+			 * from the page.
+			 */
+			int allocated;
+
+			/**
+			 * @next_candidate: when searching for free bpages,
+			 * check this index next.
+			 */
+			int next_candidate;
+		};
+	};
+};
+
+/**
+ * struct homa_pool - Describes a pool of buffer space for incoming
+ * messages for a particular socket; managed by homa_pool.c. The pool is
+ * divided up into "bpages", which are a multiple of the hardware page size.
+ * A bpage may be owned by a particular core so that it can more efficiently
+ * allocate space for small messages.
+ */
+struct homa_pool {
+	/**
+	 * @hsk: the socket that this pool belongs to.
+	 */
+	struct homa_sock *hsk;
+
+	/**
+	 * @region: beginning of the pool's region (in the app's virtual
+	 * memory). Divided into bpages. 0 means the pool hasn't yet been
+	 * initialized.
+	 */
+	char *region;
+
+	/** @num_bpages: total number of bpages in the pool. */
+	int num_bpages;
+
+	/** @descriptors: kmalloced area containing one entry for each bpage. */
+	struct homa_bpage *descriptors;
+
+	/**
+	 * @free_bpages: the number of pages still available for allocation
+	 * by homa_pool_get pages. This equals the number of pages with zero
+	 * reference counts, minus the number of pages that have been claimed
+	 * by homa_get_pool_pages but not yet allocated.
+	 */
+	atomic_t free_bpages;
+
+	/**
+	 * The number of free bpages required to satisfy the needs of the
+	 * first RPC on @hsk->waiting_for_bufs, or INT_MAX if that queue
+	 * is empty.
+	 */
+	int bpages_needed;
+
+	/** @cores: core-specific info; dynamically allocated. */
+	struct homa_pool_core *cores;
+
+	/** @num_cores: number of elements in @cores. */
+	int num_cores;
+
+	/**
+	 * @check_waiting_invoked: incremented during unit tests when
+	 * homa_pool_check_waiting is invoked.
+	 */
+	int check_waiting_invoked;
+};
+
+int      homa_pool_allocate(struct homa_rpc *rpc);
+void     homa_pool_check_waiting(struct homa_pool *pool);
+void     homa_pool_destroy(struct homa_pool *pool);
+void    *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+			      int *available);
+int      homa_pool_get_pages(struct homa_pool *pool, int num_pages,
+			     __u32 *pages, int leave_locked);
+int      homa_pool_init(struct homa_sock *hsk, void *buf_region,
+			__u64 region_size);
+void     homa_pool_release_buffers(struct homa_pool *pool,
+				   int num_buffers, __u32 *buffers);
+
+#endif /* _HOMA_POOL_H */

From patchwork Mon Oct 28 21:35:32 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854176
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DEC091E0B7C
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:18 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151383; cv=none;
 b=dqUCXkHZr6SaJHca1fXW8K6CdPbIdIKtvKyV4tdroGKtCGmISt4rFm1ak9ZhmR08Xr9oG6FhN0QVZ4HAxqDueBmChfHOH0J/51bzhj5xWBZf3Nh670CmhwoOwSuTidiAIIwuXgFNV8T4FeE51Za70SRULOR0LEYtTYEhiXTgmSM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151383; c=relaxed/simple;
	bh=mSE1Od4nE4viRdNWHiAqYS7IuG/ylCCr6ZEsfz4alqw=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=gFU+H1AKmQtssMHAOgdoFqWuMaNxvFwJQW0LljfEEmpAOpqS7vRxB+9LP8vQsuiNmpJ67egb4DKlM4plor3rlYA8lwGHT8tVmYeJMm+ukXCEwtdloCm2a9vsQGQyPNRI/M2HDfVGO+UmqtZT2jE0xt+5aBO3wEhFsbFnnONg6sM=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=Wv03do+W; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="Wv03do+W"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=rLqVCE+8AKs0YR5cdyKJVl8e7opkmwOjF+J4auHvupc=; t=1730151378; x=1731015378;
	b=Wv03do+Www991f/Wv4okEcng8z9I1Ib/xPlYVPkgMuYZfLwgsWxl5bmUwHM86+KESed5CcZdJCm
	QG/wh2YRE5v/SGaevHpWAwP7OX8KuyJFh41BpfIuPAReh7cLkAW+H75SCHbBlZGuZ++jYlCD6adaK
	H85cfeK8KvpaLkrh0cfd8l952BRCk3AsKKoCfjJREwNuQwzYV9kM8UviI9ch3YhUBRgytaFUX22sT
	3NXZLKbljlbKcNdS+Y2Iw9STvGoo3FNV10QGvc0BMTclIeozJPFbXZNgBlRnz2eBoRdAV1wagBCBh
	ZZsWIjwdNt7mDaSBE1Q7ec86I83XS2dqhmUw==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPN-0005xj-0Q; Mon, 28 Oct 2024 14:36:18 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 05/12] net: homa: create homa_rpc.h and homa_rpc.c
Date: Mon, 28 Oct 2024 14:35:32 -0700
Message-ID: <20241028213541.1529-6-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 49c929fcfde82ba71a4afdbb4c0ad636
X-Patchwork-Delegate: kuba@kernel.org

These files provide basic functions for managing remote procedure calls,
which are the fundamental entities managed by Homa. Each RPC consists
of a request message from a client to a server, followed by a response
message returned from the server to the client.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_rpc.c | 489 ++++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_rpc.h | 439 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 928 insertions(+)
 create mode 100644 net/homa/homa_rpc.c
 create mode 100644 net/homa/homa_rpc.h

diff --git a/net/homa/homa_rpc.c b/net/homa/homa_rpc.c
new file mode 100644
index 000000000000..5f19aa040442
--- /dev/null
+++ b/net/homa/homa_rpc.c
@@ -0,0 +1,489 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions for managing homa_rpc structs. */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+#include "homa_stub.h"
+
+/**
+ * homa_rpc_new_client() - Allocate and construct a client RPC (one that is used
+ * to issue an outgoing request). Doesn't send any packets. Invoked with no
+ * locks held.
+ * @hsk:      Socket to which the RPC belongs.
+ * @dest:     Address of host (ip and port) to which the RPC will be sent.
+ *
+ * Return:    A printer to the newly allocated object, or a negative
+ *            errno if an error occurred. The RPC will be locked; the
+ *            caller must eventually unlock it.
+ */
+struct homa_rpc *homa_rpc_new_client(struct homa_sock *hsk,
+				     const union sockaddr_in_union *dest)
+{
+	struct in6_addr dest_addr_as_ipv6 = canonical_ipv6_addr(dest);
+	struct homa_rpc_bucket *bucket;
+	struct homa_rpc *crpc;
+	int err;
+
+	crpc = kmalloc(sizeof(*crpc), GFP_KERNEL);
+	if (unlikely(!crpc))
+		return ERR_PTR(-ENOMEM);
+
+	/* Initialize fields that don't require the socket lock. */
+	crpc->hsk = hsk;
+	crpc->id = atomic64_fetch_add(2, &hsk->homa->next_outgoing_id);
+	bucket = homa_client_rpc_bucket(hsk, crpc->id);
+	crpc->bucket = bucket;
+	crpc->state = RPC_OUTGOING;
+	atomic_set(&crpc->flags, 0);
+	crpc->peer = homa_peer_find(hsk->homa->peers, &dest_addr_as_ipv6,
+				    &hsk->inet);
+	if (IS_ERR(crpc->peer)) {
+		err = PTR_ERR(crpc->peer);
+		goto error;
+	}
+	crpc->dport = ntohs(dest->in6.sin6_port);
+	crpc->completion_cookie = 0;
+	crpc->error = 0;
+	crpc->msgin.length = -1;
+	crpc->msgin.num_bpages = 0;
+	memset(&crpc->msgout, 0, sizeof(crpc->msgout));
+	crpc->msgout.length = -1;
+	INIT_LIST_HEAD(&crpc->ready_links);
+	INIT_LIST_HEAD(&crpc->buf_links);
+	INIT_LIST_HEAD(&crpc->dead_links);
+	crpc->interest = NULL;
+	INIT_LIST_HEAD(&crpc->throttled_links);
+	crpc->silent_ticks = 0;
+	crpc->resend_timer_ticks = hsk->homa->timer_ticks;
+	crpc->done_timer_ticks = 0;
+	crpc->magic = HOMA_RPC_MAGIC;
+	crpc->start_cycles = get_cycles();
+
+	/* Initialize fields that require locking. This allows the most
+	 * expensive work, such as copying in the message from user space,
+	 * to be performed without holding locks. Also, can't hold spin
+	 * locks while doing things that could block, such as memory allocation.
+	 */
+	homa_bucket_lock(bucket, crpc->id, "homa_rpc_new_client");
+	homa_sock_lock(hsk, "homa_rpc_new_client");
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		homa_rpc_unlock(crpc);
+		err = -ESHUTDOWN;
+		goto error;
+	}
+	hlist_add_head(&crpc->hash_links, &bucket->rpcs);
+	list_add_tail_rcu(&crpc->active_links, &hsk->active_rpcs);
+	homa_sock_unlock(hsk);
+
+	return crpc;
+
+error:
+	kfree(crpc);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_new_server() - Allocate and construct a server RPC (one that is
+ * used to manage an incoming request). If appropriate, the RPC will also
+ * be handed off (we do it here, while we have the socket locked, to avoid
+ * acquiring the socket lock a second time later for the handoff).
+ * @hsk:      Socket that owns this RPC.
+ * @source:   IP address (network byte order) of the RPC's client.
+ * @h:        Header for the first data packet received for this RPC; used
+ *            to initialize the RPC.
+ * @created:  Will be set to 1 if a new RPC was created and 0 if an
+ *            existing RPC was found.
+ *
+ * Return:  A pointer to a new RPC, which is locked, or a negative errno
+ *          if an error occurred. If there is already an RPC corresponding
+ *          to h, then it is returned instead of creating a new RPC.
+ */
+struct homa_rpc *homa_rpc_new_server(struct homa_sock *hsk,
+				     const struct in6_addr *source,
+				     struct data_header *h, int *created)
+{
+	__u64 id = homa_local_id(h->common.sender_id);
+	struct homa_rpc_bucket *bucket;
+	struct homa_rpc *srpc = NULL;
+	int err;
+
+	/* Lock the bucket, and make sure no-one else has already created
+	 * the desired RPC.
+	 */
+	bucket = homa_server_rpc_bucket(hsk, id);
+	homa_bucket_lock(bucket, id, "homa_rpc_new_server");
+	hlist_for_each_entry_rcu(srpc, &bucket->rpcs, hash_links) {
+		if (srpc->id == id &&
+		    srpc->dport == ntohs(h->common.sport) &&
+		    ipv6_addr_equal(&srpc->peer->addr, source)) {
+			/* RPC already exists; just return it instead
+			 * of creating a new RPC.
+			 */
+			*created = 0;
+			return srpc;
+		}
+	}
+
+	/* Initialize fields that don't require the socket lock. */
+	srpc = kmalloc(sizeof(*srpc), GFP_KERNEL);
+	if (!srpc) {
+		err = -ENOMEM;
+		goto error;
+	}
+	srpc->hsk = hsk;
+	srpc->bucket = bucket;
+	srpc->state = RPC_INCOMING;
+	atomic_set(&srpc->flags, 0);
+	srpc->peer = homa_peer_find(hsk->homa->peers, source, &hsk->inet);
+	if (IS_ERR(srpc->peer)) {
+		err = PTR_ERR(srpc->peer);
+		goto error;
+	}
+	srpc->dport = ntohs(h->common.sport);
+	srpc->id = id;
+	srpc->completion_cookie = 0;
+	srpc->error = 0;
+	srpc->msgin.length = -1;
+	srpc->msgin.num_bpages = 0;
+	memset(&srpc->msgout, 0, sizeof(srpc->msgout));
+	srpc->msgout.length = -1;
+	INIT_LIST_HEAD(&srpc->ready_links);
+	INIT_LIST_HEAD(&srpc->buf_links);
+	INIT_LIST_HEAD(&srpc->dead_links);
+	srpc->interest = NULL;
+	INIT_LIST_HEAD(&srpc->throttled_links);
+	srpc->silent_ticks = 0;
+	srpc->resend_timer_ticks = hsk->homa->timer_ticks;
+	srpc->done_timer_ticks = 0;
+	srpc->magic = HOMA_RPC_MAGIC;
+	srpc->start_cycles = get_cycles();
+	err = homa_message_in_init(srpc, ntohl(h->message_length));
+	if (err != 0)
+		goto error;
+
+	/* Initialize fields that require socket to be locked. */
+	homa_sock_lock(hsk, "homa_rpc_new_server");
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		err = -ESHUTDOWN;
+		goto error;
+	}
+	hlist_add_head(&srpc->hash_links, &bucket->rpcs);
+	list_add_tail_rcu(&srpc->active_links, &hsk->active_rpcs);
+	if (ntohl(h->seg.offset) == 0 && srpc->msgin.num_bpages > 0) {
+		atomic_or(RPC_PKTS_READY, &srpc->flags);
+		homa_rpc_handoff(srpc);
+	}
+	homa_sock_unlock(hsk);
+	*created = 1;
+	return srpc;
+
+error:
+	homa_bucket_unlock(bucket, id);
+	kfree(srpc);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_acked() - This function is invoked when an ack is received
+ * for an RPC; if the RPC still exists, is freed.
+ * @hsk:     Socket on which the ack was received. May or may not correspond
+ *           to the RPC, but can sometimes be used to avoid a socket lookup.
+ * @saddr:   Source address from which the act was received (the client
+ *           note for the RPC)
+ * @ack:     Information about an RPC from @saddr that may now be deleted safely.
+ */
+void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+		    struct homa_ack *ack)
+{
+	__u16 client_port = ntohs(ack->client_port);
+	__u16 server_port = ntohs(ack->server_port);
+	__u64 id = homa_local_id(ack->client_id);
+	struct homa_sock *hsk2 = hsk;
+	struct homa_rpc *rpc;
+
+	UNIT_LOG("; ", "ack %llu", id);
+	if (hsk2->port != server_port) {
+		/* Without RCU, sockets other than hsk can be deleted
+		 * out from under us.
+		 */
+		rcu_read_lock();
+		hsk2 = homa_sock_find(hsk->homa->port_map, server_port);
+		if (!hsk2)
+			goto done;
+	}
+	rpc = homa_find_server_rpc(hsk2, saddr, client_port, id);
+	if (rpc) {
+		homa_rpc_free(rpc);
+		homa_rpc_unlock(rpc);
+	}
+
+done:
+	if (hsk->port != server_port)
+		rcu_read_unlock();
+}
+
+/**
+ * homa_rpc_free() - Destructor for homa_rpc; will arrange for all resources
+ * associated with the RPC to be released (eventually).
+ * @rpc:  Structure to clean up, or NULL. Must be locked. Its socket must
+ *        not be locked.
+ */
+void homa_rpc_free(struct homa_rpc *rpc)
+{
+	/* The goal for this function is to make the RPC inaccessible,
+	 * so that no other code will ever access it again. However, don't
+	 * actually release resources; leave that to homa_rpc_reap, which
+	 * runs later. There are two reasons for this. First, releasing
+	 * resources may be expensive, so we don't want to keep the caller
+	 * waiting; homa_rpc_reap will run in situations where there is time
+	 * to spare. Second, there may be other code that currently has
+	 * pointers to this RPC but temporarily released the lock (e.g. to
+	 * copy data to/from user space). It isn't safe to clean up until
+	 * that code has finished its work and released any pointers to the
+	 * RPC (homa_rpc_reap will ensure that this has happened). So, this
+	 * function should only make changes needed to make the RPC
+	 * inaccessible.
+	 */
+	if (!rpc || rpc->state == RPC_DEAD)
+		return;
+	UNIT_LOG("; ", "homa_rpc_free invoked");
+	rpc->state = RPC_DEAD;
+
+	/* Unlink from all lists, so no-one will ever find this RPC again. */
+	homa_sock_lock(rpc->hsk, "homa_rpc_free");
+	__hlist_del(&rpc->hash_links);
+	list_del_rcu(&rpc->active_links);
+	list_add_tail_rcu(&rpc->dead_links, &rpc->hsk->dead_rpcs);
+	__list_del_entry(&rpc->ready_links);
+	__list_del_entry(&rpc->buf_links);
+	if (rpc->interest) {
+		rpc->interest->reg_rpc = NULL;
+		wake_up_process(rpc->interest->thread);
+		rpc->interest = NULL;
+	}
+//	tt_record3("Freeing rpc id %d, socket %d, dead_skbs %d", rpc->id,
+//			rpc->hsk->client_port,
+//			rpc->hsk->dead_skbs);
+
+	if (rpc->msgin.length >= 0) {
+		rpc->hsk->dead_skbs += skb_queue_len(&rpc->msgin.packets);
+		while (1) {
+			struct homa_gap *gap = list_first_entry_or_null(&rpc->msgin.gaps,
+					struct homa_gap, links);
+			if (!gap)
+				break;
+			list_del(&gap->links);
+			kfree(gap);
+		}
+	}
+	rpc->hsk->dead_skbs += rpc->msgout.num_skbs;
+	if (rpc->hsk->dead_skbs > rpc->hsk->homa->max_dead_buffs)
+		/* This update isn't thread-safe; it's just a
+		 * statistic so it's OK if updates occasionally get
+		 * missed.
+		 */
+		rpc->hsk->homa->max_dead_buffs = rpc->hsk->dead_skbs;
+
+	homa_sock_unlock(rpc->hsk);
+	homa_remove_from_throttled(rpc);
+}
+
+/**
+ * homa_rpc_reap() - Invoked to release resources associated with dead
+ * RPCs for a given socket. For a large RPC, it can take a long time to
+ * free all of its packet buffers, so we try to perform this work
+ * off the critical path where it won't delay applications. Each call to
+ * this function does a small chunk of work. See the file reap.txt for
+ * more information.
+ * @hsk:   Homa socket that may contain dead RPCs. Must not be locked by the
+ *         caller; this function will lock and release.
+ * @count: Number of buffers to free during this call.
+ *
+ * Return: A return value of 0 means that we ran out of work to do; calling
+ *         again will do no work (there could be unreaped RPCs, but if so,
+ *         reaping has been disabled for them).  A value greater than
+ *         zero means there is still more reaping work to be done.
+ */
+int homa_rpc_reap(struct homa_sock *hsk, int count)
+{
+#ifdef __UNIT_TEST__
+#define BATCH_MAX 3
+#else
+#define BATCH_MAX 20
+#endif
+	struct homa_rpc *rpcs[BATCH_MAX];
+	struct sk_buff *skbs[BATCH_MAX];
+	int num_skbs, num_rpcs;
+	struct homa_rpc *rpc;
+	int i, batch_size;
+	int rx_frees = 0;
+	int result;
+
+	/* Each iteration through the following loop will reap
+	 * BATCH_MAX skbs.
+	 */
+	while (count > 0) {
+		batch_size = count;
+		if (batch_size > BATCH_MAX)
+			batch_size = BATCH_MAX;
+		count -= batch_size;
+		num_skbs = 0;
+		num_rpcs = 0;
+
+		homa_sock_lock(hsk, "homa_rpc_reap");
+		if (atomic_read(&hsk->protect_count)) {
+			homa_sock_unlock(hsk);
+			return 0;
+		}
+
+		/* Collect buffers and freeable RPCs. */
+		list_for_each_entry_rcu(rpc, &hsk->dead_rpcs, dead_links) {
+			if ((atomic_read(&rpc->flags) & RPC_CANT_REAP) ||
+			    atomic_read(&rpc->msgout.active_xmits) != 0) {
+				continue;
+			}
+			rpc->magic = 0;
+
+			/* For Tx sk_buffs, collect them here but defer
+			 * freeing until after releasing the socket lock.
+			 */
+			if (rpc->msgout.length >= 0) {
+				while (rpc->msgout.packets) {
+					skbs[num_skbs] = rpc->msgout.packets;
+					rpc->msgout.packets = homa_get_skb_info(rpc
+							->msgout.packets)->next_skb;
+					num_skbs++;
+					rpc->msgout.num_skbs--;
+					if (num_skbs >= batch_size)
+						goto release;
+				}
+			}
+
+			/* In the normal case rx sk_buffs will already have been
+			 * freed before we got here. Thus it's OK to free
+			 * immediately in rare situations where there are
+			 * buffers left.
+			 */
+			if (rpc->msgin.length >= 0) {
+				while (1) {
+					struct sk_buff *skb;
+
+					skb = skb_dequeue(&rpc->msgin.packets);
+					if (!skb)
+						break;
+					kfree_skb(skb);
+					rx_frees++;
+				}
+			}
+
+			/* If we get here, it means all packets have been
+			 *  removed from the RPC.
+			 */
+			rpcs[num_rpcs] = rpc;
+			num_rpcs++;
+			list_del_rcu(&rpc->dead_links);
+			if (num_rpcs >= batch_size)
+				goto release;
+		}
+
+		/* Free all of the collected resources; release the socket
+		 * lock while doing this.
+		 */
+release:
+		hsk->dead_skbs -= num_skbs + rx_frees;
+		result = !list_empty(&hsk->dead_rpcs) &&
+				(num_skbs + num_rpcs) != 0;
+		homa_sock_unlock(hsk);
+		homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
+		for (i = 0; i < num_rpcs; i++) {
+			rpc = rpcs[i];
+			UNIT_LOG("; ", "reaped %llu", rpc->id);
+			/* Lock and unlock the RPC before freeing it. This
+			 * is needed to deal with races where the code
+			 * that invoked homa_rpc_free hasn't unlocked the
+			 * RPC yet.
+			 */
+			homa_rpc_lock(rpc, "homa_rpc_reap");
+			homa_rpc_unlock(rpc);
+
+			if (unlikely(rpc->msgin.num_bpages))
+				homa_pool_release_buffers(rpc->hsk->buffer_pool,
+							  rpc->msgin.num_bpages,
+							  rpc->msgin.bpage_offsets);
+			if (rpc->msgin.length >= 0) {
+				while (1) {
+					struct homa_gap *gap = list_first_entry_or_null(&rpc
+							->msgin.gaps,
+							struct homa_gap, links);
+					if (!gap)
+						break;
+					list_del(&gap->links);
+					kfree(gap);
+				}
+			}
+			rpc->state = 0;
+			kfree(rpc);
+		}
+		if (!result)
+			break;
+	}
+	homa_pool_check_waiting(hsk->buffer_pool);
+	return result;
+}
+
+/**
+ * homa_find_client_rpc() - Locate client-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk:      Socket via which packet was received.
+ * @id:       Unique identifier for the RPC.
+ *
+ * Return:    A pointer to the homa_rpc for this id, or NULL if none.
+ *            The RPC will be locked; the caller must eventually unlock it
+ *            by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, __u64 id)
+{
+	struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
+	struct homa_rpc *crpc;
+
+	homa_bucket_lock(bucket, id, __func__);
+	hlist_for_each_entry_rcu(crpc, &bucket->rpcs, hash_links) {
+		if (crpc->id == id)
+			return crpc;
+	}
+	homa_bucket_unlock(bucket, id);
+	return NULL;
+}
+
+/**
+ * homa_find_server_rpc() - Locate server-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk:      Socket via which packet was received.
+ * @saddr:    Address from which the packet was sent.
+ * @sport:    Port at @saddr from which the packet was sent.
+ * @id:       Unique identifier for the RPC (must have server bit set).
+ *
+ * Return:    A pointer to the homa_rpc matching the arguments, or NULL
+ *            if none. The RPC will be locked; the caller must eventually
+ *            unlock it by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_find_server_rpc(struct homa_sock *hsk,
+				      const struct in6_addr *saddr, __u16 sport,
+				      __u64 id)
+{
+	struct homa_rpc_bucket *bucket = homa_server_rpc_bucket(hsk, id);
+	struct homa_rpc *srpc;
+
+	homa_bucket_lock(bucket, id, __func__);
+	hlist_for_each_entry_rcu(srpc, &bucket->rpcs, hash_links) {
+		if (srpc->id == id && srpc->dport == sport &&
+		    ipv6_addr_equal(&srpc->peer->addr, saddr))
+			return srpc;
+	}
+	homa_bucket_unlock(bucket, id);
+	return NULL;
+}
diff --git a/net/homa/homa_rpc.h b/net/homa/homa_rpc.h
new file mode 100644
index 000000000000..1e299fc8d8bc
--- /dev/null
+++ b/net/homa/homa_rpc.h
@@ -0,0 +1,439 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines homa_rpc and related structs.  */
+
+#ifndef _HOMA_RPC_H
+#define _HOMA_RPC_H
+
+#include <linux/percpu-defs.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+
+#include "homa_sock.h"
+#include "homa_wire.h"
+
+/* Forward references. */
+struct homa_ack;
+
+/**
+ * struct homa_message_out - Describes a message (either request or response)
+ * for which this machine is the sender.
+ */
+struct homa_message_out {
+	/**
+	 * @length: Total bytes in message (excluding headers).  A value
+	 * less than 0 means this structure is uninitialized and therefore
+	 * not in use (all other fields will be zero in this case).
+	 */
+	int length;
+
+	/** @num_skbs: Total number of buffers currently in @packets. */
+	int num_skbs;
+
+	/**
+	 * @copied_from_user: Number of bytes of the message that have
+	 * been copied from user space into skbs in @packets.
+	 */
+	int copied_from_user;
+
+	/**
+	 * @packets: Singly-linked list of all packets in message, linked
+	 * using homa_next_skb. The list is in order of offset in the message
+	 * (offset 0 first); each sk_buff can potentially contain multiple
+	 * data_segments, which will be split into separate packets by GSO.
+	 * This list grows gradually as data is copied in from user space,
+	 * so it may not be complete.
+	 */
+	struct sk_buff *packets;
+
+	/**
+	 * @next_xmit: Pointer to pointer to next packet to transmit (will
+	 * either refer to @packets or homa_next_skb(skb) for some skb
+	 * in @packets).
+	 */
+	struct sk_buff **next_xmit;
+
+	/**
+	 * @next_xmit_offset: All bytes in the message, up to but not
+	 * including this one, have been transmitted.
+	 */
+	int next_xmit_offset;
+
+	/**
+	 * @active_xmits: The number of threads that are currently
+	 * transmitting data packets for this RPC; can't reap the RPC
+	 * until this count becomes zero.
+	 */
+	atomic_t active_xmits;
+
+	/**
+	 * @init_cycles: Time in get_cycles units when this structure was
+	 * initialized.  Used to find the oldest outgoing message.
+	 */
+	__u64 init_cycles;
+};
+
+/**
+ * struct homa_gap - Represents a range of bytes within a message that have
+ * not yet been received.
+ */
+struct homa_gap {
+	/** @start: offset of first byte in this gap. */
+	int start;
+
+	/** @end: offset of byte just after last one in this gap. */
+	int end;
+
+	/**
+	 * @time: time (in get_cycles units) when the gap was first detected.
+	 * As of 7/2024 this isn't used for anything.
+	 */
+	__u64 time;
+
+	/** @links: for linking into list in homa_message_in. */
+	struct list_head links;
+};
+
+/**
+ * struct homa_message_in - Holds the state of a message received by
+ * this machine; used for both requests and responses.
+ */
+struct homa_message_in {
+	/**
+	 * @length: Payload size in bytes. A value less than 0 means this
+	 * structure is uninitialized and therefore not in use.
+	 */
+	int length;
+
+	/**
+	 * @packets: DATA packets for this message that have been received but
+	 * not yet copied to user space (no particular order).
+	 */
+	struct sk_buff_head packets;
+
+	/**
+	 * @recv_end: Offset of the byte just after the highest one that
+	 * has been received so far.
+	 */
+	int recv_end;
+
+	/**
+	 * @gaps: List of homa_gaps describing all of the bytes with
+	 * offsets less than @recv_end that have not yet been received.
+	 */
+	struct list_head gaps;
+
+	/**
+	 * @bytes_remaining: Amount of data for this message that has
+	 * not yet been received; will determine the message's priority.
+	 */
+	int bytes_remaining;
+
+	/** @resend_all: if nonzero, set resend_all in the next grant packet. */
+	__u8 resend_all;
+
+	/**
+	 * @num_bpages: The number of entries in @bpage_offsets used for this
+	 * message (0 means buffers not allocated yet).
+	 */
+	__u32 num_bpages;
+
+	/** @bpage_offsets: Describes buffer space allocated for this message.
+	 * Each entry is an offset from the start of the buffer region.
+	 * All but the last pointer refer to areas of size HOMA_BPAGE_SIZE.
+	 */
+	__u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+/**
+ * struct homa_rpc - One of these structures exists for each active
+ * RPC. The same structure is used to manage both outgoing RPCs on
+ * clients and incoming RPCs on servers.
+ */
+struct homa_rpc {
+	/** @hsk:  Socket that owns the RPC. */
+	struct homa_sock *hsk;
+
+	/** @bucket: Pointer to the bucket in hsk->client_rpc_buckets or
+	 * hsk->server_rpc_buckets where this RPC is linked. Used primarily
+	 * for locking the RPC (which is done by locking its bucket).
+	 */
+	struct homa_rpc_bucket *bucket;
+
+	/**
+	 * @state: The current state of this RPC:
+	 *
+	 * @RPC_OUTGOING:     The RPC is waiting for @msgout to be transmitted
+	 *                    to the peer.
+	 * @RPC_INCOMING:     The RPC is waiting for data @msgin to be received
+	 *                    from the peer; at least one packet has already
+	 *                    been received.
+	 * @RPC_IN_SERVICE:   Used only for server RPCs: the request message
+	 *                    has been read from the socket, but the response
+	 *                    message has not yet been presented to the kernel.
+	 * @RPC_DEAD:         RPC has been deleted and is waiting to be
+	 *                    reaped. In some cases, information in the RPC
+	 *                    structure may be accessed in this state.
+	 *
+	 * Client RPCs pass through states in the following order:
+	 * RPC_OUTGOING, RPC_INCOMING, RPC_DEAD.
+	 *
+	 * Server RPCs pass through states in the following order:
+	 * RPC_INCOMING, RPC_IN_SERVICE, RPC_OUTGOING, RPC_DEAD.
+	 */
+	enum {
+		RPC_OUTGOING            = 5,
+		RPC_INCOMING            = 6,
+		RPC_IN_SERVICE          = 8,
+		RPC_DEAD                = 9
+	} state;
+
+	/**
+	 * @flags: Additional state information: an OR'ed combination of
+	 * various single-bit flags. See below for definitions. Must be
+	 * manipulated with atomic operations because some of the manipulations
+	 * occur without holding the RPC lock.
+	 */
+	atomic_t flags;
+
+	/* Valid bits for @flags:
+	 * RPC_PKTS_READY -        The RPC has input packets ready to be
+	 *                         copied to user space.
+	 * RPC_COPYING_FROM_USER - Data is being copied from user space into
+	 *                         the RPC; the RPC must not be reaped.
+	 * RPC_COPYING_TO_USER -   Data is being copied from this RPC to
+	 *                         user space; the RPC must not be reaped.
+	 * RPC_HANDING_OFF -       This RPC is in the process of being
+	 *                         handed off to a waiting thread; it must
+	 *                         not be reaped.
+	 * APP_NEEDS_LOCK -        Means that code in the application thread
+	 *                         needs the RPC lock (e.g. so it can start
+	 *                         copying data to user space) so others
+	 *                         (e.g. SoftIRQ processing) should relinquish
+	 *                         the lock ASAP. Without this, SoftIRQ can
+	 *                         lock out the application for a long time,
+	 *                         preventing data copies to user space from
+	 *                         starting (and they limit throughput at
+	 *                         high network speeds).
+	 */
+#define RPC_PKTS_READY        1
+#define RPC_COPYING_FROM_USER 2
+#define RPC_COPYING_TO_USER   4
+#define RPC_HANDING_OFF       8
+#define APP_NEEDS_LOCK       16
+
+#define RPC_CANT_REAP (RPC_COPYING_FROM_USER | RPC_COPYING_TO_USER \
+		| RPC_HANDING_OFF)
+
+	/**
+	 * @peer: Information about the other machine (the server, if
+	 * this is a client RPC, or the client, if this is a server RPC).
+	 */
+	struct homa_peer *peer;
+
+	/** @dport: Port number on @peer that will handle packets. */
+	__u16 dport;
+
+	/**
+	 * @id: Unique identifier for the RPC among all those issued
+	 * from its port. The low-order bit indicates whether we are
+	 * server (1) or client (0) for this RPC.
+	 */
+	__u64 id;
+
+	/**
+	 * @completion_cookie: Only used on clients. Contains identifying
+	 * information about the RPC provided by the application; returned to
+	 * the application with the RPC's result.
+	 */
+	__u64 completion_cookie;
+
+	/**
+	 * @error: Only used on clients. If nonzero, then the RPC has
+	 * failed and the value is a negative errno that describes the
+	 * problem.
+	 */
+	int error;
+
+	/**
+	 * @msgin: Information about the message we receive for this RPC
+	 * (for server RPCs this is the request, for client RPCs this is the
+	 * response).
+	 */
+	struct homa_message_in msgin;
+
+	/**
+	 * @msgout: Information about the message we send for this RPC
+	 * (for client RPCs this is the request, for server RPCs this is the
+	 * response).
+	 */
+	struct homa_message_out msgout;
+
+	/**
+	 * @hash_links: Used to link this object into a hash bucket for
+	 * either @hsk->client_rpc_buckets (for a client RPC), or
+	 * @hsk->server_rpc_buckets (for a server RPC).
+	 */
+	struct hlist_node hash_links;
+
+	/**
+	 * @ready_links: Used to link this object into
+	 * @hsk->ready_requests or @hsk->ready_responses.
+	 */
+	struct list_head ready_links;
+
+	/**
+	 * @buf_links: Used to link this RPC into @hsk->waiting_for_bufs.
+	 * If the RPC isn't on @hsk->waiting_for_bufs, this is an empty
+	 * list pointing to itself.
+	 */
+	struct list_head buf_links;
+
+	/**
+	 * @active_links: For linking this object into @hsk->active_rpcs.
+	 * The next field will be LIST_POISON1 if this RPC hasn't yet been
+	 * linked into @hsk->active_rpcs. Access with RCU.
+	 */
+	struct list_head active_links;
+
+	/** @dead_links: For linking this object into @hsk->dead_rpcs. */
+	struct list_head dead_links;
+
+	/**
+	 * @interest: Describes a thread that wants to be notified when
+	 * msgin is complete, or NULL if none.
+	 */
+	struct homa_interest *interest;
+
+	/**
+	 * @throttled_links: Used to link this RPC into homa->throttled_rpcs.
+	 * If this RPC isn't in homa->throttled_rpcs, this is an empty
+	 * list pointing to itself.
+	 */
+	struct list_head throttled_links;
+
+	/**
+	 * @silent_ticks: Number of times homa_timer has been invoked
+	 * since the last time a packet indicating progress was received
+	 * for this RPC, so we don't need to send a resend for a while.
+	 */
+	int silent_ticks;
+
+	/**
+	 * @resend_timer_ticks: Value of homa->timer_ticks the last time
+	 * we sent a RESEND for this RPC.
+	 */
+	__u32 resend_timer_ticks;
+
+	/**
+	 * @done_timer_ticks: The value of homa->timer_ticks the first
+	 * time we noticed that this (server) RPC is done (all response
+	 * packets have been transmitted), so we're ready for an ack.
+	 * Zero means we haven't reached that point yet.
+	 */
+	__u32 done_timer_ticks;
+
+	/**
+	 * @magic: when the RPC is alive, this holds a distinct value that
+	 * is unlikely to occur naturally. The value is cleared when the
+	 * RPC is reaped, so we can detect accidental use of an RPC after
+	 * it has been reaped.
+	 */
+#define HOMA_RPC_MAGIC 0xdeadbeef
+	int magic;
+
+	/**
+	 * @start_cycles: time (from get_cycles()) when this RPC was created.
+	 * Used (sometimes) for testing.
+	 */
+	u64 start_cycles;
+};
+
+void             homa_check_rpc(struct homa_rpc *rpc);
+struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, __u64 id);
+struct homa_rpc *homa_find_server_rpc(struct homa_sock *hsk,
+				      const struct in6_addr *saddr, __u16 sport,
+				      __u64 id);
+void             homa_rpc_acked(struct homa_sock *hsk,
+				const struct in6_addr *saddr,
+				struct homa_ack *ack);
+void             homa_rpc_free(struct homa_rpc *rpc);
+struct homa_rpc *homa_rpc_new_client(struct homa_sock *hsk,
+				     const union sockaddr_in_union *dest);
+struct homa_rpc *homa_rpc_new_server(struct homa_sock *hsk,
+				     const struct in6_addr *source,
+				     struct data_header *h,
+				     int *created);
+int              homa_rpc_reap(struct homa_sock *hsk, int count);
+
+/**
+ * homa_rpc_lock() - Acquire the lock for an RPC.
+ * @rpc:    RPC to lock. Note: this function is only safe under
+ *          limited conditions (in most cases homa_bucket_lock should be
+ *          used). The caller must ensure that the RPC cannot be reaped
+ *          before the lock is acquired. It cannot do that by acquirin
+ *          the socket lock, since that violates lock ordering constraints.
+ *          One approach is to use homa_protect_rpcs. Don't use this function
+ *          unless you are very sure what you are doing!  See sync.txt for
+ *          more info on locking.
+ * @locker: Static string identifying the locking code. Normally ignored,
+ *          but used occasionally for diagnostics and debugging.
+ */
+static inline void homa_rpc_lock(struct homa_rpc *rpc, const char *locker)
+{
+	homa_bucket_lock(rpc->bucket, rpc->id, locker);
+}
+
+/**
+ * homa_rpc_unlock() - Release the lock for an RPC.
+ * @rpc:   RPC to unlock.
+ */
+static inline void homa_rpc_unlock(struct homa_rpc *rpc)
+{
+	homa_bucket_unlock(rpc->bucket, rpc->id);
+}
+
+/**
+ * homa_protect_rpcs() - Ensures that no RPCs will be reaped for a given
+ * socket until homa_sock_unprotect is called. Typically used by functions
+ * that want to scan the active RPCs for a socket without holding the socket
+ * lock.  Multiple calls to this function may be in effect at once.
+ * @hsk:    Socket whose RPCs should be protected. Must not be locked
+ *          by the caller; will be locked here.
+ *
+ * Return:  1 for success, 0 if the socket has been shutdown, in which
+ *          case its RPCs cannot be protected.
+ */
+static inline int homa_protect_rpcs(struct homa_sock *hsk)
+{
+	int result;
+
+	homa_sock_lock(hsk, __func__);
+	result = !hsk->shutdown;
+	if (result)
+		atomic_inc(&hsk->protect_count);
+	homa_sock_unlock(hsk);
+	return result;
+}
+
+/**
+ * homa_unprotect_rpcs() - Cancel the effect of a previous call to
+ * homa_sock_protect(), so that RPCs can once again be reaped.
+ * @hsk:    Socket whose RPCs should be unprotected.
+ */
+static inline void homa_unprotect_rpcs(struct homa_sock *hsk)
+{
+	atomic_dec(&hsk->protect_count);
+}
+
+/**
+ * homa_is_client(): returns true if we are the client for a particular RPC,
+ * false if we are the server.
+ * @id:  Id of the RPC in question.
+ */
+static inline bool homa_is_client(__u64 id)
+{
+	return (id & 1) == 0;
+}
+
+#endif /* _HOMA_RPC_H */

From patchwork Mon Oct 28 21:35:33 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854175
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B3F321E0DB6
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:19 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151383; cv=none;
 b=cSH27GehdyU9Sbedoo9JUCj9Y2u0vuEW/S3p8kBWAowO/6RwAS+gDYO4yQ16M4sNlSwzwgiBGPIz6xTI6BaGqN4AzmWdleVipE4pdtEhzCXwwg48ZMnNHhPy6b8XTArZsjwrOFgnWnin7fOtUdRp+cSbqpy+taTSbOl8Tj1jjt4=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151383; c=relaxed/simple;
	bh=3qfO049mItcOyLvaaXnSrxsUTQ4d49yDJoWZ0AEmm2E=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=LmHqRUeb46WIYCq1riSXfuuBY9Jw1ZX8k7yT4FBMulGxP+RqClh1MJWZwqmG/JKCEauxbN8Lah2lM7mOwW9ferwjbMzYZN7+y2ejYsOQ2qKh9x21i0BExZNio1GhngPE/DgroqYSHWVZMtGPUykPY7P3Zfk4CzaGNwj8Mx5yoPI=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=DfRH3ecX; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="DfRH3ecX"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=BzXKh3FTrfVPCIfsbceDvpAW0K/uCXZyviOQ6akMlOY=; t=1730151379; x=1731015379;
	b=DfRH3ecXeaTg4T2pzEsa6gwszvy5XrKJBHghz/lkDNceAhNuGkh8FGS0X7NhjbOEzKl7zL7iRw3
	z/pxqvLrzh4ZavEJXfDT0qZx8eKvAOnsCv+xVQhMWYZ6pSutvg5b1HSRqt3LgXE+C7LRC1dBrXlYn
	xIw+fUNC09ukrtHiKlhajOBJYd9lMZeuFj6eskzJkdvR2Siy74aEZiNOLClrWV38e4pA3DLXfApK7
	UjVFDDq8Lyqd+/qDK0yIfqF6Cp5W94loo5KZxqCMAui/+ZaIISVlRYRav9AYeKrLgUPDksCm/9wSY
	5SDG0ai2+9lt5zKQabMCuLIflgdz6HWVyp9g==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPO-0005xj-AC; Mon, 28 Oct 2024 14:36:19 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 06/12] net: homa: create homa_peer.h and homa_peer.c
Date: Mon, 28 Oct 2024 14:35:33 -0700
Message-ID: <20241028213541.1529-7-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: a500dc3e51996855778ea9062b0c2e27
X-Patchwork-Delegate: kuba@kernel.org

Homa needs to keep a small amount of information for each peer that
it has communicated with. These files define that state and provide
functions for storing and accessing it.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_peer.c | 314 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_peer.h | 227 +++++++++++++++++++++++++++++++
 2 files changed, 541 insertions(+)
 create mode 100644 net/homa/homa_peer.c
 create mode 100644 net/homa/homa_peer.h

diff --git a/net/homa/homa_peer.c b/net/homa/homa_peer.c
new file mode 100644
index 000000000000..a4d249b31ae4
--- /dev/null
+++ b/net/homa/homa_peer.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file provides functions related to homa_peer and homa_peertab
+ * objects.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+
+/**
+ * homa_peertab_init() - Constructor for homa_peertabs.
+ * @peertab:  The object to initialize; previous contents are discarded.
+ *
+ * Return:    0 in the normal case, or a negative errno if there was a problem.
+ */
+int homa_peertab_init(struct homa_peertab *peertab)
+{
+	/* Note: when we return, the object must be initialized so it's
+	 * safe to call homa_peertab_destroy, even if this function returns
+	 * an error.
+	 */
+	int i;
+
+	spin_lock_init(&peertab->write_lock);
+	INIT_LIST_HEAD(&peertab->dead_dsts);
+	peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
+				   sizeof(*peertab->buckets));
+	if (!peertab->buckets)
+		return -ENOMEM;
+	for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++)
+		INIT_HLIST_HEAD(&peertab->buckets[i]);
+	return 0;
+}
+
+/**
+ * homa_peertab_destroy() - Destructor for homa_peertabs. After this
+ * function returns, it is unsafe to use any results from previous calls
+ * to homa_peer_find, since all existing homa_peer objects will have been
+ * destroyed.
+ * @peertab:  The table to destroy.
+ */
+void homa_peertab_destroy(struct homa_peertab *peertab)
+{
+	struct hlist_node *next;
+	struct homa_peer *peer;
+	int i;
+
+	if (!peertab->buckets)
+		return;
+
+	for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
+		hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
+					  peertab_links) {
+			dst_release(peer->dst);
+			kfree(peer);
+		}
+	}
+	vfree(peertab->buckets);
+	homa_peertab_gc_dsts(peertab, ~0);
+}
+
+/**
+ * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
+ * safe to do so.
+ * @peertab:       The table in which to free entries.
+ * @now:           Current time, in get_cycles units; entries with expiration
+ *                 dates no later than this will be freed. Specify ~0 to
+ *                 free all entries.
+ */
+void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now)
+{
+	while (!list_empty(&peertab->dead_dsts)) {
+		struct homa_dead_dst *dead = list_first_entry(&peertab->dead_dsts,
+							      struct homa_dead_dst,
+							dst_links);
+		if (dead->gc_time > now)
+			break;
+		dst_release(dead->dst);
+		list_del(&dead->dst_links);
+		kfree(dead);
+	}
+}
+
+/**
+ * homa_peer_find() - Returns the peer associated with a given host; creates
+ * a new homa_peer if one doesn't already exist.
+ * @peertab:    Peer table in which to perform lookup.
+ * @addr:       Address of the desired host: IPv4 addresses are represented
+ *              as IPv4-mapped IPv6 addresses.
+ * @inet:       Socket that will be used for sending packets.
+ *
+ * Return:      The peer associated with @addr, or a negative errno if an
+ *              error occurred. The caller can retain this pointer
+ *              indefinitely: peer entries are never deleted except in
+ *              homa_peertab_destroy.
+ */
+struct homa_peer *homa_peer_find(struct homa_peertab *peertab,
+				 const struct in6_addr *addr,
+				 struct inet_sock *inet)
+{
+	/* Note: this function uses RCU operators to ensure safety even
+	 * if a concurrent call is adding a new entry.
+	 */
+	struct homa_peer *peer;
+	struct dst_entry *dst;
+
+	// Should use siphash or jhash here:
+	__u32 bucket = hash_32(addr->in6_u.u6_addr32[0], HOMA_PEERTAB_BUCKET_BITS);
+
+	bucket ^= hash_32(addr->in6_u.u6_addr32[1], HOMA_PEERTAB_BUCKET_BITS);
+	bucket ^= hash_32(addr->in6_u.u6_addr32[2], HOMA_PEERTAB_BUCKET_BITS);
+	bucket ^= hash_32(addr->in6_u.u6_addr32[3], HOMA_PEERTAB_BUCKET_BITS);
+	hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
+				 peertab_links) {
+		if (ipv6_addr_equal(&peer->addr, addr))
+			return peer;
+	}
+
+	/* No existing entry; create a new one.
+	 *
+	 * Note: after we acquire the lock, we have to check again to
+	 * make sure the entry still doesn't exist (it might have been
+	 * created by a concurrent invocation of this function).
+	 */
+	spin_lock_bh(&peertab->write_lock);
+	hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
+				 peertab_links) {
+		if (ipv6_addr_equal(&peer->addr, addr))
+			goto done;
+	}
+	peer = kmalloc(sizeof(*peer), GFP_ATOMIC);
+	if (!peer) {
+		peer = (struct homa_peer *)ERR_PTR(-ENOMEM);
+		goto done;
+	}
+	peer->addr = *addr;
+	dst = homa_peer_get_dst(peer, inet);
+	if (IS_ERR(dst)) {
+		kfree(peer);
+		peer = (struct homa_peer *)PTR_ERR(dst);
+		goto done;
+	}
+	peer->dst = dst;
+	INIT_LIST_HEAD(&peer->grantable_rpcs);
+	INIT_LIST_HEAD(&peer->grantable_links);
+	hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
+	peer->outstanding_resends = 0;
+	peer->most_recent_resend = 0;
+	peer->least_recent_rpc = NULL;
+	peer->least_recent_ticks = 0;
+	peer->current_ticks = -1;
+	peer->resend_rpc = NULL;
+	peer->num_acks = 0;
+	spin_lock_init(&peer->ack_lock);
+
+done:
+	spin_unlock_bh(&peertab->write_lock);
+	return peer;
+}
+
+/**
+ * homa_dst_refresh() - This method is called when the dst for a peer is
+ * obsolete; it releases that dst and creates a new one.
+ * @peertab:  Table containing the peer.
+ * @peer:     Peer whose dst is obsolete.
+ * @hsk:      Socket that will be used to transmit data to the peer.
+ */
+void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
+		      struct homa_sock *hsk)
+{
+	struct dst_entry *dst;
+
+	spin_lock_bh(&peertab->write_lock);
+	dst = homa_peer_get_dst(peer, &hsk->inet);
+	if (!IS_ERR(dst)) {
+		struct homa_dead_dst *dead = (struct homa_dead_dst *)
+				kmalloc(sizeof(*dead), GFP_KERNEL);
+		if (unlikely(!dead)) {
+			/* Can't allocate memory to keep track of the
+			 * dead dst; just free it immediately (a bit
+			 * risky, admittedly).
+			 */
+			dst_release(peer->dst);
+		} else {
+			__u64 now = get_cycles();
+
+			dead->dst = peer->dst;
+			dead->gc_time = now + (cpu_khz << 7);
+			list_add_tail(&dead->dst_links, &peertab->dead_dsts);
+			homa_peertab_gc_dsts(peertab, now);
+		}
+		peer->dst = dst;
+	}
+	spin_unlock_bh(&peertab->write_lock);
+}
+
+/**
+ * homa_peer_get_dst() - Find an appropriate dst structure (either IPv4
+ * or IPv6) for a peer.
+ * @peer:   The peer for which a dst is needed. Note: this peer's flow
+ *          struct will be overwritten.
+ * @inet:   Socket that will be used for sending packets.
+ * Return:  The dst structure (or an ERR_PTR).
+ */
+struct dst_entry *homa_peer_get_dst(struct homa_peer *peer,
+				    struct inet_sock *inet)
+{
+	memset(&peer->flow, 0, sizeof(peer->flow));
+	if (inet->sk.sk_family == AF_INET) {
+		struct rtable *rt;
+
+		flowi4_init_output(&peer->flow.u.ip4, inet->sk.sk_bound_dev_if,
+				   inet->sk.sk_mark, inet->tos, RT_SCOPE_UNIVERSE,
+				   inet->sk.sk_protocol, 0,
+				   peer->addr.in6_u.u6_addr32[3], inet->inet_saddr,
+				   0, 0, inet->sk.sk_uid);
+		security_sk_classify_flow(&inet->sk, &peer->flow.u.__fl_common);
+		rt = ip_route_output_flow(sock_net(&inet->sk),
+					  &peer->flow.u.ip4, &inet->sk);
+		if (IS_ERR(rt))
+			return (struct dst_entry *)(PTR_ERR(rt));
+		return &rt->dst;
+	}
+	peer->flow.u.ip6.flowi6_oif = inet->sk.sk_bound_dev_if;
+	peer->flow.u.ip6.flowi6_iif = LOOPBACK_IFINDEX;
+	peer->flow.u.ip6.flowi6_mark = inet->sk.sk_mark;
+	peer->flow.u.ip6.flowi6_scope = RT_SCOPE_UNIVERSE;
+	peer->flow.u.ip6.flowi6_proto = inet->sk.sk_protocol;
+	peer->flow.u.ip6.flowi6_flags = 0;
+	peer->flow.u.ip6.flowi6_secid = 0;
+	peer->flow.u.ip6.flowi6_tun_key.tun_id = 0;
+	peer->flow.u.ip6.flowi6_uid = inet->sk.sk_uid;
+	peer->flow.u.ip6.daddr = peer->addr;
+	peer->flow.u.ip6.saddr = inet->pinet6->saddr;
+	peer->flow.u.ip6.fl6_dport = 0;
+	peer->flow.u.ip6.fl6_sport = 0;
+	peer->flow.u.ip6.mp_hash = 0;
+	peer->flow.u.ip6.__fl_common.flowic_tos = inet->tos;
+	peer->flow.u.ip6.flowlabel = ip6_make_flowinfo(inet->tos, 0);
+	security_sk_classify_flow(&inet->sk, &peer->flow.u.__fl_common);
+	return ip6_dst_lookup_flow(sock_net(&inet->sk), &inet->sk,
+			&peer->flow.u.ip6, NULL);
+}
+
+/**
+ * homa_peer_lock_slow() - This function implements the slow path for
+ * acquiring a peer's @unacked_lock. It is invoked when the lock isn't
+ * immediately available. It waits for the lock, but also records statistics
+ * about the waiting time.
+ * @peer:    Peer to  lock.
+ */
+void homa_peer_lock_slow(struct homa_peer *peer)
+{
+	spin_lock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_peer_add_ack() - Add a given RPC to the list of unacked
+ * RPCs for its server. Once this method has been invoked, it's safe
+ * to delete the RPC, since it will eventually be acked to the server.
+ * @rpc:    Client RPC that has now completed.
+ */
+void homa_peer_add_ack(struct homa_rpc *rpc)
+{
+	struct homa_peer *peer = rpc->peer;
+	struct ack_header ack;
+
+	homa_peer_lock(peer);
+	if (peer->num_acks < HOMA_MAX_ACKS_PER_PKT) {
+		peer->acks[peer->num_acks].client_id = cpu_to_be64(rpc->id);
+		peer->acks[peer->num_acks].client_port = htons(rpc->hsk->port);
+		peer->acks[peer->num_acks].server_port = htons(rpc->dport);
+		peer->num_acks++;
+		homa_peer_unlock(peer);
+		return;
+	}
+
+	/* The peer has filled up; send an ACK message to empty it. The
+	 * RPC in the message header will also be considered ACKed.
+	 */
+	memcpy(ack.acks, peer->acks, sizeof(peer->acks));
+	ack.num_acks = htons(peer->num_acks);
+	peer->num_acks = 0;
+	homa_peer_unlock(peer);
+	homa_xmit_control(ACK, &ack, sizeof(ack), rpc);
+}
+
+/**
+ * homa_peer_get_acks() - Copy acks out of a peer, and remove them from the
+ * peer.
+ * @peer:    Peer to check for possible unacked RPCs.
+ * @count:   Maximum number of acks to return.
+ * @dst:     The acks are copied to this location.
+ *
+ * Return:   The number of acks extracted from the peer (<= count).
+ */
+int homa_peer_get_acks(struct homa_peer *peer, int count, struct homa_ack *dst)
+{
+	/* Don't waste time acquiring the lock if there are no ids available. */
+	if (peer->num_acks == 0)
+		return 0;
+
+	homa_peer_lock(peer);
+
+	if (count > peer->num_acks)
+		count = peer->num_acks;
+	memcpy(dst, &peer->acks[peer->num_acks - count],
+	       count * sizeof(peer->acks[0]));
+	peer->num_acks -= count;
+
+	homa_peer_unlock(peer);
+	return count;
+}
diff --git a/net/homa/homa_peer.h b/net/homa/homa_peer.h
new file mode 100644
index 000000000000..ba222a77c585
--- /dev/null
+++ b/net/homa/homa_peer.h
@@ -0,0 +1,227 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions related to managing peers (homa_peer
+ * and homa_peertab).
+ */
+
+#ifndef _HOMA_PEER_H
+#define _HOMA_PEER_H
+
+#include "homa_wire.h"
+#include "homa_sock.h"
+
+struct homa_rpc;
+
+/**
+ * struct homa_dead_dst - Used to retain dst_entries that are no longer
+ * needed, until it is safe to delete them (I'm not confident that the RCU
+ * mechanism will be safe for these: the reference count could get incremented
+ * after it's on the RCU list?).
+ */
+struct homa_dead_dst {
+	/** @dst: Entry that is no longer used by a struct homa_peer. */
+	struct dst_entry *dst;
+
+	/**
+	 * @gc_time: Time (in units of get_cycles) when it is safe
+	 * to free @dst.
+	 */
+	__u64 gc_time;
+
+	/** @dst_links: Used to link together entries in peertab->dead_dsts. */
+	struct list_head dst_links;
+};
+
+/**
+ * define HOMA_PEERTAB_BUCKETS - Number of bits in the bucket index for a
+ * homa_peertab.  Should be large enough to hold an entry for every server
+ * in a datacenter without long hash chains.
+ */
+#define HOMA_PEERTAB_BUCKET_BITS 16
+
+/** define HOME_PEERTAB_BUCKETS - Number of buckets in a homa_peertab. */
+#define HOMA_PEERTAB_BUCKETS BIT(HOMA_PEERTAB_BUCKET_BITS)
+
+/**
+ * struct homa_peertab - A hash table that maps from IPv6 addresses
+ * to homa_peer objects. IPv4 entries are encapsulated as IPv6 addresses.
+ * Entries are gradually added to this table, but they are never removed
+ * except when the entire table is deleted. We can't safely delete because
+ * results returned by homa_peer_find may be retained indefinitely.
+ *
+ * This table is managed exclusively by homa_peertab.c, using RCU to
+ * permit efficient lookups.
+ */
+struct homa_peertab {
+	/**
+	 * @write_lock: Synchronizes addition of new entries; not needed
+	 * for lookups (RCU is used instead).
+	 */
+	spinlock_t write_lock;
+
+	/**
+	 * @dead_dsts: List of dst_entries that are waiting to be deleted.
+	 * Hold @write_lock when manipulating.
+	 */
+	struct list_head dead_dsts;
+
+	/**
+	 * @buckets: Pointer to heads of chains of homa_peers for each bucket.
+	 * Malloc-ed, and must eventually be freed. NULL means this structure
+	 * has not been initialized.
+	 */
+	struct hlist_head *buckets;
+};
+
+/**
+ * struct homa_peer - One of these objects exists for each machine that we
+ * have communicated with (either as client or server).
+ */
+struct homa_peer {
+	/**
+	 * @addr: IPv6 address for the machine (IPv4 addresses are stored
+	 * as IPv4-mapped IPv6 addresses).
+	 */
+	struct in6_addr addr;
+
+	/** @flow: Addressing info needed to send packets. */
+	struct flowi flow;
+
+	/**
+	 * @dst: Used to route packets to this peer; we own a reference
+	 * to this, which we must eventually release.
+	 */
+	struct dst_entry *dst;
+
+	/**
+	 * grantable_rpcs: Contains all homa_rpcs (both requests and
+	 * responses) involving this peer whose msgins require (or required
+	 * them in the past) and have not been fully received. The list is
+	 * sorted in priority order (head has fewest bytes_remaining).
+	 * Locked with homa->grantable_lock.
+	 */
+	struct list_head grantable_rpcs;
+
+	/**
+	 * @grantable_links: Used to link this peer into homa->grantable_peers.
+	 * If this RPC is not linked into homa->grantable_peers, this is an
+	 * empty list pointing to itself.
+	 */
+	struct list_head grantable_links;
+
+	/**
+	 * @peertab_links: Links this object into a bucket of its
+	 * homa_peertab.
+	 */
+	struct hlist_node peertab_links;
+
+	/**
+	 * @outstanding_resends: the number of resend requests we have
+	 * sent to this server (spaced @homa.resend_interval apart) since
+	 * we received a packet from this peer.
+	 */
+	int outstanding_resends;
+
+	/**
+	 * @most_recent_resend: @homa->timer_ticks when the most recent
+	 * resend was sent to this peer.
+	 */
+	int most_recent_resend;
+
+	/**
+	 * @least_recent_rpc: of all the RPCs for this peer scanned at
+	 * @current_ticks, this is the RPC whose @resend_timer_ticks
+	 * is farthest in the past.
+	 */
+	struct homa_rpc *least_recent_rpc;
+
+	/**
+	 * @least_recent_ticks: the @resend_timer_ticks value for
+	 * @least_recent_rpc.
+	 */
+	__u32 least_recent_ticks;
+
+	/**
+	 * @current_ticks: the value of @homa->timer_ticks the last time
+	 * that @least_recent_rpc and @least_recent_ticks were computed.
+	 * Used to detect the start of a new homa_timer pass.
+	 */
+	__u32 current_ticks;
+
+	/**
+	 * @resend_rpc: the value of @least_recent_rpc computed in the
+	 * previous homa_timer pass. This RPC will be issued a RESEND
+	 * in the current pass, if it still needs one.
+	 */
+	struct homa_rpc *resend_rpc;
+
+	/**
+	 * @num_acks: the number of (initial) entries in @acks that
+	 * currently hold valid information.
+	 */
+	int num_acks;
+
+	/**
+	 * @acks: info about client RPCs whose results have been completely
+	 * received.
+	 */
+	struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+
+	/**
+	 * @ack_lock: used to synchronize access to @num_acks and @acks.
+	 */
+	spinlock_t ack_lock;
+};
+
+void              homa_dst_refresh(struct homa_peertab *peertab,
+				   struct homa_peer *peer,
+				   struct homa_sock *hsk);
+void              homa_peertab_destroy(struct homa_peertab *peertab);
+int               homa_peertab_init(struct homa_peertab *peertab);
+void              homa_peer_add_ack(struct homa_rpc *rpc);
+struct homa_peer *homa_peer_find(struct homa_peertab *peertab,
+				 const struct in6_addr *addr,
+				 struct inet_sock *inet);
+int               homa_peer_get_acks(struct homa_peer *peer, int count,
+				     struct homa_ack *dst);
+struct dst_entry *homa_peer_get_dst(struct homa_peer *peer,
+				    struct inet_sock *inet);
+void              homa_peer_lock_slow(struct homa_peer *peer);
+void              homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now);
+
+/**
+ * homa_peer_lock() - Acquire the lock for a peer's @unacked_lock. If the lock
+ * isn't immediately available, record stats on the waiting time.
+ * @peer:    Peer to lock.
+ */
+static inline void homa_peer_lock(struct homa_peer *peer)
+{
+	if (!spin_trylock_bh(&peer->ack_lock))
+		homa_peer_lock_slow(peer);
+}
+
+/**
+ * homa_peer_unlock() - Release the lock for a peer's @unacked_lock.
+ * @peer:   Peer to lock.
+ */
+static inline void homa_peer_unlock(struct homa_peer *peer)
+{
+	spin_unlock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_get_dst() - Returns destination information associated with a peer,
+ * updating it if the cached information is stale.
+ * @peer:   Peer whose destination information is desired.
+ * @hsk:    Homa socket; needed by lower-level code to recreate the dst.
+ * Return   Up-to-date destination for peer.
+ */
+static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
+					     struct homa_sock *hsk)
+{
+	if (unlikely(peer->dst->obsolete > 0))
+		homa_dst_refresh(hsk->homa->peers, peer, hsk);
+	return peer->dst;
+}
+
+#endif /* _HOMA_PEER_H */

From patchwork Mon Oct 28 21:35:34 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854177
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id C30C51E22E6
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:20 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151384; cv=none;
 b=acSczFAHmkyUJd+pksRtkenID6oSOJUPgR1Ezyt5WgcYjqWIJrUUkPyHgpfilZXkupiVpUGV9qC2tUn7kQiwwUSfb9yyumyfRmI6aJowxss40JM/XDD8h1/Sc/uRLGNITNa7M7uTTsQgbJFalOj5x6ulXIHAPQcp2Vnr1Ege+y8=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151384; c=relaxed/simple;
	bh=ZctzJyZbMZ94uCQFWLNgp2QdoPfjEIksWB8LzeH2z0I=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=fDto41WI9RqA6gVfm/4vqZ9xtYGQAn6Jno5QkRnldzWORtQLDUM5N4DVv7qONCEDxdDV7a1+Fd6F0OO1nL2yPea4oghOlZJU5f8xjL4aA+yxTnSSPGIv9H5GAM8adDbcqof39vlm2e6/mgJEyDQ2NyyjY68EoabyR8vvsdRuPj4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=TxpIqGv0; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="TxpIqGv0"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=a6tBn/tV56L1/dnkfZQauPx/bh2px+sv91sHFyxVfkU=; t=1730151380; x=1731015380;
	b=TxpIqGv0UY5fVg+ieHwz8cOQ8wUEUIR8Fog73VmoZLUzpWYuZBvLjrlb2e0H4kGhFctRtPZ0l+2
	ddA3rmZ2tfVPt38jajcI2wxaDLwHpJFSU96VHeXwLjGaneiYNQI/tLdphu4V9ubYfjGAqWsTUxpYB
	YYfL+UR9wjSJyhX4Xfz/6DCMSbCRT+lTHxUq5ghDIgGFN+ULczItV+ZvQkCfYP0neSFlscJlFX4fv
	48LxkasxuAdyh1I6IqDs9g04yREQJovYmP8fAJmHkTsGrgvSL0bBUR4gv3WNec6djvQZPf6hkCLIo
	HQfleEC/wtjpEhrUWhhRWVEG4tQdh2Ll0gGQ==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPP-0005xj-7v; Mon, 28 Oct 2024 14:36:20 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 07/12] net: homa: create homa_sock.h and homa_sock.c
Date: Mon, 28 Oct 2024 14:35:34 -0700
Message-ID: <20241028213541.1529-8-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 77334d8d872153c9350eeec3077c0d46
X-Patchwork-Delegate: kuba@kernel.org

These files provide functions for managing the state that Homa keeps
for each open Homa socket.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_sock.c | 329 +++++++++++++++++++++++++++++++++++
 net/homa/homa_sock.h | 399 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 728 insertions(+)
 create mode 100644 net/homa/homa_sock.c
 create mode 100644 net/homa/homa_sock.h

diff --git a/net/homa/homa_sock.c b/net/homa/homa_sock.c
new file mode 100644
index 000000000000..f02a0394fb00
--- /dev/null
+++ b/net/homa/homa_sock.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file manages homa_sock and homa_socktab objects. */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_socktab_init() - Constructor for homa_socktabs.
+ * @socktab:  The object to initialize; previous contents are discarded.
+ */
+void homa_socktab_init(struct homa_socktab *socktab)
+{
+	int i;
+
+	spin_lock_init(&socktab->write_lock);
+	for (i = 0; i < HOMA_SOCKTAB_BUCKETS; i++)
+		INIT_HLIST_HEAD(&socktab->buckets[i]);
+}
+
+/**
+ * homa_socktab_destroy() - Destructor for homa_socktabs.
+ * @socktab:  The object to destroy.
+ */
+void homa_socktab_destroy(struct homa_socktab *socktab)
+{
+	struct homa_socktab_scan scan;
+	struct homa_sock *hsk;
+
+	for (hsk = homa_socktab_start_scan(socktab, &scan); hsk;
+			hsk = homa_socktab_next(&scan)) {
+		homa_sock_destroy(hsk);
+	}
+}
+
+/**
+ * homa_socktab_start_scan() - Begin an iteration over all of the sockets
+ * in a socktab.
+ * @socktab:   Socktab to scan.
+ * @scan:      Will hold the current state of the scan; any existing
+ *             contents are discarded.
+ *
+ * Return:     The first socket in the table, or NULL if the table is
+ *             empty.
+ *
+ * Each call to homa_socktab_next will return the next socket in the table.
+ * All sockets that are present in the table at the time this function is
+ * invoked will eventually be returned, as long as they are not removed
+ * from the table. It is safe to remove sockets from the table and/or
+ * delete them while the scan is in progress. If a socket is removed from
+ * the table during the scan, it may or may not be returned by
+ * homa_socktab_next. New entries added during the scan may or may not be
+ * returned. The caller should use RCU to prevent socket storage from
+ * being reclaimed during the scan.
+ */
+struct homa_sock *homa_socktab_start_scan(struct homa_socktab *socktab,
+					  struct homa_socktab_scan *scan)
+{
+	scan->socktab = socktab;
+	scan->current_bucket = -1;
+	scan->next = NULL;
+	return homa_socktab_next(scan);
+}
+
+/**
+ * homa_socktab_next() - Return the next socket in an iteration over a socktab.
+ * @scan:      State of the scan.
+ *
+ * Return:     The next socket in the table, or NULL if the iteration has
+ *             returned all of the sockets in the table. Sockets are not
+ *             returned in any particular order. It's possible that the
+ *             returned socket has been destroyed.
+ */
+struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
+{
+	struct homa_socktab_links *links;
+	struct homa_sock *hsk;
+
+	while (1) {
+		while (!scan->next) {
+			scan->current_bucket++;
+			if (scan->current_bucket >= HOMA_SOCKTAB_BUCKETS)
+				return NULL;
+			scan->next = (struct homa_socktab_links *)
+				      hlist_first_rcu(&scan->socktab->buckets
+				      [scan->current_bucket]);
+		}
+		links = scan->next;
+		hsk = links->sock;
+		scan->next = (struct homa_socktab_links *)hlist_next_rcu(&links
+				->hash_links);
+		return hsk;
+	}
+}
+
+/**
+ * homa_sock_init() - Constructor for homa_sock objects. This function
+ * initializes only the parts of the socket that are owned by Homa.
+ * @hsk:    Object to initialize.
+ * @homa:   Homa implementation that will manage the socket.
+ *
+ * Return: always 0 (success).
+ */
+void homa_sock_init(struct homa_sock *hsk, struct homa *homa)
+{
+	struct homa_socktab *socktab = homa->port_map;
+	int i;
+
+	spin_lock_bh(&socktab->write_lock);
+	atomic_set(&hsk->protect_count, 0);
+	spin_lock_init(&hsk->lock);
+	hsk->last_locker = "none";
+	atomic_set(&hsk->protect_count, 0);
+	hsk->homa = homa;
+	hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET)
+			? HOMA_IPV4_HEADER_LENGTH : HOMA_IPV6_HEADER_LENGTH;
+	hsk->shutdown = false;
+	while (1) {
+		if (homa->next_client_port < HOMA_MIN_DEFAULT_PORT)
+			homa->next_client_port = HOMA_MIN_DEFAULT_PORT;
+		if (!homa_sock_find(socktab, homa->next_client_port))
+			break;
+		homa->next_client_port++;
+	}
+	hsk->port = homa->next_client_port;
+	hsk->inet.inet_num = hsk->port;
+	hsk->inet.inet_sport = htons(hsk->port);
+	homa->next_client_port++;
+	hsk->socktab_links.sock = hsk;
+	hlist_add_head_rcu(&hsk->socktab_links.hash_links,
+			   &socktab->buckets[homa_port_hash(hsk->port)]);
+	INIT_LIST_HEAD(&hsk->active_rpcs);
+	INIT_LIST_HEAD(&hsk->dead_rpcs);
+	hsk->dead_skbs = 0;
+	INIT_LIST_HEAD(&hsk->waiting_for_bufs);
+	INIT_LIST_HEAD(&hsk->ready_requests);
+	INIT_LIST_HEAD(&hsk->ready_responses);
+	INIT_LIST_HEAD(&hsk->request_interests);
+	INIT_LIST_HEAD(&hsk->response_interests);
+	for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
+		struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
+
+		spin_lock_init(&bucket->lock);
+		INIT_HLIST_HEAD(&bucket->rpcs);
+		bucket->id = i;
+	}
+	for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
+		struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
+
+		spin_lock_init(&bucket->lock);
+		INIT_HLIST_HEAD(&bucket->rpcs);
+		bucket->id = i + 1000000;
+	}
+	hsk->buffer_pool = kzalloc(sizeof(*hsk->buffer_pool), GFP_KERNEL);
+	spin_unlock_bh(&socktab->write_lock);
+}
+
+/**
+ * homa_sock_shutdown() - Disable a socket so that it can no longer
+ * be used for either sending or receiving messages. Any system calls
+ * currently waiting to send or receive messages will be aborted.
+ * @hsk:       Socket to shut down.
+ */
+void homa_sock_shutdown(struct homa_sock *hsk)
+{
+	struct homa_interest *interest;
+	struct homa_rpc *rpc;
+	int i;
+
+	homa_sock_lock(hsk, "homa_socket_shutdown");
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		return;
+	}
+
+	/* The order of cleanup is very important, because there could be
+	 * active operations that hold RPC locks but not the socket lock.
+	 * 1. Set @shutdown; this ensures that no new RPCs will be created for
+	 *    this socket (though some creations might already be in progress).
+	 * 2. Remove the socket from the port map: this ensures that
+	 *    incoming packets for the socket will be dropped.
+	 * 3. Go through all of the RPCs and delete them; this will
+	 *    synchronize with any operations in progress.
+	 * 4. Perform other socket cleanup: at this point we know that
+	 *    there will be no concurrent activities on individual RPCs.
+	 * 5. Don't delete the buffer pool until after all of the RPCs
+	 *    have been reaped.
+	 * See sync.txt for additional information about locking.
+	 */
+	hsk->shutdown = true;
+	spin_lock_bh(&hsk->homa->port_map->write_lock);
+	hlist_del_rcu(&hsk->socktab_links.hash_links);
+	spin_unlock_bh(&hsk->homa->port_map->write_lock);
+	homa_sock_unlock(hsk);
+
+	list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+		homa_rpc_lock(rpc, "homa_sock_shutdown");
+		homa_rpc_free(rpc);
+		homa_rpc_unlock(rpc);
+	}
+
+	homa_sock_lock(hsk, "homa_socket_shutdown #2");
+	list_for_each_entry(interest, &hsk->request_interests, request_links)
+		wake_up_process(interest->thread);
+	list_for_each_entry(interest, &hsk->response_interests, response_links)
+		wake_up_process(interest->thread);
+	homa_sock_unlock(hsk);
+
+	i = 0;
+	while (!list_empty(&hsk->dead_rpcs)) {
+		homa_rpc_reap(hsk, 1000);
+		i++;
+	}
+
+	homa_pool_destroy(hsk->buffer_pool);
+	kfree(hsk->buffer_pool);
+	hsk->buffer_pool = NULL;
+}
+
+/**
+ * homa_sock_destroy() - Destructor for homa_sock objects. This function
+ * only cleans up the parts of the object that are owned by Homa.
+ * @hsk:       Socket to destroy.
+ */
+void homa_sock_destroy(struct homa_sock *hsk)
+{
+	homa_sock_shutdown(hsk);
+	sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
+}
+
+/**
+ * homa_sock_bind() - Associates a server port with a socket; if there
+ * was a previous server port assignment for @hsk, it is abandoned.
+ * @socktab:   Hash table in which the binding will be recorded.
+ * @hsk:       Homa socket.
+ * @port:      Desired server port for @hsk. If 0, then this call
+ *             becomes a no-op: the socket will continue to use
+ *             its randomly assigned client port.
+ *
+ * Return:  0 for success, otherwise a negative errno.
+ */
+int homa_sock_bind(struct homa_socktab *socktab, struct homa_sock *hsk,
+		   __u16 port)
+{
+	struct homa_sock *owner;
+	int result = 0;
+
+	if (port == 0)
+		return result;
+	if (port >= HOMA_MIN_DEFAULT_PORT)
+		return -EINVAL;
+	homa_sock_lock(hsk, "homa_sock_bind");
+	spin_lock_bh(&socktab->write_lock);
+	if (hsk->shutdown) {
+		result = -ESHUTDOWN;
+		goto done;
+	}
+
+	owner = homa_sock_find(socktab, port);
+	if (owner) {
+		if (owner != hsk)
+			result = -EADDRINUSE;
+		goto done;
+	}
+	hlist_del_rcu(&hsk->socktab_links.hash_links);
+	hsk->port = port;
+	hsk->inet.inet_num = port;
+	hsk->inet.inet_sport = htons(hsk->port);
+	hlist_add_head_rcu(&hsk->socktab_links.hash_links,
+			   &socktab->buckets[homa_port_hash(port)]);
+done:
+	spin_unlock_bh(&socktab->write_lock);
+	homa_sock_unlock(hsk);
+	return result;
+}
+
+/**
+ * homa_sock_find() - Returns the socket associated with a given port.
+ * @socktab:    Hash table in which to perform lookup.
+ * @port:       The port of interest.
+ * Return:      The socket that owns @port, or NULL if none.
+ *
+ * Note: this function uses RCU list-searching facilities, but it doesn't
+ * call rcu_read_lock. The caller should do that, if the caller cares (this
+ * way, the caller's use of the socket will also be protected).
+ */
+struct homa_sock *homa_sock_find(struct homa_socktab *socktab,  __u16 port)
+{
+	struct homa_socktab_links *link;
+	struct homa_sock *result = NULL;
+
+	hlist_for_each_entry_rcu(link, &socktab->buckets[homa_port_hash(port)],
+				 hash_links) {
+		struct homa_sock *hsk = link->sock;
+
+		if (hsk->port == port) {
+			result = hsk;
+			break;
+		}
+	}
+	return result;
+}
+
+/**
+ * homa_sock_lock_slow() - This function implements the slow path for
+ * acquiring a socketC lock. It is invoked when a socket lock isn't immediately
+ * available. It waits for the lock, but also records statistics about
+ * the waiting time.
+ * @hsk:    socket to  lock.
+ */
+void homa_sock_lock_slow(struct homa_sock *hsk)
+{
+	spin_lock_bh(&hsk->lock);
+}
+
+/**
+ * homa_bucket_lock_slow() - This function implements the slow path for
+ * locking a bucket in one of the hash tables of RPCs. It is invoked when a
+ * lock isn't immediately available. It waits for the lock, but also records
+ * statistics about the waiting time.
+ * @bucket:    The hash table bucket to lock.
+ * @id:        ID of the particular RPC being locked (multiple RPCs may
+ *             share a single bucket lock).
+ */
+void homa_bucket_lock_slow(struct homa_rpc_bucket *bucket, __u64 id)
+{
+	spin_lock_bh(&bucket->lock);
+}
diff --git a/net/homa/homa_sock.h b/net/homa/homa_sock.h
new file mode 100644
index 000000000000..bd56c7345a99
--- /dev/null
+++ b/net/homa/homa_sock.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines structs and other things related to Homa sockets.  */
+
+#ifndef _HOMA_SOCK_H
+#define _HOMA_SOCK_H
+
+/* Forward declarations. */
+struct homa;
+struct homa_pool;
+
+void     homa_sock_lock_slow(struct homa_sock *hsk);
+
+/**
+ * define HOMA_SOCKTAB_BUCKETS - Number of hash buckets in a homa_socktab.
+ * Must be a power of 2.
+ */
+#define HOMA_SOCKTAB_BUCKETS 1024
+
+/**
+ * struct homa_socktab - A hash table that maps from port numbers (either
+ * client or server) to homa_sock objects.
+ *
+ * This table is managed exclusively by homa_socktab.c, using RCU to
+ * minimize synchronization during lookups.
+ */
+struct homa_socktab {
+	/**
+	 * @mutex: Controls all modifications to this object; not needed
+	 * for socket lookups (RCU is used instead). Also used to
+	 * synchronize port allocation.
+	 */
+	spinlock_t write_lock;
+
+	/**
+	 * @buckets: Heads of chains for hash table buckets. Chains
+	 * consist of homa_socktab_link objects.
+	 */
+	struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
+};
+
+/**
+ * struct homa_socktab_links - Used to link homa_socks into the hash chains
+ * of a homa_socktab.
+ */
+struct homa_socktab_links {
+	/* Must be the first element of the struct! */
+	struct hlist_node hash_links;
+	struct homa_sock *sock;
+};
+
+/**
+ * struct homa_socktab_scan - Records the state of an iteration over all
+ * the entries in a homa_socktab, in a way that permits RCU-safe deletion
+ * of entries.
+ */
+struct homa_socktab_scan {
+	/** @socktab: The table that is being scanned. */
+	struct homa_socktab *socktab;
+
+	/**
+	 * @current_bucket: the index of the bucket in socktab->buckets
+	 * currently being scanned. If >= HOMA_SOCKTAB_BUCKETS, the scan
+	 * is complete.
+	 */
+	int current_bucket;
+
+	/**
+	 * @next: the next socket to return from homa_socktab_next (this
+	 * socket has not yet been returned). NULL means there are no
+	 * more sockets in the current bucket.
+	 */
+	struct homa_socktab_links *next;
+};
+
+/**
+ * struct homa_rpc_bucket - One bucket in a hash table of RPCs.
+ */
+
+struct homa_rpc_bucket {
+	/**
+	 * @lock: serves as a lock both for this bucket (e.g., when
+	 * adding and removing RPCs) and also for all of the RPCs in
+	 * the bucket. Must be held whenever manipulating an RPC in
+	 * this bucket. This dual purpose permits clean and safe
+	 * deletion and garbage collection of RPCs.
+	 */
+	spinlock_t lock;
+
+	/** @rpcs: list of RPCs that hash to this bucket. */
+	struct hlist_head rpcs;
+
+	/** @id: identifier for this bucket, used in error messages etc.
+	 * It's the index of the bucket within its hash table bucket
+	 * array, with an additional offset to separate server and
+	 * client RPCs.
+	 */
+	int id;
+};
+
+/**
+ * define HOMA_CLIENT_RPC_BUCKETS - Number of buckets in hash tables for
+ * client RPCs. Must be a power of 2.
+ */
+#define HOMA_CLIENT_RPC_BUCKETS 1024
+
+/**
+ * define HOMA_SERVER_RPC_BUCKETS - Number of buckets in hash tables for
+ * server RPCs. Must be a power of 2.
+ */
+#define HOMA_SERVER_RPC_BUCKETS 1024
+
+/**
+ * struct homa_sock - Information about an open socket.
+ */
+struct homa_sock {
+	/* Info for other network layers. Note: IPv6 info (struct ipv6_pinfo
+	 * comes at the very end of the struct, *after* Homa's data, if this
+	 * socket uses IPv6).
+	 */
+	union {
+		/** @sock: generic socket data; must be the first field. */
+		struct sock sock;
+
+		/**
+		 * @inet: generic Internet socket data; must also be the
+		 first field (contains sock as its first member).
+		 */
+		struct inet_sock inet;
+	};
+
+	/**
+	 * @lock: Must be held when modifying fields such as interests
+	 * and lists of RPCs. This lock is used in place of sk->sk_lock
+	 * because it's used differently (it's always used as a simple
+	 * spin lock).  See sync.txt for more on Homa's synchronization
+	 * strategy.
+	 */
+	spinlock_t lock;
+
+	/**
+	 * @last_locker: identifies the code that most recently acquired
+	 * @lock successfully. Occasionally used for debugging.
+	 */
+	char *last_locker;
+
+	/**
+	 * @protect_count: counts the number of calls to homa_protect_rpcs
+	 * for which there have not yet been calls to homa_unprotect_rpcs.
+	 * See sync.txt for more info.
+	 */
+	atomic_t protect_count;
+
+	/**
+	 * @homa: Overall state about the Homa implementation. NULL
+	 * means this socket has been deleted.
+	 */
+	struct homa *homa;
+
+	/** @shutdown: True means the socket is no longer usable. */
+	bool shutdown;
+
+	/**
+	 * @port: Port number: identifies this socket uniquely among all
+	 * those on this node.
+	 */
+	__u16 port;
+
+	/**
+	 * @ip_header_length: Length of IP headers for this socket (depends
+	 * on IPv4 vs. IPv6).
+	 */
+	int ip_header_length;
+
+	/**
+	 * @client_socktab_links: Links this socket into the homa_socktab
+	 * based on @port.
+	 */
+	struct homa_socktab_links socktab_links;
+
+	/**
+	 * @active_rpcs: List of all existing RPCs related to this socket,
+	 * including both client and server RPCs. This list isn't strictly
+	 * needed, since RPCs are already in one of the hash tables below,
+	 * but it's more efficient for homa_timer to have this list
+	 * (so it doesn't have to scan large numbers of hash buckets).
+	 * The list is sorted, with the oldest RPC first. Manipulate with
+	 * RCU so timer can access without locking.
+	 */
+	struct list_head active_rpcs;
+
+	/**
+	 * @dead_rpcs: Contains RPCs for which homa_rpc_free has been
+	 * called, but their packet buffers haven't yet been freed.
+	 */
+	struct list_head dead_rpcs;
+
+	/** @dead_skbs: Total number of socket buffers in RPCs on dead_rpcs. */
+	int dead_skbs;
+
+	/**
+	 * @waiting_for_bufs: Contains RPCs that are blocked because there
+	 * wasn't enough space in the buffer pool region for their incoming
+	 * messages. Sorted in increasing order of message length.
+	 */
+	struct list_head waiting_for_bufs;
+
+	/**
+	 * @ready_requests: Contains server RPCs whose request message is
+	 * in a state requiring attention from  a user process. The head is
+	 * oldest, i.e. next to return.
+	 */
+	struct list_head ready_requests;
+
+	/**
+	 * @ready_responses: Contains client RPCs whose response message is
+	 * in a state requiring attention from a user process. The head is
+	 * oldest, i.e. next to return.
+	 */
+	struct list_head ready_responses;
+
+	/**
+	 * @request_interests: List of threads that want to receive incoming
+	 * request messages.
+	 */
+	struct list_head request_interests;
+
+	/**
+	 * @response_interests: List of threads that want to receive incoming
+	 * response messages.
+	 */
+	struct list_head response_interests;
+
+	/**
+	 * @client_rpc_buckets: Hash table for fast lookup of client RPCs.
+	 * Modifications are synchronized with bucket locks, not
+	 * the socket lock.
+	 */
+	struct homa_rpc_bucket client_rpc_buckets[HOMA_CLIENT_RPC_BUCKETS];
+
+	/**
+	 * @server_rpc_buckets: Hash table for fast lookup of server RPCs.
+	 * Modifications are synchronized with bucket locks, not
+	 * the socket lock.
+	 */
+	struct homa_rpc_bucket server_rpc_buckets[HOMA_SERVER_RPC_BUCKETS];
+
+	/**
+	 * @buffer_pool: used to allocate buffer space for incoming messages.
+	 * Storage is dynamically allocated.
+	 */
+	struct homa_pool *buffer_pool;
+};
+
+void     homa_bucket_lock_slow(struct homa_rpc_bucket *bucket, __u64 id);
+int      homa_sock_bind(struct homa_socktab *socktab,
+			struct homa_sock *hsk, __u16 port);
+void     homa_sock_destroy(struct homa_sock *hsk);
+struct homa_sock *
+		    homa_sock_find(struct homa_socktab *socktab, __u16 port);
+void     homa_sock_init(struct homa_sock *hsk, struct homa *homa);
+void     homa_sock_shutdown(struct homa_sock *hsk);
+int      homa_socket(struct sock *sk);
+void     homa_socktab_destroy(struct homa_socktab *socktab);
+void     homa_socktab_init(struct homa_socktab *socktab);
+struct homa_sock
+	       *homa_socktab_next(struct homa_socktab_scan *scan);
+struct homa_sock
+	       *homa_socktab_start_scan(struct homa_socktab *socktab,
+		    struct homa_socktab_scan *scan);
+
+/**
+ * homa_sock_lock() - Acquire the lock for a socket. If the socket
+ * isn't immediately available, record stats on the waiting time.
+ * @hsk:     Socket to lock.
+ * @locker:  Static string identifying where the socket was locked;
+ *           used to track down deadlocks.
+ */
+static inline void homa_sock_lock(struct homa_sock *hsk, const char *locker)
+{
+	if (!spin_trylock_bh(&hsk->lock)) {
+//		printk(KERN_NOTICE "Slow path for socket %d, last locker %s",
+//				hsk->client_port, hsk->last_locker);
+		homa_sock_lock_slow(hsk);
+	}
+//	hsk->last_locker = locker;
+}
+
+/**
+ * homa_sock_unlock() - Release the lock for a socket.
+ * @hsk:   Socket to lock.
+ */
+static inline void homa_sock_unlock(struct homa_sock *hsk)
+{
+	spin_unlock_bh(&hsk->lock);
+}
+
+/**
+ * port_hash() - Hash function for port numbers.
+ * @port:   Port number being looked up.
+ *
+ * Return:  The index of the bucket in which this port will be found (if
+ *          it exists.
+ */
+static inline int homa_port_hash(__u16 port)
+{
+	/* We can use a really simple hash function here because client
+	 * port numbers are allocated sequentially and server port numbers
+	 * are unpredictable.
+	 */
+	return port & (HOMA_SOCKTAB_BUCKETS - 1);
+}
+
+/**
+ * homa_client_rpc_bucket() - Find the bucket containing a given
+ * client RPC.
+ * @hsk:      Socket associated with the RPC.
+ * @id:       Id of the desired RPC.
+ *
+ * Return:    The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket *homa_client_rpc_bucket(struct homa_sock *hsk,
+							     __u64 id)
+{
+	/* We can use a really simple hash function here because RPC ids
+	 * are allocated sequentially.
+	 */
+	return &hsk->client_rpc_buckets[(id >> 1)
+			& (HOMA_CLIENT_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_server_rpc_bucket() - Find the bucket containing a given
+ * server RPC.
+ * @hsk:         Socket associated with the RPC.
+ * @id:          Id of the desired RPC.
+ *
+ * Return:    The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket *homa_server_rpc_bucket(struct homa_sock *hsk,
+							     __u64 id)
+{
+	/* Each client allocates RPC ids sequentially, so they will
+	 * naturally distribute themselves across the hash space.
+	 * Thus we can use the id directly as hash.
+	 */
+	return &hsk->server_rpc_buckets[(id >> 1)
+			& (HOMA_SERVER_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_bucket_lock() - Acquire the lock for an RPC hash table bucket.
+ * @bucket:    Bucket to lock
+ * @id:        ID of the RPC that is requesting the lock. Normally ignored,
+ *             but used occasionally for diagnostics and debugging.
+ * @locker:    Static string identifying the locking code. Normally ignored,
+ *             but used occasionally for diagnostics and debugging.
+ */
+static inline void homa_bucket_lock(struct homa_rpc_bucket *bucket,
+				    __u64 id, const char *locker)
+{
+	if (!spin_trylock_bh(&bucket->lock))
+		homa_bucket_lock_slow(bucket, id);
+}
+
+/**
+ * homa_bucket_try_lock() - Acquire the lock for an RPC hash table bucket if
+ * it is available.
+ * @bucket:    Bucket to lock
+ * @id:        ID of the RPC that is requesting the lock.
+ * @locker:    Static string identifying the locking code. Normally ignored,
+ *             but used when debugging deadlocks.
+ * Return:     Nonzero if lock was successfully acquired, zero if it is
+ *             currently owned by someone else.
+ */
+static inline int homa_bucket_try_lock(struct homa_rpc_bucket *bucket,
+				       __u64 id, const char *locker)
+{
+	if (!spin_trylock_bh(&bucket->lock))
+		return 0;
+	return 1;
+}
+
+/**
+ * homa_bucket_unlock() - Release the lock for an RPC hash table bucket.
+ * @bucket:   Bucket to unlock.
+ * @id:       ID of the RPC that was using the lock.
+ */
+static inline void homa_bucket_unlock(struct homa_rpc_bucket *bucket, __u64 id)
+{
+	spin_unlock_bh(&bucket->lock);
+}
+
+static inline struct homa_sock *homa_sk(const struct sock *sk)
+{
+	return (struct homa_sock *)sk;
+}
+
+#endif /* _HOMA_SOCK_H */

From patchwork Mon Oct 28 21:35:35 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854179
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 370881E0E09
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:21 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151386; cv=none;
 b=Jz2x3ObmGlbUY6RxURbEOAWwJokTnrScv9Id+X8+IwLyNPfDAnXQI+mPmnji5kTr64LNt2yiPjVMzoQ3SvaNVkzVjueeBdxwD4Zemn1DCwNIZcTaWHxpmE7plDotEvfxz8ZrgVEUOCw6oa2ieeD2us2ByDRojCfaK2T8tekTCpY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151386; c=relaxed/simple;
	bh=I1JapYntQF8jpxEyVTDV9EJXVlXg89hOgLcAGnFR4t4=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=J/UuXMSWqU+AcnY0Ah5tZuW5AOyFaSsllILnNOCfD16s8cAYZr7YJhnOq1N0JYKescG8nX0DKrnu0mFf0f+CipAf30UQ60WhEobxa/j0qlj5LoncZeoF+V3Z3R+MNtlMzHtH6i8u8/50r8fojLVj0pZtVW2vRo2mwYJxr3VI9pg=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=RmLebACJ; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="RmLebACJ"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=NXApKGXIATZE+g/rN/XedxaiKnuRhRaiT4k4CAXcVNI=; t=1730151382; x=1731015382;
	b=RmLebACJY99dbYMtWY+zms5J8pRrmU3w4nMMXV27Jm8JXZmPfZx12SX9hLm297AFOuQfM1ddpbj
	3Xb4FfShmmQjz0I0N/BLPNGtu+4SgcBHLDar+X5EHCtPzJIbVu98w8ya5e66uQ5J9cJ4xncI+otUX
	f6n+SVkZ7rKtwnuVsjdUEQVn5w+SJqCvqFaEl8ViCKEu8La1BWIF2Oqs+IolP3uDbW4Qo/U5oSRoM
	ykCMnyrFwXy9r6lVbeoGVbEa3habhI7Ic5tz9G6cvA8FOHxJAMnalrikthtZXE5nn+mGvGilrXqmF
	k+r9e8FRze/zASo0asypVQWq8ZbBDmnnYzhw==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPQ-0005xj-Ar; Mon, 28 Oct 2024 14:36:21 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 08/12] net: homa: create homa_incoming.c
Date: Mon, 28 Oct 2024 14:35:35 -0700
Message-ID: <20241028213541.1529-9-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 15d5a4467fa81e6efcf55de4c2c34fa6
X-Patchwork-Delegate: kuba@kernel.org

This file contains most of the code for handling incoming packets,
including top-level dispatching code plus specific handlers for each
pack type. It also contains code for dispatching fully-received
messages to waiting application threads.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_incoming.c | 1088 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 1088 insertions(+)
 create mode 100644 net/homa/homa_incoming.c

diff --git a/net/homa/homa_incoming.c b/net/homa/homa_incoming.c
new file mode 100644
index 000000000000..c61e5e250da1
--- /dev/null
+++ b/net/homa/homa_incoming.c
@@ -0,0 +1,1088 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions that handle incoming Homa messages, including
+ * both receiving information for those messages and sending grants.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_message_in_init() - Constructor for homa_message_in.
+ * @rpc:          RPC whose msgin structure should be initialized.
+ * @length:       Total number of bytes in message.
+ * Return:        Zero for successful initialization, or a negative errno
+ *                if rpc->msgin could not be initialized.
+ */
+int homa_message_in_init(struct homa_rpc *rpc, int length)
+{
+	int err;
+
+	rpc->msgin.length = length;
+	skb_queue_head_init(&rpc->msgin.packets);
+	rpc->msgin.recv_end = 0;
+	INIT_LIST_HEAD(&rpc->msgin.gaps);
+	rpc->msgin.bytes_remaining = length;
+	rpc->msgin.resend_all = 0;
+	rpc->msgin.num_bpages = 0;
+	err = homa_pool_allocate(rpc);
+	if (err != 0)
+		return err;
+	return 0;
+}
+
+/**
+ * homa_gap_new() - Create a new gap and add it to a list.
+ * @next:   Add the new gap just before this list element.
+ * @start:  Offset of first byte covered by the gap.
+ * @end:    Offset of byte just after the last one covered by the gap.
+ * Return:  Pointer to the new gap.
+ */
+struct homa_gap *homa_gap_new(struct list_head *next, int start, int end)
+{
+	struct homa_gap *gap;
+
+	gap = kmalloc(sizeof(*gap), GFP_KERNEL);
+	gap->start = start;
+	gap->end = end;
+	gap->time = get_cycles();
+	list_add_tail(&gap->links, next);
+	return gap;
+}
+
+/**
+ * homa_gap_retry() - Send RESEND requests for all of the unreceived
+ * gaps in a message.
+ * @rpc:     RPC to check; must be locked by caller.
+ */
+void homa_gap_retry(struct homa_rpc *rpc)
+{
+	struct resend_header resend;
+	struct homa_gap *gap;
+
+	list_for_each_entry(gap, &rpc->msgin.gaps, links) {
+		resend.offset = htonl(gap->start);
+		resend.length = htonl(gap->end - gap->start);
+		homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+	}
+}
+
+/**
+ * homa_add_packet() - Add an incoming packet to the contents of a
+ * partially received message.
+ * @rpc:   Add the packet to the msgin for this RPC.
+ * @skb:   The new packet. This function takes ownership of the packet
+ *         (the packet will either be freed or added to rpc->msgin.packets).
+ */
+void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb)
+{
+	struct data_header *h = (struct data_header *)skb->data;
+	struct homa_gap *gap, *dummy, *gap2;
+	int start = ntohl(h->seg.offset);
+	int length = homa_data_len(skb);
+	int end = start + length;
+
+	if ((start + length) > rpc->msgin.length)
+		goto discard;
+
+	if (start == rpc->msgin.recv_end) {
+		/* Common case: packet is sequential. */
+		rpc->msgin.recv_end += length;
+		goto keep;
+	}
+
+	if (start > rpc->msgin.recv_end) {
+		/* Packet creates a new gap. */
+		homa_gap_new(&rpc->msgin.gaps, rpc->msgin.recv_end, start);
+		rpc->msgin.recv_end = end;
+		goto keep;
+	}
+
+	/* Must now check to see if the packet fills in part or all of
+	 * an existing gap.
+	 */
+	list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
+		/* Is packet at the start of this gap? */
+		if (start <= gap->start) {
+			if (end <= gap->start)
+				continue;
+			if (start < gap->start)
+				goto discard;
+			if (end > gap->end)
+				goto discard;
+			gap->start = end;
+			if (gap->start >= gap->end) {
+				list_del(&gap->links);
+				kfree(gap);
+			}
+			goto keep;
+		}
+
+		/* Is packet at the end of this gap? BTW, at this point we know
+		 * the packet can't cover the entire gap.
+		 */
+		if (end >= gap->end) {
+			if (start >= gap->end)
+				continue;
+			if (end > gap->end)
+				goto discard;
+			gap->end = start;
+			goto keep;
+		}
+
+		/* Packet is in the middle of the gap; must split the gap. */
+		gap2 = homa_gap_new(&gap->links, gap->start, start);
+		gap2->time = gap->time;
+		gap->start = end;
+		goto keep;
+	}
+
+discard:
+	kfree_skb(skb);
+	return;
+
+keep:
+	__skb_queue_tail(&rpc->msgin.packets, skb);
+	rpc->msgin.bytes_remaining -= length;
+}
+
+/**
+ * homa_copy_to_user() - Copy as much data as possible from incoming
+ * packet buffers to buffers in user space.
+ * @rpc:     RPC for which data should be copied. Must be locked by caller.
+ * Return:   Zero for success or a negative errno if there is an error.
+ *           It is possible for the RPC to be freed while this function
+ *           executes (it releases and reacquires the RPC lock). If that
+ *           happens, -EINVAL will be returned and the state of @rpc
+ *           will be RPC_DEAD.
+ */
+int homa_copy_to_user(struct homa_rpc *rpc)
+{
+#ifdef __UNIT_TEST__
+#define MAX_SKBS 3
+#else
+#define MAX_SKBS 20
+#endif
+	struct sk_buff *skbs[MAX_SKBS];
+	int error = 0;
+	int n = 0;             /* Number of filled entries in skbs. */
+	int i;
+
+	/* Tricky note: we can't hold the RPC lock while we're actually
+	 * copying to user space, because (a) it's illegal to hold a spinlock
+	 * while copying to user space and (b) we'd like for homa_softirq
+	 * to add more packets to the RPC while we're copying these out.
+	 * So, collect a bunch of packets to copy, then release the lock,
+	 * copy them, and reacquire the lock.
+	 */
+	while (true) {
+		struct sk_buff *skb;
+
+		if (rpc->state == RPC_DEAD) {
+			error = -EINVAL;
+			break;
+		}
+
+		skb = __skb_dequeue(&rpc->msgin.packets);
+		if (skb) {
+			skbs[n] = skb;
+			n++;
+			if (n < MAX_SKBS)
+				continue;
+		}
+		if (n == 0)
+			break;
+
+		/* At this point we've collected a batch of packets (or
+		 * run out of packets); copy any available packets out to
+		 * user space.
+		 */
+		atomic_or(RPC_COPYING_TO_USER, &rpc->flags);
+		homa_rpc_unlock(rpc);
+
+		/* Each iteration of this loop copies out one skb. */
+		for (i = 0; i < n; i++) {
+			struct data_header *h = (struct data_header *)
+					skbs[i]->data;
+			int pkt_length = homa_data_len(skbs[i]);
+			int offset = ntohl(h->seg.offset);
+			int buf_bytes, chunk_size;
+			struct iov_iter iter;
+			int copied = 0;
+			char *dst;
+
+			/* Each iteration of this loop copies to one
+			 * user buffer.
+			 */
+			while (copied < pkt_length) {
+				chunk_size = pkt_length - copied;
+				dst = homa_pool_get_buffer(rpc, offset + copied,
+							   &buf_bytes);
+				if (buf_bytes < chunk_size) {
+					if (buf_bytes == 0)
+						/* skb has data beyond message
+						 * end?
+						 */
+						break;
+					chunk_size = buf_bytes;
+				}
+				error = import_ubuf(READ, dst, chunk_size,
+						    &iter);
+				if (error)
+					goto free_skbs;
+				error = skb_copy_datagram_iter(skbs[i],
+							       sizeof(*h) + copied,
+							       &iter,
+							       chunk_size);
+				if (error)
+					goto free_skbs;
+				copied += chunk_size;
+			}
+		}
+
+free_skbs:
+		for (i = 0; i < n; i++)
+			kfree_skb(skbs[i]);
+		n = 0;
+		atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+		homa_rpc_lock(rpc, "homa_copy_to_user");
+		atomic_andnot(APP_NEEDS_LOCK | RPC_COPYING_TO_USER, &rpc->flags);
+		if (error)
+			break;
+	}
+	return error;
+}
+
+/**
+ * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
+ * all related to the same RPC.
+ * @skb:       First packet in the batch, linked through skb->next.
+ * @homa:      Overall information about the Homa transport.
+ */
+void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
+{
+#ifdef __UNIT_TEST__
+#define MAX_ACKS 2
+#else
+#define MAX_ACKS 10
+#endif
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	struct data_header *h = (struct data_header *)skb->data;
+	__u64 id = homa_local_id(h->common.sender_id);
+	int dport = ntohs(h->common.dport);
+
+	/* Used to collect acks from data packets so we can process them
+	 * all at the end (can't process them inline because that may
+	 * require locking conflicting RPCs). If we run out of space just
+	 * ignore the extra acks; they'll be regenerated later through the
+	 * explicit mechanism.
+	 */
+	struct homa_ack acks[MAX_ACKS];
+	struct homa_rpc *rpc = NULL;
+	struct homa_sock *hsk;
+	struct sk_buff *next;
+	int num_acks = 0;
+
+	/* Find the appropriate socket.*/
+	hsk = homa_sock_find(homa->port_map, dport);
+	if (!hsk) {
+		if (skb_is_ipv6(skb))
+			icmp6_send(skb, ICMPV6_DEST_UNREACH,
+				   ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
+		else
+			icmp_send(skb, ICMP_DEST_UNREACH,
+				  ICMP_PORT_UNREACH, 0);
+		while (skb) {
+			next = skb->next;
+			kfree_skb(skb);
+			skb = next;
+		}
+		return;
+	}
+
+	/* Each iteration through the following loop processes one packet. */
+	for (; skb; skb = next) {
+		h = (struct data_header *)skb->data;
+		next = skb->next;
+
+		/* Relinquish the RPC lock temporarily if it's needed
+		 * elsewhere.
+		 */
+		if (rpc) {
+			int flags = atomic_read(&rpc->flags);
+
+			if (flags & APP_NEEDS_LOCK) {
+				homa_rpc_unlock(rpc);
+				homa_spin(200);
+				rpc = NULL;
+			}
+		}
+
+		/* Find and lock the RPC if we haven't already done so. */
+		if (!rpc) {
+			if (!homa_is_client(id)) {
+				/* We are the server for this RPC. */
+				if (h->common.type == DATA) {
+					int created;
+
+					/* Create a new RPC if one doesn't
+					 * already exist.
+					 */
+					rpc = homa_rpc_new_server(hsk, &saddr,
+								  h, &created);
+					if (IS_ERR(rpc)) {
+						pr_warn("homa_pkt_dispatch couldn't create server rpc: error %lu",
+							-PTR_ERR(rpc));
+						rpc = NULL;
+						goto discard;
+					}
+				} else {
+					rpc = homa_find_server_rpc(hsk, &saddr,
+								   ntohs(h->common.sport),
+								   id);
+				}
+			} else {
+				rpc = homa_find_client_rpc(hsk, id);
+			}
+		}
+		if (unlikely(!rpc)) {
+			if (h->common.type != NEED_ACK &&
+			    h->common.type != ACK && h->common.type != RESEND)
+				goto discard;
+		} else {
+			if (h->common.type == DATA || h->common.type == BUSY ||
+			    h->common.type == NEED_ACK)
+				rpc->silent_ticks = 0;
+			rpc->peer->outstanding_resends = 0;
+		}
+
+		switch (h->common.type) {
+		case DATA:
+			if (h->ack.client_id) {
+				/* Save the ack for processing later, when we
+				 * have released the RPC lock.
+				 */
+				if (num_acks < MAX_ACKS) {
+					acks[num_acks] = h->ack;
+					num_acks++;
+				}
+			}
+			homa_data_pkt(skb, rpc);
+			break;
+		case RESEND:
+			homa_resend_pkt(skb, rpc, hsk);
+			break;
+		case UNKNOWN:
+			homa_unknown_pkt(skb, rpc);
+			break;
+		case BUSY:
+			/* Nothing to do for these packets except reset
+			 * silent_ticks, which happened above.
+			 */
+			goto discard;
+		case NEED_ACK:
+			homa_need_ack_pkt(skb, hsk, rpc);
+			break;
+		case ACK:
+			homa_ack_pkt(skb, hsk, rpc);
+			rpc = NULL;
+
+			/* It isn't safe to process more packets once we've
+			 * released the RPC lock (this should never happen).
+			 */
+			BUG_ON(next);
+			break;
+		default:
+			goto discard;
+		}
+		continue;
+
+discard:
+		kfree_skb(skb);
+	}
+	if (rpc)
+		homa_rpc_unlock(rpc);
+
+	while (num_acks > 0) {
+		num_acks--;
+		homa_rpc_acked(hsk, &saddr, &acks[num_acks]);
+	}
+
+	if (hsk->dead_skbs >= 2 * hsk->homa->dead_buffs_limit)
+		/* We get here if neither homa_wait_for_message
+		 * nor homa_timer can keep up with reaping dead
+		 * RPCs. See reap.txt for details.
+		 */
+
+		homa_rpc_reap(hsk, hsk->homa->reap_limit);
+}
+
+/**
+ * homa_data_pkt() - Handler for incoming DATA packets
+ * @skb:     Incoming packet; size known to be large enough for the header.
+ *           This function now owns the packet.
+ * @rpc:     Information about the RPC corresponding to this packet.
+ *           Must be locked by the caller.
+ */
+void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+	struct data_header *h = (struct data_header *)skb->data;
+
+	if (rpc->state != RPC_INCOMING && homa_is_client(rpc->id)) {
+		if (unlikely(rpc->state != RPC_OUTGOING))
+			goto discard;
+		rpc->state = RPC_INCOMING;
+		if (homa_message_in_init(rpc, ntohl(h->message_length)) != 0)
+			goto discard;
+	} else if (rpc->state != RPC_INCOMING) {
+		/* Must be server; note that homa_rpc_new_server already
+		 * initialized msgin and allocated buffers.
+		 */
+		if (unlikely(rpc->msgin.length >= 0))
+			goto discard;
+	}
+
+	if (rpc->msgin.num_bpages == 0)
+		/* Drop packets that arrive when we can't allocate buffer
+		 * space. If we keep them around, packet buffer usage can
+		 * exceed available cache space, resulting in poor
+		 * performance.
+		 */
+		goto discard;
+
+	homa_add_packet(rpc, skb);
+
+	if (skb_queue_len(&rpc->msgin.packets) != 0 &&
+	    !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
+		atomic_or(RPC_PKTS_READY, &rpc->flags);
+		homa_sock_lock(rpc->hsk, "homa_data_pkt");
+		homa_rpc_handoff(rpc);
+		homa_sock_unlock(rpc->hsk);
+	}
+	return;
+
+discard:
+	kfree_skb(skb);
+	UNIT_LOG("; ", "homa_data_pkt discarded packet");
+}
+
+/**
+ * homa_resend_pkt() - Handler for incoming RESEND packets
+ * @skb:     Incoming packet; size already verified large enough for header.
+ *           This function now owns the packet.
+ * @rpc:     Information about the RPC corresponding to this packet; must
+ *           be locked by caller, but may be NULL if there is no RPC matching
+ *           this packet
+ * @hsk:     Socket on which the packet was received.
+ */
+void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+		     struct homa_sock *hsk)
+{
+	struct resend_header *h = (struct resend_header *)skb->data;
+	struct busy_header busy;
+
+	if (!rpc) {
+		homa_xmit_unknown(skb, hsk);
+		goto done;
+	}
+
+	if (!homa_is_client(rpc->id) && rpc->state != RPC_OUTGOING) {
+		/* We are the server for this RPC and don't yet have a
+		 * response packet, so just send BUSY.
+		 */
+		homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+		goto done;
+	}
+	if (ntohl(h->length) == 0)
+		/* This RESEND is from a server just trying to determine
+		 * whether the client still cares about the RPC; return
+		 * BUSY so the server doesn't time us out.
+		 */
+		homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+	homa_resend_data(rpc, ntohl(h->offset),
+			 ntohl(h->offset) + ntohl(h->length));
+
+done:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_unknown_pkt() - Handler for incoming UNKNOWN packets.
+ * @skb:     Incoming packet; size known to be large enough for the header.
+ *           This function now owns the packet.
+ * @rpc:     Information about the RPC corresponding to this packet.
+ */
+void homa_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+	if (homa_is_client(rpc->id)) {
+		if (rpc->state == RPC_OUTGOING) {
+			/* It appears that everything we've already transmitted
+			 * has been lost; retransmit it.
+			 */
+			homa_resend_data(rpc, 0, rpc->msgout.next_xmit_offset);
+			goto done;
+		}
+
+	} else {
+		homa_rpc_free(rpc);
+	}
+done:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_need_ack_pkt() - Handler for incoming NEED_ACK packets
+ * @skb:     Incoming packet; size already verified large enough for header.
+ *           This function now owns the packet.
+ * @hsk:     Socket on which the packet was received.
+ * @rpc:     The RPC named in the packet header, or NULL if no such
+ *           RPC exists. The RPC has been locked by the caller.
+ */
+void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+		       struct homa_rpc *rpc)
+{
+	struct common_header *h = (struct common_header *)skb->data;
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	__u64 id = homa_local_id(h->sender_id);
+	struct homa_peer *peer;
+	struct ack_header ack;
+
+	/* Return if it's not safe for the peer to purge its state
+	 * for this RPC (the RPC still exists and we haven't received
+	 * the entire response), or if we can't find peer info.
+	 */
+	if (rpc && (rpc->state != RPC_INCOMING ||
+		    rpc->msgin.bytes_remaining)) {
+		goto done;
+	} else {
+		peer = homa_peer_find(hsk->homa->peers, &saddr, &hsk->inet);
+		if (IS_ERR(peer))
+			goto done;
+	}
+
+	/* Send an ACK for this RPC. At the same time, include all of the
+	 * other acks available for the peer. Note: can't use rpc below,
+	 * since it may be NULL.
+	 */
+	ack.common.type = ACK;
+	ack.common.sport = h->dport;
+	ack.common.dport = h->sport;
+	ack.common.sender_id = cpu_to_be64(id);
+	ack.num_acks = htons(homa_peer_get_acks(peer,
+						HOMA_MAX_ACKS_PER_PKT,
+						ack.acks));
+	__homa_xmit_control(&ack, sizeof(ack), peer, hsk);
+
+done:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_ack_pkt() - Handler for incoming ACK packets
+ * @skb:     Incoming packet; size already verified large enough for header.
+ *           This function now owns the packet.
+ * @hsk:     Socket on which the packet was received.
+ * @rpc:     The RPC named in the packet header, or NULL if no such
+ *           RPC exists. The RPC has been locked by the caller but will
+ *           be unlocked here.
+ */
+void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+		  struct homa_rpc *rpc)
+{
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	struct ack_header *h = (struct ack_header *)skb->data;
+	int i, count;
+
+	if (rpc) {
+		homa_rpc_free(rpc);
+		homa_rpc_unlock(rpc);
+	}
+
+	count = ntohs(h->num_acks);
+	for (i = 0; i < count; i++)
+		homa_rpc_acked(hsk, &saddr, &h->acks[i]);
+	kfree_skb(skb);
+}
+
+/**
+ * homa_rpc_abort() - Terminate an RPC.
+ * @rpc:     RPC to be terminated.  Must be locked by caller.
+ * @error:   A negative errno value indicating the error that caused the abort.
+ *           If this is a client RPC, the error will be returned to the
+ *           application; if it's a server RPC, the error is ignored and
+ *           we just free the RPC.
+ */
+void homa_rpc_abort(struct homa_rpc *rpc, int error)
+{
+	if (!homa_is_client(rpc->id)) {
+		homa_rpc_free(rpc);
+		return;
+	}
+	rpc->error = error;
+	homa_sock_lock(rpc->hsk, "homa_rpc_abort");
+	if (!rpc->hsk->shutdown)
+		homa_rpc_handoff(rpc);
+	homa_sock_unlock(rpc->hsk);
+}
+
+/**
+ * homa_abort_rpcs() - Abort all RPCs to/from a particular peer.
+ * @homa:    Overall data about the Homa protocol implementation.
+ * @addr:    Address (network order) of the destination whose RPCs are
+ *           to be aborted.
+ * @port:    If nonzero, then RPCs will only be aborted if they were
+ *	     targeted at this server port.
+ * @error:   Negative errno value indicating the reason for the abort.
+ */
+void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+		     int port, int error)
+{
+	struct homa_socktab_scan scan;
+	struct homa_rpc *rpc, *tmp;
+	struct homa_sock *hsk;
+
+	rcu_read_lock();
+	for (hsk = homa_socktab_start_scan(homa->port_map, &scan); hsk;
+	     hsk = homa_socktab_next(&scan)) {
+		/* Skip the (expensive) lock acquisition if there's no
+		 * work to do.
+		 */
+		if (list_empty(&hsk->active_rpcs))
+			continue;
+		if (!homa_protect_rpcs(hsk))
+			continue;
+		list_for_each_entry_safe(rpc, tmp, &hsk->active_rpcs,
+					 active_links) {
+			if (!ipv6_addr_equal(&rpc->peer->addr, addr))
+				continue;
+			if (port && rpc->dport != port)
+				continue;
+			homa_rpc_lock(rpc, "rpc_abort_rpcs");
+			homa_rpc_abort(rpc, error);
+			homa_rpc_unlock(rpc);
+		}
+		homa_unprotect_rpcs(hsk);
+	}
+	rcu_read_unlock();
+}
+
+/**
+ * homa_abort_sock_rpcs() - Abort all outgoing (client-side) RPCs on a given
+ * socket.
+ * @hsk:         Socket whose RPCs should be aborted.
+ * @error:       Zero means that the aborted RPCs should be freed immediately.
+ *               A nonzero value means that the RPCs should be marked
+ *               complete, so that they can be returned to the application;
+ *               this value (a negative errno) will be returned from
+ *               recvmsg.
+ */
+void homa_abort_sock_rpcs(struct homa_sock *hsk, int error)
+{
+	struct homa_rpc *rpc, *tmp;
+
+	rcu_read_lock();
+	if (list_empty(&hsk->active_rpcs))
+		goto done;
+	if (!homa_protect_rpcs(hsk))
+		goto done;
+	list_for_each_entry_safe(rpc, tmp, &hsk->active_rpcs, active_links) {
+		if (!homa_is_client(rpc->id))
+			continue;
+		homa_rpc_lock(rpc, "homa_abort_sock_rpcs");
+		if (rpc->state == RPC_DEAD) {
+			homa_rpc_unlock(rpc);
+			continue;
+		}
+		if (error)
+			homa_rpc_abort(rpc, error);
+		else
+			homa_rpc_free(rpc);
+		homa_rpc_unlock(rpc);
+	}
+	homa_unprotect_rpcs(hsk);
+done:
+	rcu_read_unlock();
+}
+
+/**
+ * homa_register_interests() - Records information in various places so
+ * that a thread will be woken up if an RPC that it cares about becomes
+ * available.
+ * @interest:     Used to record information about the messages this thread is
+ *                waiting on. The initial contents of the structure are
+ *                assumed to be undefined.
+ * @hsk:          Socket on which relevant messages will arrive.  Must not be
+ *                locked.
+ * @flags:        Flags field from homa_recvmsg_args; see manual entry for
+ *                details.
+ * @id:           If non-zero, then the caller is interested in receiving
+ *                the response for this RPC (@id must be a client request).
+ * Return:        Either zero or a negative errno value. If a matching RPC
+ *                is already available, information about it will be stored in
+ *                interest.
+ */
+int homa_register_interests(struct homa_interest *interest,
+			    struct homa_sock *hsk, int flags, __u64 id)
+{
+	struct homa_rpc *rpc = NULL;
+
+	homa_interest_init(interest);
+	interest->locked = 1;
+	if (id != 0) {
+		if (!homa_is_client(id))
+			return -EINVAL;
+		rpc = homa_find_client_rpc(hsk, id);
+		if (!rpc)
+			return -EINVAL;
+		if (rpc->interest && rpc->interest != interest) {
+			homa_rpc_unlock(rpc);
+			return -EINVAL;
+		}
+	}
+
+	/* Need both the RPC lock (acquired above) and the socket lock to
+	 * avoid races.
+	 */
+	homa_sock_lock(hsk, "homa_register_interests");
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		if (rpc)
+			homa_rpc_unlock(rpc);
+		return -ESHUTDOWN;
+	}
+
+	if (id != 0) {
+		if ((atomic_read(&rpc->flags) & RPC_PKTS_READY) || rpc->error)
+			goto claim_rpc;
+		rpc->interest = interest;
+		interest->reg_rpc = rpc;
+		homa_rpc_unlock(rpc);
+	}
+
+	interest->locked = 0;
+	if (flags & HOMA_RECVMSG_RESPONSE) {
+		if (!list_empty(&hsk->ready_responses)) {
+			rpc = list_first_entry(&hsk->ready_responses,
+					       struct homa_rpc,
+					       ready_links);
+			goto claim_rpc;
+		}
+		/* Insert this thread at the *front* of the list;
+		 * we'll get better cache locality if we reuse
+		 * the same thread over and over, rather than
+		 * round-robining between threads.  Same below.
+		 */
+		list_add(&interest->response_links,
+			 &hsk->response_interests);
+	}
+	if (flags & HOMA_RECVMSG_REQUEST) {
+		if (!list_empty(&hsk->ready_requests)) {
+			rpc = list_first_entry(&hsk->ready_requests,
+					       struct homa_rpc, ready_links);
+			/* Make sure the interest isn't on the response list;
+			 * otherwise it might receive a second RPC.
+			 */
+			if (interest->response_links.next != LIST_POISON1)
+				list_del(&interest->response_links);
+			goto claim_rpc;
+		}
+		list_add(&interest->request_links, &hsk->request_interests);
+	}
+	homa_sock_unlock(hsk);
+	return 0;
+
+claim_rpc:
+	list_del_init(&rpc->ready_links);
+	if (!list_empty(&hsk->ready_requests) ||
+	    !list_empty(&hsk->ready_responses)) {
+		// There are still more RPCs available, so let Linux know.
+		hsk->sock.sk_data_ready(&hsk->sock);
+	}
+
+	/* This flag is needed to keep the RPC from being reaped during the
+	 * gap between when we release the socket lock and we acquire the
+	 * RPC lock.
+	 */
+	atomic_or(RPC_HANDING_OFF, &rpc->flags);
+	homa_sock_unlock(hsk);
+	if (!interest->locked) {
+		atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+		homa_rpc_lock(rpc, "homa_register_interests");
+		atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+		interest->locked = 1;
+	}
+	atomic_andnot(RPC_HANDING_OFF, &rpc->flags);
+	atomic_long_set_release(&interest->ready_rpc, (long)rpc);
+	return 0;
+}
+
+/**
+ * homa_wait_for_message() - Wait for receipt of an incoming message
+ * that matches the parameters. Various other activities can occur while
+ * waiting, such as reaping dead RPCs and copying data to user space.
+ * @hsk:          Socket where messages will arrive.
+ * @flags:        Flags field from homa_recvmsg_args; see manual entry for
+ *                details.
+ * @id:           If non-zero, then a response message matching this id may
+ *                be returned (@id must refer to a client request).
+ *
+ * Return:   Pointer to an RPC that matches @flags and @id, or a negative
+ *           errno value. The RPC will be locked; the caller must unlock.
+ */
+struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
+				       __u64 id)
+{
+	int error;
+	struct homa_rpc *result = NULL;
+	struct homa_interest interest;
+	struct homa_rpc *rpc = NULL;
+
+	/* Each iteration of this loop finds an RPC, but it might not be
+	 * in a state where we can return it (e.g., there might be packets
+	 * ready to transfer to user space, but the incoming message isn't yet
+	 * complete). Thus it could take many iterations of this loop
+	 * before we have an RPC with a complete message.
+	 */
+	while (1) {
+		error = homa_register_interests(&interest, hsk, flags, id);
+		rpc = (struct homa_rpc *)atomic_long_read(&interest.ready_rpc);
+		if (rpc)
+			goto found_rpc;
+		if (error < 0) {
+			result = ERR_PTR(error);
+			goto found_rpc;
+		}
+
+//		tt_record3("Preparing to poll, socket %d, flags 0x%x, pid %d",
+//				hsk->client_port, flags, current->pid);
+
+		/* There is no ready RPC so far. Clean up dead RPCs before
+		 * going to sleep (or returning, if in nonblocking mode).
+		 */
+		while (1) {
+			int reaper_result;
+
+			rpc = (struct homa_rpc *)atomic_long_read(&interest
+								  .ready_rpc);
+			if (rpc)
+				goto found_rpc;
+			reaper_result = homa_rpc_reap(hsk,
+						      hsk->homa->reap_limit);
+			if (reaper_result == 0)
+				break;
+
+			/* Give NAPI and SoftIRQ tasks a chance to run. */
+			schedule();
+		}
+		if (flags & HOMA_RECVMSG_NONBLOCKING) {
+			result = ERR_PTR(-EAGAIN);
+			goto found_rpc;
+		}
+
+		/* Now it's time to sleep. */
+		set_current_state(TASK_INTERRUPTIBLE);
+		rpc = (struct homa_rpc *)atomic_long_read(&interest.ready_rpc);
+		if (!rpc && !hsk->shutdown)
+			schedule();
+		__set_current_state(TASK_RUNNING);
+
+found_rpc:
+		/* If we get here, it means either an RPC is ready for our
+		 * attention or an error occurred.
+		 *
+		 * First, clean up all of the interests. Must do this before
+		 * making any other decisions, because until we do, an incoming
+		 * message could still be passed to us. Note: if we went to
+		 * sleep, then this info was already cleaned up by whoever
+		 * woke us up. Also, values in the interest may change between
+		 * when we test them below and when we acquire the socket lock,
+		 * so they have to be checked again after locking the socket.
+		 */
+		UNIT_HOOK("found_rpc");
+		if (interest.reg_rpc ||
+		    interest.request_links.next != LIST_POISON1 ||
+		    interest.response_links.next != LIST_POISON1) {
+			homa_sock_lock(hsk, "homa_wait_for_message");
+			if (interest.reg_rpc)
+				interest.reg_rpc->interest = NULL;
+			if (interest.request_links.next != LIST_POISON1)
+				list_del(&interest.request_links);
+			if (interest.response_links.next != LIST_POISON1)
+				list_del(&interest.response_links);
+			homa_sock_unlock(hsk);
+		}
+
+		/* Now check to see if we received an RPC handoff (note that
+		 * this could have happened anytime up until we reset the
+		 * interests above).
+		 */
+		rpc = (struct homa_rpc *)atomic_long_read(&interest.ready_rpc);
+		if (rpc) {
+			if (!interest.locked) {
+				atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+				homa_rpc_lock(rpc, "homa_wait_for_message");
+				atomic_andnot(APP_NEEDS_LOCK | RPC_HANDING_OFF,
+					      &rpc->flags);
+			} else {
+				atomic_andnot(RPC_HANDING_OFF, &rpc->flags);
+			}
+			if (!rpc->error)
+				rpc->error = homa_copy_to_user(rpc);
+			if (rpc->state == RPC_DEAD) {
+				homa_rpc_unlock(rpc);
+				continue;
+			}
+			if (rpc->error)
+				goto done;
+			atomic_andnot(RPC_PKTS_READY, &rpc->flags);
+			if (rpc->msgin.bytes_remaining == 0 &&
+			    !skb_queue_len(&rpc->msgin.packets))
+				goto done;
+			homa_rpc_unlock(rpc);
+		}
+
+		/* A complete message isn't available: check for errors. */
+		if (IS_ERR(result))
+			return result;
+		if (signal_pending(current))
+			return ERR_PTR(-EINTR);
+
+		/* No message and no error; try again. */
+	}
+
+done:
+	return rpc;
+}
+
+/**
+ * homa_choose_interest() - Given a list of interests for an incoming
+ * message, choose the best one to handle it (if any).
+ * @homa:        Overall information about the Homa transport.
+ * @head:        Head pointers for the list of interest: either
+ *		 hsk->request_interests or hsk->response_interests.
+ * @offset:      Offset of "next" pointers in the list elements (either
+ *               offsetof(request_links) or offsetof(response_links).
+ * Return:       An interest to use for the incoming message, or NULL if none
+ *               is available. If possible, this function tries to pick an
+ *               interest whose thread is running on a core that isn't
+ *               currently busy doing Homa transport work.
+ */
+struct homa_interest *homa_choose_interest(struct homa *homa,
+					   struct list_head *head, int offset)
+{
+	struct homa_interest *backup = NULL;
+	struct homa_interest *interest;
+	struct list_head *pos;
+
+	list_for_each(pos, head) {
+		interest = (struct homa_interest *)(((char *)pos) - offset);
+		if (!backup)
+			backup = interest;
+	}
+
+	/* All interested threads are on busy cores; return the first. */
+	return backup;
+}
+
+/**
+ * homa_rpc_handoff() - This function is called when the input message for
+ * an RPC is ready for attention from a user thread. It either notifies
+ * a waiting reader or queues the RPC.
+ * @rpc:                RPC to handoff; must be locked. The caller must
+ *			also have locked the socket for this RPC.
+ */
+void homa_rpc_handoff(struct homa_rpc *rpc)
+{
+	struct homa_sock *hsk = rpc->hsk;
+	struct homa_interest *interest;
+
+	if ((atomic_read(&rpc->flags) & RPC_HANDING_OFF) ||
+	    !list_empty(&rpc->ready_links))
+		return;
+
+	/* First, see if someone is interested in this RPC specifically.
+	 */
+	if (rpc->interest) {
+		interest = rpc->interest;
+		goto thread_waiting;
+	}
+
+	/* Second, check the interest list for this type of RPC. */
+	if (homa_is_client(rpc->id)) {
+		interest = homa_choose_interest(hsk->homa,
+						&hsk->response_interests,
+						offsetof(struct homa_interest,
+							 response_links));
+		if (interest)
+			goto thread_waiting;
+		list_add_tail(&rpc->ready_links, &hsk->ready_responses);
+	} else {
+		interest = homa_choose_interest(hsk->homa,
+						&hsk->request_interests,
+						offsetof(struct homa_interest,
+							 request_links));
+		if (interest)
+			goto thread_waiting;
+		list_add_tail(&rpc->ready_links, &hsk->ready_requests);
+	}
+
+	/* If we get here, no-one is waiting for the RPC, so it has been
+	 * queued.
+	 */
+
+	/* Notify the poll mechanism. */
+	hsk->sock.sk_data_ready(&hsk->sock);
+	return;
+
+thread_waiting:
+	/* We found a waiting thread. The following 3 lines must be here,
+	 * before clearing the interest, in order to avoid a race with
+	 * homa_wait_for_message (which won't acquire the socket lock if
+	 * the interest is clear).
+	 */
+	atomic_or(RPC_HANDING_OFF, &rpc->flags);
+	interest->locked = 0;
+	atomic_long_set_release(&interest->ready_rpc, (long)rpc);
+
+	/* Clear the interest. This serves two purposes. First, it saves
+	 * the waking thread from acquiring the socket lock again, which
+	 * reduces contention on that lock). Second, it ensures that
+	 * no-one else attempts to give this interest a different RPC.
+	 */
+	if (interest->reg_rpc) {
+		interest->reg_rpc->interest = NULL;
+		interest->reg_rpc = NULL;
+	}
+	if (interest->request_links.next != LIST_POISON1)
+		list_del(&interest->request_links);
+	if (interest->response_links.next != LIST_POISON1)
+		list_del(&interest->response_links);
+	wake_up_process(interest->thread);
+}
+
+/**
+ * homa_incoming_sysctl_changed() - Invoked whenever a sysctl value is changed;
+ * any input-related parameters that depend on sysctl-settable values.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_incoming_sysctl_changed(struct homa *homa)
+{
+	__u64 tmp;
+
+	/* Code below is written carefully to avoid integer underflow or
+	 * overflow under expected usage patterns. Be careful when changing!
+	 */
+
+	tmp = homa->busy_usecs;
+	tmp = (tmp * cpu_khz) / 1000;
+	homa->busy_cycles = tmp;
+
+	tmp = homa->gro_busy_usecs;
+	tmp = (tmp * cpu_khz) / 1000;
+	homa->gro_busy_cycles = tmp;
+
+	tmp = homa->bpage_lease_usecs;
+	tmp = (tmp * cpu_khz) / 1000;
+	homa->bpage_lease_cycles = tmp;
+}

From patchwork Mon Oct 28 21:35:36 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854180
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 879631EE005
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:23 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151387; cv=none;
 b=Fkpw1aiJRjC470jN5WdGHeDBGqYv9gOg/z/q1UG23UDO0azcJFjZmOQXwL1Nbv+SVrnu+jRIKqg34Y+0dOdwGao+oX70PKZK9Fzi5xNahhwnql2K7mRTJRqOOjMk0aftr/sLWku9uLhxPjUC9CsnZuE6IISULY9AP65akeypRBA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151387; c=relaxed/simple;
	bh=5sjahoeGVYVn7mj2GITT5+TyUoZPYFmECkYeFxh0VIU=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=ENQqTV2DlrAcgKfqLr2a8pM0LE98AMFvtQ7YCjJJzPxG99blaQfZT/mwD1T1EuoXumPmpa7VpAIzZ8FrpSL4xtZCnyKTQdq1gZ+EUnEF9PeeNUj42xj6LlNrlxiy1LoysWGGuBtwQ1/KI53xhaLOLXpIgJoqYo30Dc2MkfUtLwA=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=BCTshAuZ; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="BCTshAuZ"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=ZnGSEUK3G0NiTty4o8p8iPQRf+0VlNIrWQea/nv4Vsg=; t=1730151383; x=1731015383;
	b=BCTshAuZ90HhCZy86SrweKF+y7C75wyisNIkuwlKxDJlwEa5j9R8FyBHNaUlK3Js1hbKtReHYTw
	b5eJuWLmPERd3RFHSYHhgNYlGYAx+Be7MhYAqJrFZwEJ++npjPsOMqLGp66KEgGJUIAyOKGs4r8Li
	TnsijLQDFxm+BJF0sny5s6tK8Q4qxv0iTYMHUMh54HG47rCv5HA5B8C4xEruU12tFf587XKmI+N7R
	KbNaXuXO9GY3VA51otUogdF6m/7yM0lhnPK69zf45VYOfcjzKRxPwi823vbP2KdeYiNxQb5cqjaEm
	r+GamG7lgGGgVs5XN40sJ2Zc4GWzUtFwXV2Q==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPR-0005xj-Mt; Mon, 28 Oct 2024 14:36:22 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 09/12] net: homa: create homa_outgoing.c
Date: Mon, 28 Oct 2024 14:35:36 -0700
Message-ID: <20241028213541.1529-10-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 27c13d64e41639afc1d27d74b763b79e
X-Patchwork-Delegate: kuba@kernel.org

This file does most of the work of transmitting outgoing messages.
It is responsible for copying data from user space into skbs and
it also implements the "pacer", which throttles output if necessary
to prevent queue buildup in the NIC. Note: the pacer eventually
needs to be replaced with a Homa-specific qdisc, which can better
manage simultaneous transmissions by Homa and TCP. The current
implementation can coexist with TCP and doesn't harm TCP, but
Homa's latency suffers when TCP runs concurrently.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_outgoing.c | 857 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 857 insertions(+)
 create mode 100644 net/homa/homa_outgoing.c

diff --git a/net/homa/homa_outgoing.c b/net/homa/homa_outgoing.c
new file mode 100644
index 000000000000..7c8f5c21b041
--- /dev/null
+++ b/net/homa/homa_outgoing.c
@@ -0,0 +1,857 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions related to the sender side of message
+ * transmission. It also contains utility functions for sending packets.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+#include "homa_wire.h"
+
+/**
+ * homa_message_out_init() - Initialize rpc->msgout.
+ * @rpc:       RPC whose output message should be initialized.
+ * @length:    Number of bytes that will eventually be in rpc->msgout.
+ */
+void homa_message_out_init(struct homa_rpc *rpc, int length)
+{
+	rpc->msgout.length = length;
+	rpc->msgout.num_skbs = 0;
+	rpc->msgout.copied_from_user = 0;
+	rpc->msgout.packets = NULL;
+	rpc->msgout.next_xmit = &rpc->msgout.packets;
+	rpc->msgout.next_xmit_offset = 0;
+	atomic_set(&rpc->msgout.active_xmits, 0);
+	rpc->msgout.init_cycles = get_cycles();
+}
+
+/**
+ * homa_fill_data_interleaved() - This function is invoked to fill in the
+ * part of a data packet after the initial header, when GSO is being used.
+ * As a result, seg_headers must be interleaved with the data to provide the
+ * correct offset for each segment.
+ * @rpc:            RPC whose output message is being created.
+ * @skb:            The packet being filled. The initial data_header was
+ *                  created and initialized by the caller and the
+ *                  homa_skb_info has been filled in with the packet geometry.
+ * @iter:           Describes location(s) of (remaining) message data in user
+ *                  space.
+ */
+int homa_fill_data_interleaved(struct homa_rpc *rpc, struct sk_buff *skb,
+			       struct iov_iter *iter)
+{
+	struct homa_skb_info *homa_info = homa_get_skb_info(skb);
+	int seg_length = homa_info->seg_length;
+	int bytes_left = homa_info->data_bytes;
+	int offset = homa_info->offset;
+	int err;
+
+	/* Each iteration of the following loop adds info for one packet,
+	 * which includes a seg_header followed by the data for that
+	 * segment. The first seg_header was already added by the caller.
+	 */
+	while (1) {
+		struct seg_header seg;
+
+		if (bytes_left < seg_length)
+			seg_length = bytes_left;
+		err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+						seg_length);
+		if (err != 0)
+			return err;
+		bytes_left -= seg_length;
+		offset += seg_length;
+
+		if (bytes_left == 0)
+			break;
+
+		seg.offset = htonl(offset);
+		err = homa_skb_append_to_frag(rpc->hsk->homa, skb, &seg,
+					      sizeof(seg));
+		if (err != 0)
+			return err;
+	}
+	return 0;
+}
+
+/**
+ * homa_new_data_packet() - Allocate a new sk_buff and fill it with a Homa
+ * data packet. The resulting packet will be a GSO packet that will eventually
+ * be segmented by the NIC.
+ * @rpc:          RPC that packet will belong to (msgout must have been
+ *                initialized).
+ * @iter:         Describes location(s) of (remaining) message data in user
+ *                space.
+ * @offset:       Offset in the message of the first byte of data in this
+ *                packet.
+ * @length:       How many bytes of data to include in the skb. Caller must
+ *                ensure that this amount of data isn't too much for a
+ *                well-formed GSO packet, and that iter has at least this
+ *                much data.
+ * @max_seg_data: Maximum number of bytes of message data that can go in
+ *                a single segment of the GSO packet.
+ * Return: A pointer to the new packet, or a negative errno.
+ */
+struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
+				     struct iov_iter *iter, int offset,
+				     int length, int max_seg_data)
+{
+	struct homa_skb_info *homa_info;
+	int segs, err, gso_size;
+	struct data_header *h;
+	struct sk_buff *skb;
+
+	/* Initialize the overall skb. */
+	segs = (length + max_seg_data - 1) / max_seg_data;
+	skb = homa_skb_new_tx(sizeof32(*h) + length
+			+ segs * sizeof32(struct seg_header));
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	/* Fill in the Homa header (which will be replicated in every
+	 * network packet by GSO).
+	 */
+	h = (struct data_header *)skb_put(skb, sizeof(struct data_header));
+	h->common.sport = htons(rpc->hsk->port);
+	h->common.dport = htons(rpc->dport);
+	h->common.sequence = htonl(offset);
+	h->common.type = DATA;
+	homa_set_doff(h, sizeof(struct data_header));
+	h->common.checksum = 0;
+	h->common.sender_id = cpu_to_be64(rpc->id);
+	h->message_length = htonl(rpc->msgout.length);
+	h->incoming = h->message_length;
+	h->ack.client_id = 0;
+	homa_peer_get_acks(rpc->peer, 1, &h->ack);
+	h->retransmit = 0;
+	h->seg.offset = htonl(offset);
+
+	homa_info = homa_get_skb_info(skb);
+	homa_info->next_skb = NULL;
+	homa_info->wire_bytes = length + segs * (sizeof(struct data_header)
+			+  rpc->hsk->ip_header_length + HOMA_ETH_OVERHEAD);
+	homa_info->data_bytes = length;
+	homa_info->seg_length = max_seg_data;
+	homa_info->offset = offset;
+
+	if (segs > 1) {
+		homa_set_doff(h, sizeof(struct data_header)  -
+				sizeof32(struct seg_header));
+		gso_size = max_seg_data + sizeof(struct seg_header);
+		err = homa_fill_data_interleaved(rpc, skb, iter);
+	} else {
+		gso_size = max_seg_data;
+		err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+						length);
+	}
+	if (err)
+		goto error;
+
+	if (segs > 1) {
+		skb_shinfo(skb)->gso_segs = segs;
+		skb_shinfo(skb)->gso_size = gso_size;
+
+		/* It's unclear what gso_type should be used to force software
+		 * GSO; the value below seems to work...
+		 */
+		skb_shinfo(skb)->gso_type =
+		    rpc->hsk->homa->gso_force_software ? 0xd : SKB_GSO_TCPV6;
+	}
+	return skb;
+
+error:
+	homa_skb_free_tx(rpc->hsk->homa, skb);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_message_out_fill() - Initializes information for sending a message
+ * for an RPC (either request or response); copies the message data from
+ * user space and (possibly) begins transmitting the message.
+ * @rpc:     RPC for which to send message; this function must not
+ *           previously have been called for the RPC. Must be locked. The RPC
+ *           will be unlocked while copying data, but will be locked again
+ *           before returning.
+ * @iter:    Describes location(s) of message data in user space.
+ * @xmit:    Nonzero means this method should start transmitting packets;
+ *           transmission will be overlapped with copying from user space.
+ *           Zero means the caller will initiate transmission after this
+ *           function returns.
+ *
+ * Return:   0 for success, or a negative errno for failure. It is possible
+ *           for the RPC to be freed while this function is active. If that
+ *           happens, copying will cease, -EINVAL will be returned, and
+ *           rpc->state will be RPC_DEAD.
+ */
+int homa_message_out_fill(struct homa_rpc *rpc, struct iov_iter *iter, int xmit)
+{
+	/* Geometry information for packets:
+	 * mtu:              largest size for an on-the-wire packet (including
+	 *                   all headers through IP header, but not Ethernet
+	 *                   header).
+	 * max_seg_data:     largest amount of Homa message data that fits
+	 *                   in an on-the-wire packet (after segmentation).
+	 * max_gso_data:     largest amount of Homa message data that fits
+	 *                   in a GSO packet (before segmentation).
+	 */
+	int mtu, max_seg_data, max_gso_data;
+
+	int overlap_xmit, segs_per_gso;
+	struct sk_buff **last_link;
+	struct dst_entry *dst;
+
+	/* Bytes of the message that haven't yet been copied into skbs. */
+	int bytes_left;
+
+	int gso_size;
+	int err;
+
+	homa_message_out_init(rpc, iter->count);
+	if (unlikely(rpc->msgout.length > HOMA_MAX_MESSAGE_LENGTH ||
+		     rpc->msgout.length == 0)) {
+		err = -EINVAL;
+		goto error;
+	}
+
+	/* Compute the geometry of packets. */
+	dst = homa_get_dst(rpc->peer, rpc->hsk);
+	mtu = dst_mtu(dst);
+	max_seg_data = mtu - rpc->hsk->ip_header_length
+			- sizeof(struct data_header);
+	gso_size = dst->dev->gso_max_size;
+	if (gso_size > rpc->hsk->homa->max_gso_size)
+		gso_size = rpc->hsk->homa->max_gso_size;
+
+	/* Round gso_size down to an even # of mtus. */
+	segs_per_gso = (gso_size - rpc->hsk->ip_header_length
+			- sizeof(struct data_header)) / max_seg_data;
+	if (segs_per_gso == 0)
+		segs_per_gso = 1;
+	max_gso_data = segs_per_gso * max_seg_data;
+	UNIT_LOG("; ", "mtu %d, max_seg_data %d, max_gso_data %d",
+		 mtu, max_seg_data, max_gso_data);
+
+	overlap_xmit = rpc->msgout.length > 2 * max_gso_data;
+	atomic_or(RPC_COPYING_FROM_USER, &rpc->flags);
+	homa_skb_stash_pages(rpc->hsk->homa, rpc->msgout.length);
+
+	/* Each iteration of the loop below creates one GSO packet. */
+	last_link = &rpc->msgout.packets;
+	for (bytes_left = rpc->msgout.length; bytes_left > 0; ) {
+		int skb_data_bytes, offset;
+		struct sk_buff *skb;
+
+		homa_rpc_unlock(rpc);
+		skb_data_bytes = max_gso_data;
+		offset = rpc->msgout.length - bytes_left;
+		if (skb_data_bytes > bytes_left)
+			skb_data_bytes = bytes_left;
+		skb = homa_new_data_packet(rpc, iter, offset, skb_data_bytes,
+					   max_seg_data);
+		if (unlikely(!skb)) {
+			err = PTR_ERR(skb);
+			homa_rpc_lock(rpc, "homa_message_out_fill");
+			goto error;
+		}
+		bytes_left -= skb_data_bytes;
+
+		homa_rpc_lock(rpc, "homa_message_out_fill2");
+		if (rpc->state == RPC_DEAD) {
+			/* RPC was freed while we were copying. */
+			err = -EINVAL;
+			homa_skb_free_tx(rpc->hsk->homa, skb);
+			goto error;
+		}
+		*last_link = skb;
+		last_link = &(homa_get_skb_info(skb)->next_skb);
+		*last_link = NULL;
+		rpc->msgout.num_skbs++;
+		rpc->msgout.copied_from_user = rpc->msgout.length - bytes_left;
+		if (overlap_xmit && list_empty(&rpc->throttled_links) && xmit)
+			homa_add_to_throttled(rpc);
+	}
+	atomic_andnot(RPC_COPYING_FROM_USER, &rpc->flags);
+	if (!overlap_xmit && xmit)
+		homa_xmit_data(rpc, false);
+	return 0;
+
+error:
+	atomic_andnot(RPC_COPYING_FROM_USER, &rpc->flags);
+	return err;
+}
+
+/**
+ * homa_xmit_control() - Send a control packet to the other end of an RPC.
+ * @type:      Packet type, such as DATA.
+ * @contents:  Address of buffer containing the contents of the packet.
+ *             Only information after the common header must be valid;
+ *             the common header will be filled in by this function.
+ * @length:    Length of @contents (including the common header).
+ * @rpc:       The packet will go to the socket that handles the other end
+ *             of this RPC. Addressing info for the packet, including all of
+ *             the fields of common_header except type, will be set from this.
+ *
+ * Return:     Either zero (for success), or a negative errno value if there
+ *             was a problem.
+ */
+int homa_xmit_control(enum homa_packet_type type, void *contents,
+		      size_t length, struct homa_rpc *rpc)
+{
+	struct common_header *h = contents;
+
+	h->type = type;
+	h->sport = htons(rpc->hsk->port);
+	h->dport = htons(rpc->dport);
+	h->sender_id = cpu_to_be64(rpc->id);
+	return __homa_xmit_control(contents, length, rpc->peer, rpc->hsk);
+}
+
+/**
+ * __homa_xmit_control() - Lower-level version of homa_xmit_control: sends
+ * a control packet.
+ * @contents:  Address of buffer containing the contents of the packet.
+ *             The caller must have filled in all of the information,
+ *             including the common header.
+ * @length:    Length of @contents.
+ * @peer:      Destination to which the packet will be sent.
+ * @hsk:       Socket via which the packet will be sent.
+ *
+ * Return:     Either zero (for success), or a negative errno value if there
+ *             was a problem.
+ */
+int __homa_xmit_control(void *contents, size_t length, struct homa_peer *peer,
+			struct homa_sock *hsk)
+{
+	struct common_header *h;
+	struct dst_entry *dst;
+	struct sk_buff *skb;
+	int extra_bytes;
+	int result;
+
+	dst = homa_get_dst(peer, hsk);
+	skb = homa_skb_new_tx(HOMA_MAX_HEADER);
+	if (unlikely(!skb))
+		return -ENOBUFS;
+	dst_hold(dst);
+	skb_dst_set(skb, dst);
+
+	h = skb_put(skb, length);
+	memcpy(h, contents, length);
+	extra_bytes = HOMA_MIN_PKT_LENGTH - length;
+	if (extra_bytes > 0) {
+		memset(skb_put(skb, extra_bytes), 0, extra_bytes);
+		UNIT_LOG(",", "padded control packet with %d bytes",
+			 extra_bytes);
+	}
+	skb->ooo_okay = 1;
+	skb_get(skb);
+	if (hsk->inet.sk.sk_family == AF_INET6) {
+		result = ip6_xmit(&hsk->inet.sk, skb, &peer->flow.u.ip6, 0,
+				  NULL, 0, 0);
+	} else {
+		result = ip_queue_xmit(&hsk->inet.sk, skb, &peer->flow);
+	}
+	if (unlikely(result != 0)) {
+		/* It appears that ip*_xmit frees skbuffs after
+		 * errors; the following code is to raise an alert if
+		 * this isn't actually the case. The extra skb_get above
+		 * and kfree_skb call below are needed to do the check
+		 * accurately (otherwise the buffer could be freed and
+		 * its memory used for some other purpose, resulting in
+		 * a bogus "reference count").
+		 */
+		if (refcount_read(&skb->users) > 1) {
+			if (hsk->inet.sk.sk_family == AF_INET6)
+				pr_notice("ip6_xmit didn't free Homa control packet (type %d) after error %d\n",
+					  h->type, result);
+			else
+				pr_notice("ip_queue_xmit didn't free Homa control packet (type %d) after error %d\n",
+					  h->type, result);
+		}
+	}
+	kfree_skb(skb);
+	return result;
+}
+
+/**
+ * homa_xmit_unknown() - Send an UNKNOWN packet to a peer.
+ * @skb:         Buffer containing an incoming packet; identifies the peer to
+ *               which the UNKNOWN packet should be sent.
+ * @hsk:         Socket that should be used to send the UNKNOWN packet.
+ */
+void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk)
+{
+	struct common_header *h = (struct common_header *)skb->data;
+	struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	struct unknown_header unknown;
+	struct homa_peer *peer;
+
+	unknown.common.sport = h->dport;
+	unknown.common.dport = h->sport;
+	unknown.common.type = UNKNOWN;
+	unknown.common.sender_id = cpu_to_be64(homa_local_id(h->sender_id));
+	peer = homa_peer_find(hsk->homa->peers, &saddr, &hsk->inet);
+	if (!IS_ERR(peer))
+		__homa_xmit_control(&unknown, sizeof(unknown), peer, hsk);
+}
+
+/**
+ * homa_xmit_data() - If an RPC has outbound data packets that are permitted
+ * to be transmitted according to the scheduling mechanism, arrange for
+ * them to be sent (some may be sent immediately; others may be sent
+ * later by the pacer thread).
+ * @rpc:       RPC to check for transmittable packets. Must be locked by
+ *             caller. Note: this function will release the RPC lock while
+ *             passing packets through the RPC stack, then reacquire it
+ *             before returning. It is possible that the RPC gets freed
+ *             when the lock isn't held, in which case the state will
+ *             be RPC_DEAD on return.
+ * @force:     True means send at least one packet, even if the NIC queue
+ *             is too long. False means that zero packets may be sent, if
+ *             the NIC queue is sufficiently long.
+ */
+void homa_xmit_data(struct homa_rpc *rpc, bool force)
+{
+	struct homa *homa = rpc->hsk->homa;
+
+	atomic_inc(&rpc->msgout.active_xmits);
+	while (*rpc->msgout.next_xmit) {
+		struct sk_buff *skb = *rpc->msgout.next_xmit;
+
+		if ((rpc->msgout.length - rpc->msgout.next_xmit_offset)
+				>= homa->throttle_min_bytes) {
+			if (!homa_check_nic_queue(homa, skb, force)) {
+				homa_add_to_throttled(rpc);
+				break;
+			}
+		}
+
+		rpc->msgout.next_xmit = &(homa_get_skb_info(skb)->next_skb);
+		rpc->msgout.next_xmit_offset +=
+				homa_get_skb_info(skb)->data_bytes;
+
+		homa_rpc_unlock(rpc);
+		skb_get(skb);
+		__homa_xmit_data(skb, rpc);
+		force = false;
+		homa_rpc_lock(rpc, "homa_xmit_data");
+		if (rpc->state == RPC_DEAD)
+			break;
+	}
+	atomic_dec(&rpc->msgout.active_xmits);
+}
+
+/**
+ * __homa_xmit_data() - Handles packet transmission stuff that is common
+ * to homa_xmit_data and homa_resend_data.
+ * @skb:      Packet to be sent. The packet will be freed after transmission
+ *            (and also if errors prevented transmission).
+ * @rpc:      Information about the RPC that the packet belongs to.
+ */
+void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+	struct dst_entry *dst;
+
+	dst = homa_get_dst(rpc->peer, rpc->hsk);
+	dst_hold(dst);
+	skb_dst_set(skb, dst);
+
+	skb->ooo_okay = 1;
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	skb->csum_start = skb_transport_header(skb) - skb->head;
+	skb->csum_offset = offsetof(struct common_header, checksum);
+	if (rpc->hsk->inet.sk.sk_family == AF_INET6)
+		ip6_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow.u.ip6,
+			 0, NULL, 0, 0);
+	else
+		ip_queue_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow);
+}
+
+/**
+ * homa_resend_data() - This function is invoked as part of handling RESEND
+ * requests. It retransmits the packet(s) containing a given range of bytes
+ * from a message.
+ * @rpc:      RPC for which data should be resent.
+ * @start:    Offset within @rpc->msgout of the first byte to retransmit.
+ * @end:      Offset within @rpc->msgout of the byte just after the last one
+ *            to retransmit.
+ */
+void homa_resend_data(struct homa_rpc *rpc, int start, int end)
+{
+	struct homa_skb_info *homa_info;
+	struct sk_buff *skb;
+
+	if (end <= start)
+		return;
+
+	/* Each iteration of this loop checks one packet in the message
+	 * to see if it contains segments that need to be retransmitted.
+	 */
+	for (skb = rpc->msgout.packets; skb; skb = homa_info->next_skb) {
+		int seg_offset, offset, seg_length, data_left;
+		struct data_header *h;
+
+		homa_info = homa_get_skb_info(skb);
+		offset = homa_info->offset;
+		if (offset >= end)
+			break;
+		if (start >= (offset + homa_info->data_bytes))
+			continue;
+
+		offset = homa_info->offset;
+		seg_offset = sizeof32(struct data_header);
+		data_left = homa_info->data_bytes;
+		if (skb_shinfo(skb)->gso_segs <= 1) {
+			seg_length = data_left;
+		} else {
+			seg_length = homa_info->seg_length;
+			h = (struct data_header *)skb_transport_header(skb);
+		}
+		for ( ; data_left > 0; data_left -= seg_length,
+		     offset += seg_length,
+		     seg_offset += skb_shinfo(skb)->gso_size) {
+			struct homa_skb_info *new_homa_info;
+			struct sk_buff *new_skb;
+			int err;
+
+			if (seg_length > data_left)
+				seg_length = data_left;
+
+			if (end <= offset)
+				goto resend_done;
+			if ((offset + seg_length) <= start)
+				continue;
+
+			/* This segment must be retransmitted. */
+			new_skb = homa_skb_new_tx(sizeof(struct data_header)
+					+ seg_length);
+			if (unlikely(!new_skb)) {
+				UNIT_LOG("; ", "skb allocation error");
+				goto resend_done;
+			}
+			h = __skb_put_data(new_skb, skb_transport_header(skb),
+					   sizeof32(struct data_header));
+			h->common.sequence = htonl(offset);
+			h->seg.offset = htonl(offset);
+			h->retransmit = 1;
+			h->incoming = htonl(rpc->msgout.length);
+			err = homa_skb_append_from_skb(rpc->hsk->homa, new_skb,
+						       skb, seg_offset,
+						       seg_length);
+			if (err != 0) {
+				pr_err("%s got error %d from homa_skb_append_from_skb\n",
+				       __func__, err);
+				UNIT_LOG("; ", "%s got error %d while copying data",
+					 __func__, -err);
+				kfree_skb(new_skb);
+				goto resend_done;
+			}
+
+			new_homa_info = homa_get_skb_info(new_skb);
+			new_homa_info->wire_bytes = rpc->hsk->ip_header_length
+					+ sizeof(struct data_header)
+					+ seg_length + HOMA_ETH_OVERHEAD;
+			new_homa_info->data_bytes = seg_length;
+			new_homa_info->seg_length = seg_length;
+			new_homa_info->offset = offset;
+			homa_check_nic_queue(rpc->hsk->homa, new_skb, true);
+			__homa_xmit_data(new_skb, rpc);
+		}
+	}
+
+resend_done:
+}
+
+/**
+ * homa_outgoing_sysctl_changed() - Invoked whenever a sysctl value is changed;
+ * any output-related parameters that depend on sysctl-settable values.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_outgoing_sysctl_changed(struct homa *homa)
+{
+	__u64 tmp;
+
+	/* Code below is written carefully to avoid integer underflow or
+	 * overflow under expected usage patterns. Be careful when changing!
+	 */
+	homa->cycles_per_kbyte = (8 * (__u64)cpu_khz) / homa->link_mbps;
+	homa->cycles_per_kbyte = (101 * homa->cycles_per_kbyte) / 100;
+	tmp = homa->max_nic_queue_ns;
+	tmp = (tmp * cpu_khz) / 1000000;
+	homa->max_nic_queue_cycles = tmp;
+}
+
+/**
+ * homa_check_nic_queue() - This function is invoked before passing a packet
+ * to the NIC for transmission. It serves two purposes. First, it maintains
+ * an estimate of the NIC queue length. Second, it indicates to the caller
+ * whether the NIC queue is so full that no new packets should be queued
+ * (Homa's SRPT depends on keeping the NIC queue short).
+ * @homa:     Overall data about the Homa protocol implementation.
+ * @skb:      Packet that is about to be transmitted.
+ * @force:    True means this packet is going to be transmitted
+ *            regardless of the queue length.
+ * Return:    Nonzero is returned if either the NIC queue length is
+ *            acceptably short or @force was specified. 0 means that the
+ *            NIC queue is at capacity or beyond, so the caller should delay
+ *            the transmission of @skb. If nonzero is returned, then the
+ *            queue estimate is updated to reflect the transmission of @skb.
+ */
+int homa_check_nic_queue(struct homa *homa, struct sk_buff *skb, bool force)
+{
+	int cycles_for_packet, bytes;
+	__u64 idle, new_idle, clock;
+
+	bytes = homa_get_skb_info(skb)->wire_bytes;
+	cycles_for_packet = (bytes * homa->cycles_per_kbyte) / 1000;
+	while (1) {
+		clock = get_cycles();
+		idle = atomic64_read(&homa->link_idle_time);
+		if ((clock + homa->max_nic_queue_cycles) < idle && !force &&
+		    !(homa->flags & HOMA_FLAG_DONT_THROTTLE))
+			return 0;
+		if (idle < clock)
+			new_idle = clock + cycles_for_packet;
+		else
+			new_idle = idle + cycles_for_packet;
+
+		/* This method must be thread-safe. */
+		if (atomic64_cmpxchg_relaxed(&homa->link_idle_time, idle,
+					     new_idle) == idle)
+			break;
+	}
+	return 1;
+}
+
+/**
+ * homa_pacer_main() - Top-level function for the pacer thread.
+ * @transport:  Pointer to struct homa.
+ *
+ * Return:         Always 0.
+ */
+int homa_pacer_main(void *transport)
+{
+	struct homa *homa = (struct homa *)transport;
+
+	homa->pacer_wake_time = get_cycles();
+	while (1) {
+		if (homa->pacer_exit) {
+			homa->pacer_wake_time = 0;
+			break;
+		}
+		homa_pacer_xmit(homa);
+
+		/* Sleep this thread if the throttled list is empty. Even
+		 * if the throttled list isn't empty, call the scheduler
+		 * to give other processes a chance to run (if we don't,
+		 * softirq handlers can get locked out, which prevents
+		 * incoming packets from being handled).
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (list_first_or_null_rcu(&homa->throttled_rpcs,
+					   struct homa_rpc,
+					   throttled_links) != NULL)
+			__set_current_state(TASK_RUNNING);
+		homa->pacer_wake_time = 0;
+		schedule();
+		homa->pacer_wake_time = get_cycles();
+		__set_current_state(TASK_RUNNING);
+	}
+	kthread_complete_and_exit(&homa_pacer_kthread_done, 0);
+	return 0;
+}
+
+/**
+ * homa_pacer_xmit() - Transmit packets from  the throttled list. Note:
+ * this function may be invoked from either process context or softirq (BH)
+ * level. This function is invoked from multiple places, not just in the
+ * pacer thread. The reason for this is that (as of 10/2019) Linux's scheduling
+ * of the pacer thread is unpredictable: the thread may block for long periods
+ * of time (e.g., because it is assigned to the same CPU as a busy interrupt
+ * handler). This can result in poor utilization of the network link. So,
+ * this method gets invoked from other places as well, to increase the
+ * likelihood that we keep the link busy. Those other invocations are not
+ * guaranteed to happen, so the pacer thread provides a backstop.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_pacer_xmit(struct homa *homa)
+{
+	struct homa_rpc *rpc;
+	int i;
+
+	/* Make sure only one instance of this function executes at a
+	 * time.
+	 */
+	if (!spin_trylock_bh(&homa->pacer_mutex))
+		return;
+
+	/* Each iteration through the following loop sends one packet. We
+	 * limit the number of passes through this loop in order to cap the
+	 * time spent in one call to this function (see note in
+	 * homa_pacer_main about interfering with softirq handlers).
+	 */
+	for (i = 0; i < 5; i++) {
+		__u64 idle_time, now;
+
+		/* If the NIC queue is too long, wait until it gets shorter. */
+		now = get_cycles();
+		idle_time = atomic64_read(&homa->link_idle_time);
+		while ((now + homa->max_nic_queue_cycles) < idle_time) {
+			/* If we've xmitted at least one packet then
+			 * return (this helps with testing and also
+			 * allows homa_pacer_main to yield the core).
+			 */
+			if (i != 0)
+				goto done;
+			now = get_cycles();
+		}
+		/* Note: when we get here, it's possible that the NIC queue is
+		 * still too long because other threads have queued packets,
+		 * but we transmit anyway so we don't starve (see perf.text
+		 * for more info).
+		 */
+
+		/* Lock the first throttled RPC. This may not be possible
+		 * because we have to hold throttle_lock while locking
+		 * the RPC; that means we can't wait for the RPC lock because
+		 * of lock ordering constraints (see sync.txt). Thus, if
+		 * the RPC lock isn't available, do nothing. Holding the
+		 * throttle lock while locking the RPC is important because
+		 * it keeps the RPC from being deleted before it can be locked.
+		 */
+		homa_throttle_lock(homa);
+		homa->pacer_fifo_count -= homa->pacer_fifo_fraction;
+		if (homa->pacer_fifo_count <= 0) {
+			struct homa_rpc *cur;
+			__u64 oldest = ~0;
+
+			homa->pacer_fifo_count += 1000;
+			rpc = NULL;
+			list_for_each_entry_rcu(cur, &homa->throttled_rpcs,
+						throttled_links) {
+				if (cur->msgout.init_cycles < oldest) {
+					rpc = cur;
+					oldest = cur->msgout.init_cycles;
+				}
+			}
+		} else {
+			rpc = list_first_or_null_rcu(&homa->throttled_rpcs,
+						     struct homa_rpc,
+						     throttled_links);
+		}
+		if (!rpc) {
+			homa_throttle_unlock(homa);
+			break;
+		}
+		if (!homa_bucket_try_lock(rpc->bucket, rpc->id,
+					  "homa_pacer_xmit")) {
+			homa_throttle_unlock(homa);
+			break;
+		}
+		homa_throttle_unlock(homa);
+
+		homa_xmit_data(rpc, true);
+
+		/* Note: rpc->state could be RPC_DEAD here, but the code
+		 * below should work anyway.
+		 */
+		if (!*rpc->msgout.next_xmit) {
+			/* Nothing more to transmit from this message (right now),
+			 * so remove it from the throttled list.
+			 */
+			homa_throttle_lock(homa);
+			if (!list_empty(&rpc->throttled_links)) {
+				list_del_rcu(&rpc->throttled_links);
+
+				/* Note: this reinitialization is only safe
+				 * because the pacer only looks at the first
+				 * element of the list, rather than traversing
+				 * it (and besides, we know the pacer isn't
+				 * active concurrently, since this code *is*
+				 * the pacer). It would not be safe under more
+				 * general usage patterns.
+				 */
+				INIT_LIST_HEAD_RCU(&rpc->throttled_links);
+			}
+			homa_throttle_unlock(homa);
+		}
+		homa_rpc_unlock(rpc);
+	}
+done:
+	spin_unlock_bh(&homa->pacer_mutex);
+}
+
+/**
+ * homa_pacer_stop() - Will cause the pacer thread to exit (waking it up
+ * if necessary); doesn't return until after the pacer thread has exited.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_pacer_stop(struct homa *homa)
+{
+	homa->pacer_exit = true;
+	wake_up_process(homa->pacer_kthread);
+	kthread_stop(homa->pacer_kthread);
+	homa->pacer_kthread = NULL;
+}
+
+/**
+ * homa_add_to_throttled() - Make sure that an RPC is on the throttled list
+ * and wake up the pacer thread if necessary.
+ * @rpc:     RPC with outbound packets that have been granted but can't be
+ *           sent because of NIC queue restrictions.
+ */
+void homa_add_to_throttled(struct homa_rpc *rpc)
+{
+	struct homa *homa = rpc->hsk->homa;
+	struct homa_rpc *candidate;
+	int bytes_left;
+	int checks = 0;
+	__u64 now;
+
+	if (!list_empty(&rpc->throttled_links))
+		return;
+	now = get_cycles();
+	homa->throttle_add = now;
+	bytes_left = rpc->msgout.length - rpc->msgout.next_xmit_offset;
+	homa_throttle_lock(homa);
+	list_for_each_entry_rcu(candidate, &homa->throttled_rpcs,
+				throttled_links) {
+		int bytes_left_cand;
+
+		checks++;
+
+		/* Watch out: the pacer might have just transmitted the last
+		 * packet from candidate.
+		 */
+		bytes_left_cand = candidate->msgout.length -
+				candidate->msgout.next_xmit_offset;
+		if (bytes_left_cand > bytes_left) {
+			list_add_tail_rcu(&rpc->throttled_links,
+					  &candidate->throttled_links);
+			goto done;
+		}
+	}
+	list_add_tail_rcu(&rpc->throttled_links, &homa->throttled_rpcs);
+done:
+	homa_throttle_unlock(homa);
+	wake_up_process(homa->pacer_kthread);
+//	tt_record("woke up pacer thread");
+}
+
+/**
+ * homa_remove_from_throttled() - Make sure that an RPC is not on the
+ * throttled list.
+ * @rpc:     RPC of interest.
+ */
+void homa_remove_from_throttled(struct homa_rpc *rpc)
+{
+	if (unlikely(!list_empty(&rpc->throttled_links))) {
+		UNIT_LOG("; ", "removing id %llu from throttled list", rpc->id);
+		homa_throttle_lock(rpc->hsk->homa);
+		list_del(&rpc->throttled_links);
+		homa_throttle_unlock(rpc->hsk->homa);
+		INIT_LIST_HEAD(&rpc->throttled_links);
+	}
+}

From patchwork Mon Oct 28 21:35:37 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854178
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 089931EE00F
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:23 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151386; cv=none;
 b=DtZ3CLexEo6M8vCrmGTPLlN8qS/UO/7NLKl3P5jbf2LFvNXOooVxpdcMP/Jx4nHnfr7XlN72ruURsGHJXZ5/5HYXir1BfnGcS3vVODZfYhlU/Heud1eDgy2ntyG/a1C/JMUeYbakp1GdM7UabZfuFMPiWdhfPBtfwawiFrHIhs4=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151386; c=relaxed/simple;
	bh=P6t3xzWi/gWSVvJ49SLmYgfY1Z1X4R6tRCbS5Ilj/Io=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=JLgRSmDqRfDxDBMccWpLQh9QM22nkLkbZdAMVPoHdVI04j2K3Amo1rQY+pbAbHwyjUctRC9lQNCz+3BJuRjdk/Yhtk96vBs2M1+kDczoVK6GonndeVMEzE4EOLNs/wm5cm5jfLKe5ZW6Mreps6EI2qh2PCdkMTw3CR7JKjxAM2Q=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=snMO7nk5; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="snMO7nk5"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=CnwDCQWkPH0K+jB7tLLOeVecnM9nTUcy9BLIZuTMeMY=; t=1730151384; x=1731015384;
	b=snMO7nk5slAjZj2OGc62f1Bp414IEKCHqFGLNqlnmNWYstBAWZ4iICdRTa5x25eOx9aGgFO6atn
	PgH/MPdhDcsYr7IvB61Ayx5EW/LHZqM+Ah25oEmrG29t4UlUw8LydCTU9Ws4G54Pop/M1mOl7+i0Z
	jfqZtNdj4DR6goVXNEdeUjlDewMeKQE1BBpG6mGoPsnQM2KmOeBWejSTpCoxtzbIhHfUTfRfPGt4W
	iLAmc0WsaNOGtalLdHU9LFHoZoCrH2HnXC5j7cvdEyza/T9P4UtamW4OdMXVFllWbvc7PEeZSFFgq
	DzRppeB5I+UR5hp6MjptXqmjNCTZzkmNZ9QQ==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPS-0005xj-Vp; Mon, 28 Oct 2024 14:36:23 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 10/12] net: homa: create homa_timer.c
Date: Mon, 28 Oct 2024 14:35:37 -0700
Message-ID: <20241028213541.1529-11-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 8d916073a6e03d0ab55799977e12a57e
X-Patchwork-Delegate: kuba@kernel.org

This file contains code that wakes up periodically to check for
missing data, initiate retransmissions, and declare peer nodes
"dead".

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_timer.c | 158 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 net/homa/homa_timer.c

diff --git a/net/homa/homa_timer.c b/net/homa/homa_timer.c
new file mode 100644
index 000000000000..bce7c02bb1cd
--- /dev/null
+++ b/net/homa/homa_timer.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file handles timing-related functions for Homa, such as retries
+ * and timeouts.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+
+/**
+ * homa_check_rpc() -  Invoked for each RPC during each timer pass; does
+ * most of the work of checking for time-related actions such as sending
+ * resends, aborting RPCs for which there is no response, and sending
+ * requests for acks. It is separate from homa_timer because homa_timer
+ * got too long and deeply indented.
+ * @rpc:     RPC to check; must be locked by the caller.
+ */
+void homa_check_rpc(struct homa_rpc *rpc)
+{
+	struct homa *homa = rpc->hsk->homa;
+	struct resend_header resend;
+
+	/* See if we need to request an ack for this RPC. */
+	if (!homa_is_client(rpc->id) && rpc->state == RPC_OUTGOING &&
+	    rpc->msgout.next_xmit_offset >= rpc->msgout.length) {
+		if (rpc->done_timer_ticks == 0) {
+			rpc->done_timer_ticks = homa->timer_ticks;
+		} else {
+			/* >= comparison that handles tick wrap-around. */
+			if ((rpc->done_timer_ticks + homa->request_ack_ticks
+					- 1 - homa->timer_ticks) & 1 << 31) {
+				struct need_ack_header h;
+
+				homa_xmit_control(NEED_ACK, &h, sizeof(h), rpc);
+			}
+		}
+	}
+
+	if (rpc->state == RPC_INCOMING) {
+		if (rpc->msgin.num_bpages == 0) {
+			/* Waiting for buffer space, so no problem. */
+			rpc->silent_ticks = 0;
+			return;
+		}
+	} else if (!homa_is_client(rpc->id)) {
+		/* We're the server and we've received the input message;
+		 * no need to worry about retries.
+		 */
+		rpc->silent_ticks = 0;
+		return;
+	}
+
+	if (rpc->state == RPC_OUTGOING) {
+		if (rpc->msgout.next_xmit_offset < rpc->msgout.length) {
+			/* There are bytes that we haven't transmitted,
+			 * so no need to be concerned; the ball is in our court.
+			 */
+			rpc->silent_ticks = 0;
+			return;
+		}
+	}
+
+	if (rpc->silent_ticks < homa->resend_ticks)
+		return;
+	if (rpc->silent_ticks >= homa->timeout_ticks) {
+		homa_rpc_abort(rpc, -ETIMEDOUT);
+		return;
+	}
+	if (((rpc->silent_ticks - homa->resend_ticks) % homa->resend_interval)
+			!= 0)
+		return;
+
+	/* Issue a resend for the bytes just after the last ones received
+	 * (gaps in the middle were already handled by homa_gap_retry above).
+	 */
+	if (rpc->msgin.length < 0) {
+		/* Haven't received any data for this message; request
+		 * retransmission of just the first packet (the sender
+		 * will send at least one full packet, regardless of
+		 * the length below).
+		 */
+		resend.offset = htonl(0);
+		resend.length = htonl(100);
+	} else {
+		homa_gap_retry(rpc);
+		resend.offset = htonl(rpc->msgin.recv_end);
+		resend.length = htonl(rpc->msgin.length - rpc->msgin.recv_end);
+		if (resend.length == 0)
+			return;
+	}
+	homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+}
+
+/**
+ * homa_timer() - This function is invoked at regular intervals ("ticks")
+ * to implement retries and aborts for Homa.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_timer(struct homa *homa)
+{
+	struct homa_socktab_scan scan;
+	struct homa_sock *hsk;
+	struct homa_rpc *rpc;
+	int total_rpcs = 0;
+	int rpc_count = 0;
+
+	homa->timer_ticks++;
+
+	/* Scan all existing RPCs in all sockets.  The rcu_read_lock
+	 * below prevents sockets from being deleted during the scan.
+	 */
+	rcu_read_lock();
+	for (hsk = homa_socktab_start_scan(homa->port_map, &scan);
+			hsk; hsk = homa_socktab_next(&scan)) {
+		while (hsk->dead_skbs >= homa->dead_buffs_limit)
+			/* If we get here, it means that homa_wait_for_message
+			 * isn't keeping up with RPC reaping, so we'll help
+			 * out.  See reap.txt for more info.
+			 */
+			if (homa_rpc_reap(hsk, hsk->homa->reap_limit) == 0)
+				break;
+
+		if (list_empty(&hsk->active_rpcs) || hsk->shutdown)
+			continue;
+
+		if (!homa_protect_rpcs(hsk))
+			continue;
+		list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+			total_rpcs++;
+			homa_rpc_lock(rpc, "homa_timer");
+			if (rpc->state == RPC_IN_SERVICE) {
+				rpc->silent_ticks = 0;
+				homa_rpc_unlock(rpc);
+				continue;
+			}
+			rpc->silent_ticks++;
+			homa_check_rpc(rpc);
+			homa_rpc_unlock(rpc);
+			rpc_count++;
+			if (rpc_count >= 10) {
+				/* Give other kernel threads a chance to run
+				 * on this core. Must release the RCU read lock
+				 * while doing this.
+				 */
+				rcu_read_unlock();
+				schedule();
+				rcu_read_lock();
+				rpc_count = 0;
+			}
+		}
+		homa_unprotect_rpcs(hsk);
+	}
+	rcu_read_unlock();
+
+//	if (total_rpcs > 0)
+//		tt_record1("homa_timer finished scanning %d RPCs", total_rpcs);
+}

From patchwork Mon Oct 28 21:35:38 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854182
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6FE361EE033
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:25 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151390; cv=none;
 b=VeEj9wfzGG/LJL8Ce463IzzwESwXEQdbXKWBHN9B7rnFFizeNOqZc/IfukuSm9vXClGf+4qeFDPJYWuZz2Q3B62kbuw1vihq/39fpE0l+0RWS0LkH9/jCejymfrYlJFkwfyE7YeVsmeFNsTx3Kv96ruJt5UQJHVLCtnSQmqnzCA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151390; c=relaxed/simple;
	bh=2tNTDOSYDdMumO5GrefUvkkWBevtO3gNBz34o0b9v/M=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=dYyDyl07i9EeFKX2IQCXuK1P7ue5aO10a3IZYtW63sBhsMoWfc+XiiLne/XP7A40LGKsKJUzrkKMKxkxdPdKFTx6si9hoGa0J+C76mG3t6veVpUniU0Vm4sveh/9KAB0pYMx+ZrET32OUSKin94PJQfPMokeX7TtysAS1Huto8s=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=T0ib9wzf; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="T0ib9wzf"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=2kW9qOdXFS1gcc65Q0g3P/P0ruvkSoG6sY38tNGOaB0=; t=1730151385; x=1731015385;
	b=T0ib9wzfuzJZrr3CsGznPR2W8YHhszIYwNOteYx0jBVjk+7Px+HxC9Sbx3HBhznMGOyuzM485GL
	PsKkFkHmtCikseRJ9lsIAgBh6lRTn1M8AQSv4OdNqGDDDpr1tCbDg1PPbyLukFIJGz9npj+jXOnBZ
	WhqlHvPuLODIm23u+N0vgZ12HAY7b45v95a4ou7gZ+rqO7rwftQMMxs6euTuic7ONy6pDldnwr6zq
	xYQ+hUKLQwRFWH2euq4d6g5tsYV3A9jsMeD7BI+vJ+5o5TWjxqhKI7bmdv9689t6SgDnAhr2TusWw
	3uKIlFAAHKCC22uqmr448Z4ksb8XRQ7Wx70Q==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPT-0005xj-HM; Mon, 28 Oct 2024 14:36:24 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 11/12] net: homa: create homa_plumbing.c homa_utils.c
Date: Mon, 28 Oct 2024 14:35:38 -0700
Message-ID: <20241028213541.1529-12-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: 73c8156d6d585648a2e51bfbdcc3b618
X-Patchwork-Delegate: kuba@kernel.org

homa_plumbing.c contains functions that connect Homa to the rest of
the Linux kernel, such as dispatch tables used by Linux and the
top-level functions that Linux invokes from those dispatch tables.

homa_utils.c contains a few odds and ends, such as code to initialize
and destroy struct homa's.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_plumbing.c | 997 +++++++++++++++++++++++++++++++++++++++
 net/homa/homa_utils.c    | 138 ++++++
 2 files changed, 1135 insertions(+)
 create mode 100644 net/homa/homa_plumbing.c
 create mode 100644 net/homa/homa_utils.c

diff --git a/net/homa/homa_plumbing.c b/net/homa/homa_plumbing.c
new file mode 100644
index 000000000000..824531bd43f9
--- /dev/null
+++ b/net/homa/homa_plumbing.c
@@ -0,0 +1,997 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file consists mostly of "glue" that hooks Homa into the rest of
+ * the Linux kernel. The guts of the protocol are in other files.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+#ifndef __UNIT_TEST__
+MODULE_LICENSE("Dual MIT/GPL");
+#endif
+MODULE_AUTHOR("John Ousterhout");
+MODULE_DESCRIPTION("Homa transport protocol");
+MODULE_VERSION("0.01");
+
+/* Not yet sure what these variables are for */
+long sysctl_homa_mem[3] __read_mostly;
+int sysctl_homa_rmem_min __read_mostly;
+int sysctl_homa_wmem_min __read_mostly;
+
+/* Global data for Homa. Never reference homa_data directory. Always use
+ * the homa variable instead; this allows overriding during unit tests.
+ */
+struct homa homa_data;
+struct homa *homa = &homa_data;
+
+/* True means that the Homa module is in the process of unloading itself,
+ * so everyone should clean up.
+ */
+static bool exiting;
+
+/* Thread that runs timer code to detect lost packets and crashed peers. */
+static struct task_struct *timer_kthread;
+
+/* This structure defines functions that handle various operations on
+ * Homa sockets. These functions are relatively generic: they are called
+ * to implement top-level system calls. Many of these operations can
+ * be implemented by PF_INET6 functions that are independent of the
+ * Homa protocol.
+ */
+const struct proto_ops homa_proto_ops = {
+	.family		   = PF_INET,
+	.owner		   = THIS_MODULE,
+	.release	   = inet_release,
+	.bind		   = homa_bind,
+	.connect	   = inet_dgram_connect,
+	.socketpair	   = sock_no_socketpair,
+	.accept		   = sock_no_accept,
+	.getname	   = inet_getname,
+	.poll		   = homa_poll,
+	.ioctl		   = inet_ioctl,
+	.listen		   = sock_no_listen,
+	.shutdown	   = homa_shutdown,
+	.setsockopt	   = sock_common_setsockopt,
+	.getsockopt	   = sock_common_getsockopt,
+	.sendmsg	   = inet_sendmsg,
+	.recvmsg	   = inet_recvmsg,
+	.mmap		   = sock_no_mmap,
+	.set_peek_off	   = sk_set_peek_off,
+};
+
+const struct proto_ops homav6_proto_ops = {
+	.family		   = PF_INET6,
+	.owner		   = THIS_MODULE,
+	.release	   = inet6_release,
+	.bind		   = homa_bind,
+	.connect	   = inet_dgram_connect,
+	.socketpair	   = sock_no_socketpair,
+	.accept		   = sock_no_accept,
+	.getname	   = inet6_getname,
+	.poll		   = homa_poll,
+	.ioctl		   = inet6_ioctl,
+	.listen		   = sock_no_listen,
+	.shutdown	   = homa_shutdown,
+	.setsockopt	   = sock_common_setsockopt,
+	.getsockopt	   = sock_common_getsockopt,
+	.sendmsg	   = inet_sendmsg,
+	.recvmsg	   = inet_recvmsg,
+	.mmap		   = sock_no_mmap,
+	.set_peek_off	   = sk_set_peek_off,
+};
+
+/* This structure also defines functions that handle various operations
+ * on Homa sockets. However, these functions are lower-level than those
+ * in homa_proto_ops: they are specific to the PF_INET or PF_INET6
+ * protocol family, and in many cases they are invoked by functions in
+ * homa_proto_ops. Most of these functions have Homa-specific implementations.
+ */
+struct proto homa_prot = {
+	.name		   = "HOMA",
+	.owner		   = THIS_MODULE,
+	.close		   = homa_close,
+	.connect	   = ip4_datagram_connect,
+	.disconnect	   = homa_disconnect,
+	.ioctl		   = homa_ioctl,
+	.init		   = homa_socket,
+	.destroy	   = 0,
+	.setsockopt	   = homa_setsockopt,
+	.getsockopt	   = homa_getsockopt,
+	.sendmsg	   = homa_sendmsg,
+	.recvmsg	   = homa_recvmsg,
+	.backlog_rcv       = homa_backlog_rcv,
+	.hash		   = homa_hash,
+	.unhash		   = homa_unhash,
+	.get_port	   = homa_get_port,
+	.sysctl_mem	   = sysctl_homa_mem,
+	.sysctl_wmem	   = &sysctl_homa_wmem_min,
+	.sysctl_rmem	   = &sysctl_homa_rmem_min,
+	.obj_size	   = sizeof(struct homa_sock),
+	.no_autobind       = 1,
+};
+
+struct proto homav6_prot = {
+	.name		   = "HOMAv6",
+	.owner		   = THIS_MODULE,
+	.close		   = homa_close,
+	.connect	   = ip6_datagram_connect,
+	.disconnect	   = homa_disconnect,
+	.ioctl		   = homa_ioctl,
+	.init		   = homa_socket,
+	.destroy	   = 0,
+	.setsockopt	   = homa_setsockopt,
+	.getsockopt	   = homa_getsockopt,
+	.sendmsg	   = homa_sendmsg,
+	.recvmsg	   = homa_recvmsg,
+	.backlog_rcv       = homa_backlog_rcv,
+	.hash		   = homa_hash,
+	.unhash		   = homa_unhash,
+	.get_port	   = homa_get_port,
+	.sysctl_mem	   = sysctl_homa_mem,
+	.sysctl_wmem	   = &sysctl_homa_wmem_min,
+	.sysctl_rmem	   = &sysctl_homa_rmem_min,
+
+	/* IPv6 data comes *after* Homa's data, and isn't included in
+	 * struct homa_sock.
+	 */
+	.obj_size	   = sizeof(struct homa_sock) + sizeof(struct ipv6_pinfo),
+	.no_autobind       = 1,
+};
+
+/* Top-level structure describing the Homa protocol. */
+struct inet_protosw homa_protosw = {
+	.type              = SOCK_DGRAM,
+	.protocol          = IPPROTO_HOMA,
+	.prot              = &homa_prot,
+	.ops               = &homa_proto_ops,
+	.flags             = INET_PROTOSW_REUSE,
+};
+
+struct inet_protosw homav6_protosw = {
+	.type              = SOCK_DGRAM,
+	.protocol          = IPPROTO_HOMA,
+	.prot              = &homav6_prot,
+	.ops               = &homav6_proto_ops,
+	.flags             = INET_PROTOSW_REUSE,
+};
+
+/* This structure is used by IP to deliver incoming Homa packets to us. */
+static struct net_protocol homa_protocol = {
+	.handler =	homa_softirq,
+	.err_handler =	homa_err_handler_v4,
+	.no_policy =     1,
+};
+
+static struct inet6_protocol homav6_protocol = {
+	.handler =	homa_softirq,
+	.err_handler =	homa_err_handler_v6,
+	.flags =        INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL,
+};
+
+/* Sizes of the headers for each Homa packet type, in bytes. */
+static __u16 header_lengths[] = {
+	sizeof32(struct data_header),
+	0,
+	sizeof32(struct resend_header),
+	sizeof32(struct unknown_header),
+	sizeof32(struct busy_header),
+	0,
+	sizeof32(struct common_header),
+	sizeof32(struct need_ack_header),
+	sizeof32(struct ack_header)
+};
+
+static DECLARE_COMPLETION(timer_thread_done);
+
+/**
+ * homa_load() - invoked when this module is loaded into the Linux kernel
+ * Return: 0 on success, otherwise a negative errno.
+ */
+static int __init homa_load(void)
+{
+	int status;
+
+	pr_notice("Homa module loading\n");
+	pr_notice("Homa structure sizes: data_header %u, seg_header %u, ack %u, peer %u, ip_hdr %u flowi %u ipv6_hdr %u, flowi6 %u tcp_sock %u homa_rpc %u sk_buff %u rcvmsg_control %u union sockaddr_in_union %u HOMA_MAX_BPAGES %u NR_CPUS %u nr_cpu_ids %u, MAX_NUMNODES %d\n",
+		  sizeof32(struct data_header),
+		  sizeof32(struct seg_header),
+		  sizeof32(struct homa_ack),
+		  sizeof32(struct homa_peer),
+		  sizeof32(struct iphdr),
+		  sizeof32(struct flowi),
+		  sizeof32(struct ipv6hdr),
+		  sizeof32(struct flowi6),
+		  sizeof32(struct tcp_sock),
+		  sizeof32(struct homa_rpc),
+		  sizeof32(struct sk_buff),
+		  sizeof32(struct homa_recvmsg_args),
+		  sizeof32(union sockaddr_in_union),
+		  HOMA_MAX_BPAGES,
+		  NR_CPUS,
+		  nr_cpu_ids,
+		  MAX_NUMNODES);
+	status = proto_register(&homa_prot, 1);
+	if (status != 0) {
+		pr_err("proto_register failed for homa_prot: %d\n", status);
+		goto out;
+	}
+	status = proto_register(&homav6_prot, 1);
+	if (status != 0) {
+		pr_err("proto_register failed for homav6_prot: %d\n", status);
+		goto out;
+	}
+	inet_register_protosw(&homa_protosw);
+	inet6_register_protosw(&homav6_protosw);
+	status = inet_add_protocol(&homa_protocol, IPPROTO_HOMA);
+	if (status != 0) {
+		pr_err("inet_add_protocol failed in %s: %d\n", __func__,
+		       status);
+		goto out_cleanup;
+	}
+	status = inet6_add_protocol(&homav6_protocol, IPPROTO_HOMA);
+	if (status != 0) {
+		pr_err("inet6_add_protocol failed in %s: %d\n",  __func__,
+		       status);
+		goto out_cleanup;
+	}
+
+	status = homa_init(homa);
+	if (status)
+		goto out_cleanup;
+
+	timer_kthread = kthread_run(homa_timer_main, homa, "homa_timer");
+	if (IS_ERR(timer_kthread)) {
+		status = PTR_ERR(timer_kthread);
+		pr_err("couldn't create homa pacer thread: error %d\n",
+		       status);
+		timer_kthread = NULL;
+		goto out_cleanup;
+	}
+
+	return 0;
+
+out_cleanup:
+	homa_destroy(homa);
+	inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+	inet_unregister_protosw(&homa_protosw);
+	inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+	inet6_unregister_protosw(&homav6_protosw);
+	proto_unregister(&homa_prot);
+	proto_unregister(&homav6_prot);
+out:
+	return status;
+}
+
+/**
+ * homa_unload() - invoked when this module is unloaded from the Linux kernel.
+ */
+static void __exit homa_unload(void)
+{
+	pr_notice("Homa module unloading\n");
+	exiting = true;
+
+	if (timer_kthread)
+		wake_up_process(timer_kthread);
+	wait_for_completion(&timer_thread_done);
+	homa_destroy(homa);
+	inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+	inet_unregister_protosw(&homa_protosw);
+	inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+	inet6_unregister_protosw(&homav6_protosw);
+	proto_unregister(&homa_prot);
+	proto_unregister(&homav6_prot);
+}
+
+module_init(homa_load);
+module_exit(homa_unload);
+
+/**
+ * homa_bind() - Implements the bind system call for Homa sockets: associates
+ * a well-known service port with a socket. Unlike other AF_INET6 protocols,
+ * there is no need to invoke this system call for sockets that are only
+ * used as clients.
+ * @sock:     Socket on which the system call was invoked.
+ * @addr:    Contains the desired port number.
+ * @addr_len: Number of bytes in uaddr.
+ * Return:    0 on success, otherwise a negative errno.
+ */
+int homa_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	struct homa_sock *hsk = homa_sk(sock->sk);
+	union sockaddr_in_union *addr_in = (union sockaddr_in_union *)addr;
+	int port = 0;
+
+	if (unlikely(addr->sa_family != sock->sk->sk_family))
+		return -EAFNOSUPPORT;
+	if (addr_in->in6.sin6_family == AF_INET6) {
+		if (addr_len < sizeof(struct sockaddr_in6))
+			return -EINVAL;
+		port = ntohs(addr_in->in4.sin_port);
+	} else if (addr_in->in4.sin_family == AF_INET) {
+		if (addr_len < sizeof(struct sockaddr_in))
+			return -EINVAL;
+		port = ntohs(addr_in->in6.sin6_port);
+	}
+	return homa_sock_bind(homa->port_map, hsk, port);
+}
+
+/**
+ * homa_close() - Invoked when close system call is invoked on a Homa socket.
+ * @sk:      Socket being closed
+ * @timeout: ??
+ */
+void homa_close(struct sock *sk, long timeout)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+
+	homa_sock_destroy(hsk);
+	sk_common_release(sk);
+}
+
+/**
+ * homa_shutdown() - Implements the shutdown system call for Homa sockets.
+ * @sock:    Socket to shut down.
+ * @how:     Ignored: for other sockets, can independently shut down
+ *           sending and receiving, but for Homa any shutdown will
+ *           shut down everything.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_shutdown(struct socket *sock, int how)
+{
+	homa_sock_shutdown(homa_sk(sock->sk));
+	return 0;
+}
+
+/**
+ * homa_disconnect() - Invoked when disconnect system call is invoked on a
+ * Homa socket.
+ * @sk:    Socket to disconnect
+ * @flags: ??
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_disconnect(struct sock *sk, int flags)
+{
+	pr_warn("unimplemented disconnect invoked on Homa socket\n");
+	return -EINVAL;
+}
+
+/**
+ * homa_ioctl() - Implements the ioctl system call for Homa sockets.
+ * @sk:    Socket on which the system call was invoked.
+ * @cmd:   Identifier for a particular ioctl operation.
+ * @karg:  Operation-specific argument; typically the address of a block
+ *         of data in user address space.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	return -EINVAL;
+}
+
+/**
+ * homa_socket() - Implements the socket(2) system call for sockets.
+ * @sk:    Socket on which the system call was invoked. The non-Homa
+ *         parts have already been initialized.
+ *
+ * Return: always 0 (success).
+ */
+int homa_socket(struct sock *sk)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+
+	homa_sock_init(hsk, homa);
+	return 0;
+}
+
+/**
+ * homa_setsockopt() - Implements the getsockopt system call for Homa sockets.
+ * @sk:      Socket on which the system call was invoked.
+ * @level:   Level at which the operation should be handled; will always
+ *           be IPPROTO_HOMA.
+ * @optname: Identifies a particular setsockopt operation.
+ * @optval:  Address in user space of information about the option.
+ * @optlen:  Number of bytes of data at @optval.
+ * Return:   0 on success, otherwise a negative errno.
+ */
+int homa_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+		    unsigned int optlen)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	struct homa_set_buf_args args;
+	int ret;
+
+	if (level != IPPROTO_HOMA || optname != SO_HOMA_SET_BUF ||
+	    optlen != sizeof(struct homa_set_buf_args))
+		return -EINVAL;
+
+	if (copy_from_sockptr(&args, optval, optlen))
+		return -EFAULT;
+
+	/* Do a trivial test to make sure we can at least write the first
+	 * page of the region.
+	 */
+	if (copy_to_user(args.start, &args, sizeof(args)))
+		return -EFAULT;
+
+	homa_sock_lock(hsk, "homa_setsockopt SO_HOMA_SET_BUF");
+	ret = homa_pool_init(hsk, args.start, args.length);
+	homa_sock_unlock(hsk);
+	return ret;
+}
+
+/**
+ * homa_getsockopt() - Implements the getsockopt system call for Homa sockets.
+ * @sk:      Socket on which the system call was invoked.
+ * @level:   ??
+ * @optname: Identifies a particular setsockopt operation.
+ * @optval:  Address in user space where the option's value should be stored.
+ * @option:  ??.
+ * Return:   0 on success, otherwise a negative errno.
+ */
+int homa_getsockopt(struct sock *sk, int level, int optname,
+		    char __user *optval, int __user *option)
+{
+	pr_warn("unimplemented getsockopt invoked on Homa socket: level %d, optname %d\n",
+		level, optname);
+	return -EINVAL;
+}
+
+/**
+ * homa_sendmsg() - Send a request or response message on a Homa socket.
+ * @sk:     Socket on which the system call was invoked.
+ * @msg:    Structure describing the message to send; the msg_control
+ *          field points to additional information.
+ * @length: Number of bytes of the message.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t length)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	struct homa_sendmsg_args args;
+	int result = 0;
+	struct homa_rpc *rpc = NULL;
+	union sockaddr_in_union *addr = (union sockaddr_in_union *)msg->msg_name;
+
+	if (unlikely(!msg->msg_control_is_user)) {
+		result = -EINVAL;
+		goto error;
+	}
+	if (unlikely(copy_from_user(&args, msg->msg_control, sizeof(args)))) {
+		result = -EFAULT;
+		goto error;
+	}
+	if (addr->in6.sin6_family != sk->sk_family) {
+		result = -EAFNOSUPPORT;
+		goto error;
+	}
+	if (msg->msg_namelen < sizeof(struct sockaddr_in) ||
+	    (msg->msg_namelen < sizeof(struct sockaddr_in6) &&
+	     addr->in6.sin6_family == AF_INET6)) {
+		result = -EINVAL;
+		goto error;
+	}
+
+	if (!args.id) {
+		/* This is a request message. */
+		rpc = homa_rpc_new_client(hsk, addr);
+		if (IS_ERR(rpc)) {
+			result = PTR_ERR(rpc);
+			rpc = NULL;
+			goto error;
+		}
+		rpc->completion_cookie = args.completion_cookie;
+		result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+		if (result)
+			goto error;
+		args.id = rpc->id;
+		homa_rpc_unlock(rpc);
+		rpc = NULL;
+
+		if (unlikely(copy_to_user(msg->msg_control, &args,
+					  sizeof(args)))) {
+			rpc = homa_find_client_rpc(hsk, args.id);
+			result = -EFAULT;
+			goto error;
+		}
+	} else {
+		/* This is a response message. */
+		struct in6_addr canonical_dest;
+
+		if (args.completion_cookie != 0) {
+			result = -EINVAL;
+			goto error;
+		}
+		canonical_dest = canonical_ipv6_addr(addr);
+
+		rpc = homa_find_server_rpc(hsk, &canonical_dest,
+					   ntohs(addr->in6.sin6_port), args.id);
+		if (!rpc)
+			/* Return without an error if the RPC doesn't exist;
+			 * this could be totally valid (e.g. client is
+			 * no longer interested in it).
+			 */
+			return 0;
+		if (rpc->error) {
+			result = rpc->error;
+			goto error;
+		}
+		if (rpc->state != RPC_IN_SERVICE) {
+			homa_rpc_unlock(rpc);
+			rpc = 0;
+			result = -EINVAL;
+			goto error;
+		}
+		rpc->state = RPC_OUTGOING;
+
+		result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+		if (result && rpc->state != RPC_DEAD)
+			goto error;
+		homa_rpc_unlock(rpc);
+	}
+	return 0;
+
+error:
+	if (rpc) {
+		homa_rpc_free(rpc);
+		homa_rpc_unlock(rpc);
+	}
+	return result;
+}
+
+/**
+ * homa_recvmsg() - Receive a message from a Homa socket.
+ * @sk:          Socket on which the system call was invoked.
+ * @msg:         Controlling information for the receive.
+ * @len:         Total bytes of space available in msg->msg_iov; not used.
+ * @flags:       Flags from system call; only MSG_DONTWAIT is used.
+ * @addr_len:    Store the length of the sender address here
+ * Return:       The length of the message on success, otherwise a negative
+ *               errno.
+ */
+int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
+		 int *addr_len)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	struct homa_recvmsg_args control;
+	struct homa_rpc *rpc;
+	int result;
+
+	if (unlikely(!msg->msg_control)) {
+		/* This test isn't strictly necessary, but it provides a
+		 * hook for testing kernel call times.
+		 */
+		return -EINVAL;
+	}
+	if (msg->msg_controllen != sizeof(control)) {
+		result = -EINVAL;
+		goto done;
+	}
+	if (unlikely(copy_from_user(&control, msg->msg_control,
+				    sizeof(control)))) {
+		result = -EFAULT;
+		goto done;
+	}
+	control.completion_cookie = 0;
+	if (control._pad[0]) {
+		result = -EINVAL;
+		goto done;
+	}
+
+	if (control.num_bpages > HOMA_MAX_BPAGES ||
+	    (control.flags & ~HOMA_RECVMSG_VALID_FLAGS)) {
+		result = -EINVAL;
+		goto done;
+	}
+	homa_pool_release_buffers(hsk->buffer_pool, control.num_bpages,
+				  control.bpage_offsets);
+	control.num_bpages = 0;
+
+	rpc = homa_wait_for_message(hsk, (flags & MSG_DONTWAIT)
+			? (control.flags | HOMA_RECVMSG_NONBLOCKING)
+			: control.flags, control.id);
+	if (IS_ERR(rpc)) {
+		/* If we get here, it means there was an error that prevented
+		 * us from finding an RPC to return. If there's an error in
+		 * the RPC itself we won't get here.
+		 */
+		result = PTR_ERR(rpc);
+		goto done;
+	}
+	result = rpc->error ? rpc->error : rpc->msgin.length;
+
+	/* Collect result information. */
+	control.id = rpc->id;
+	control.completion_cookie = rpc->completion_cookie;
+	if (likely(rpc->msgin.length >= 0)) {
+		control.num_bpages = rpc->msgin.num_bpages;
+		memcpy(control.bpage_offsets, rpc->msgin.bpage_offsets,
+		       sizeof(control.bpage_offsets));
+	}
+	if (sk->sk_family == AF_INET6) {
+		struct sockaddr_in6 *in6 = msg->msg_name;
+
+		in6->sin6_family = AF_INET6;
+		in6->sin6_port = htons(rpc->dport);
+		in6->sin6_addr = rpc->peer->addr;
+		*addr_len = sizeof(*in6);
+	} else {
+		struct sockaddr_in *in4 = msg->msg_name;
+
+		in4->sin_family = AF_INET;
+		in4->sin_port = htons(rpc->dport);
+		in4->sin_addr.s_addr = ipv6_to_ipv4(rpc->peer->addr);
+		*addr_len = sizeof(*in4);
+	}
+	memcpy(&control.peer_addr, msg->msg_name, *addr_len);
+	/* This indicates that the application now owns the buffers, so
+	 * we won't free them in homa_rpc_free.
+	 */
+	rpc->msgin.num_bpages = 0;
+
+	/* Must release the RPC lock (and potentially free the RPC) before
+	 * copying the results back to user space.
+	 */
+	if (homa_is_client(rpc->id)) {
+		homa_peer_add_ack(rpc);
+		homa_rpc_free(rpc);
+	} else {
+		if (result < 0)
+			homa_rpc_free(rpc);
+		else
+			rpc->state = RPC_IN_SERVICE;
+	}
+	homa_rpc_unlock(rpc);
+
+done:
+	if (unlikely(copy_to_user(msg->msg_control, &control, sizeof(control)))) {
+		/* Note: in this case the message's buffers will be leaked. */
+		pr_notice("%s couldn't copy back args\n", __func__);
+		result = -EFAULT;
+	}
+
+	return result;
+}
+
+/**
+ * homa_hash() - Not needed for Homa.
+ * @sk:    Socket for the operation
+ * Return: ??
+ */
+int homa_hash(struct sock *sk)
+{
+	return 0;
+}
+
+/**
+ * homa_unhash() - Not needed for Homa.
+ * @sk:    Socket for the operation
+ */
+void homa_unhash(struct sock *sk)
+{
+}
+
+/**
+ * homa_get_port() - It appears that this function is called to assign a
+ * default port for a socket.
+ * @sk:    Socket for the operation
+ * @snum:  Unclear what this is.
+ * Return: Zero for success, or a negative errno for an error.
+ */
+int homa_get_port(struct sock *sk, unsigned short snum)
+{
+	/* Homa always assigns ports immediately when a socket is created,
+	 * so there is nothing to do here.
+	 */
+	return 0;
+}
+
+/**
+ * homa_softirq() - This function is invoked at SoftIRQ level to handle
+ * incoming packets.
+ * @skb:   The incoming packet.
+ * Return: Always 0
+ */
+int homa_softirq(struct sk_buff *skb)
+{
+	struct sk_buff *packets, *other_pkts, *next;
+	struct sk_buff **prev_link, **other_link;
+	struct common_header *h;
+	int first_packet = 1;
+	static __u64 last;
+	int header_offset;
+	int pull_length;
+	__u64 start;
+
+	start = get_cycles();
+	if ((start - last) > 1000000) {
+		int scaled_ms = (int)(10 * (start - last) / cpu_khz);
+
+		if (scaled_ms >= 50 && scaled_ms < 10000) {
+//			tt_record3("Gap in incoming packets: %d cycles "
+//					"(%d.%1d ms)",
+//					(int) (start - last), scaled_ms/10,
+//					scaled_ms%10);
+//			pr_notice("Gap in incoming packets: %llu "
+//					"cycles, (%d.%1d ms)", (start - last),
+//					scaled_ms/10, scaled_ms%10);
+		}
+	}
+	last = start;
+
+	/* skb may actually contain many distinct packets, linked through
+	 * skb_shinfo(skb)->frag_list by the Homa GRO mechanism. Make a
+	 * pass through the list to process all of the short packets,
+	 * leaving the longer packets in the list. Also, perform various
+	 * prep/cleanup/error checking functions.
+	 */
+	skb->next = skb_shinfo(skb)->frag_list;
+	skb_shinfo(skb)->frag_list = NULL;
+	packets = skb;
+	prev_link = &packets;
+	for (skb = packets; skb; skb = next) {
+		next = skb->next;
+
+		/* Make the header available at skb->data, even if the packet
+		 * is fragmented. One complication: it's possible that the IP
+		 * header hasn't yet been removed (this happens for GRO packets
+		 * on the frag_list, since they aren't handled explicitly by IP.
+		 */
+		header_offset = skb_transport_header(skb) - skb->data;
+		pull_length = HOMA_MAX_HEADER + header_offset;
+		if (pull_length > skb->len)
+			pull_length = skb->len;
+		if (!pskb_may_pull(skb, pull_length)) {
+			UNIT_LOG("", "pskb discard");
+			goto discard;
+		}
+		if (header_offset)
+			__skb_pull(skb, header_offset);
+
+		/* Reject packets that are too short or have bogus types. */
+		h = (struct common_header *)skb->data;
+		if (unlikely(skb->len < sizeof(struct common_header) ||
+			     h->type < DATA || h->type >= BOGUS ||
+			     skb->len < header_lengths[h->type - DATA]))
+			goto discard;
+
+		if (first_packet)
+			first_packet = 0;
+
+		/* Process the packet now if it is a control packet or
+		 * if it contains an entire short message.
+		 */
+		if (h->type != DATA || ntohl(((struct data_header *)h)
+				->message_length) < 1400) {
+			UNIT_LOG("; ", "homa_softirq shortcut type 0x%x",
+				 h->type);
+			*prev_link = skb->next;
+			skb->next = NULL;
+			homa_dispatch_pkts(skb, homa);
+		} else {
+			prev_link = &skb->next;
+		}
+		continue;
+
+discard:
+		*prev_link = skb->next;
+		kfree_skb(skb);
+	}
+
+	/* Now process the longer packets. Each iteration of this loop
+	 * collects all of the packets for a particular RPC and dispatches
+	 * them.
+	 */
+	while (packets) {
+		struct in6_addr saddr, saddr2;
+		struct common_header *h2;
+		struct sk_buff *skb2;
+
+		skb = packets;
+		prev_link = &skb->next;
+		saddr = skb_canonical_ipv6_saddr(skb);
+		other_pkts = NULL;
+		other_link = &other_pkts;
+		h = (struct common_header *)skb->data;
+		for (skb2 = skb->next; skb2; skb2 = next) {
+			next = skb2->next;
+			h2 = (struct common_header *)skb2->data;
+			if (h2->sender_id == h->sender_id) {
+				saddr2 = skb_canonical_ipv6_saddr(skb2);
+				if (ipv6_addr_equal(&saddr, &saddr2)) {
+					*prev_link = skb2;
+					prev_link = &skb2->next;
+					continue;
+				}
+			}
+			*other_link = skb2;
+			other_link = &skb2->next;
+		}
+		*prev_link = NULL;
+		*other_link = NULL;
+#ifdef __UNIT_TEST__
+		UNIT_LOG("; ", "id %lld, offsets", homa_local_id(h->sender_id));
+		for (skb2 = packets; skb2; skb2 = skb2->next) {
+			struct data_header *h3 = (struct data_header *)
+					skb2->data;
+			UNIT_LOG("", " %d", ntohl(h3->seg.offset));
+		}
+#endif
+		homa_dispatch_pkts(packets, homa);
+		packets = other_pkts;
+	}
+
+	return 0;
+}
+
+/**
+ * homa_backlog_rcv() - Invoked to handle packets saved on a socket's
+ * backlog because it was locked when the packets first arrived.
+ * @sk:     Homa socket that owns the packet's destination port.
+ * @skb:    The incoming packet. This function takes ownership of the packet
+ *          (we'll delete it).
+ *
+ * Return:  Always returns 0.
+ */
+int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	pr_warn("unimplemented backlog_rcv invoked on Homa socket\n");
+	kfree_skb(skb);
+	return 0;
+}
+
+/**
+ * homa_err_handler_v4() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb:   The incoming packet.
+ * @info:  Information about the error that occurred?
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v4(struct sk_buff *skb, u32 info)
+{
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	const struct iphdr *iph = ip_hdr(skb);
+	int type = icmp_hdr(skb)->type;
+	int code = icmp_hdr(skb)->code;
+
+	if (type == ICMP_DEST_UNREACH && code == ICMP_PORT_UNREACH) {
+		char *icmp = (char *)icmp_hdr(skb);
+		struct common_header *h;
+
+		iph = (struct iphdr *)(icmp + sizeof(struct icmphdr));
+		h = (struct common_header *)(icmp + sizeof(struct icmphdr)
+				+ iph->ihl * 4);
+		homa_abort_rpcs(homa, &saddr, htons(h->dport), -ENOTCONN);
+	} else if (type == ICMP_DEST_UNREACH) {
+		int error;
+
+		if (code == ICMP_PROT_UNREACH)
+			error = -EPROTONOSUPPORT;
+		else
+			error = -EHOSTUNREACH;
+		homa_abort_rpcs(homa, &saddr, 0, error);
+	} else {
+		pr_notice("%s invoked with info %x, ICMP type %d, ICMP code %d\n",
+			  __func__, info, type, code);
+	}
+	return 0;
+}
+
+/**
+ * homa_err_handler_v6() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb:    The incoming packet.
+ * @opt:    Not used.
+ * @type:   Type of ICMP packet.
+ * @code:   Additional information about the error.
+ * @offset: Not used.
+ * @info:   Information about the error that occurred?
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v6(struct sk_buff *skb, struct inet6_skb_parm *opt,
+			u8 type,  u8 code,  int offset,  __be32 info)
+{
+	const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
+
+	if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_PORT_UNREACH) {
+		char *icmp = (char *)icmp_hdr(skb);
+		struct common_header *h;
+
+		iph = (struct ipv6hdr *)(icmp + sizeof(struct icmphdr));
+		h = (struct common_header *)(icmp + sizeof(struct icmphdr)
+				+ HOMA_IPV6_HEADER_LENGTH);
+		homa_abort_rpcs(homa, &iph->daddr, htons(h->dport), -ENOTCONN);
+	} else if (type == ICMPV6_DEST_UNREACH) {
+		int error;
+
+		if (code == ICMP_PROT_UNREACH)
+			error = -EPROTONOSUPPORT;
+		else
+			error = -EHOSTUNREACH;
+		homa_abort_rpcs(homa, &iph->daddr, 0, error);
+	}
+	return 0;
+}
+
+/**
+ * homa_poll() - Invoked by Linux as part of implementing select, poll,
+ * epoll, etc.
+ * @file:  Open file that is participating in a poll, select, etc.
+ * @sock:  A Homa socket, associated with @file.
+ * @wait:  This table will be registered with the socket, so that it
+ *         is notified when the socket's ready state changes.
+ *
+ * Return: A mask of bits such as EPOLLIN, which indicate the current
+ *         state of the socket.
+ */
+__poll_t homa_poll(struct file *file, struct socket *sock,
+		   struct poll_table_struct *wait)
+{
+	struct sock *sk = sock->sk;
+	__poll_t mask;
+
+	/* It seems to be standard practice for poll functions *not* to
+	 * acquire the socket lock, so we don't do it here; not sure
+	 * why...
+	 */
+
+	sock_poll_wait(file, sock, wait);
+	mask = POLLOUT | POLLWRNORM;
+
+	if (!list_empty(&homa_sk(sk)->ready_requests) ||
+	    !list_empty(&homa_sk(sk)->ready_responses))
+		mask |= POLLIN | POLLRDNORM;
+	return mask;
+}
+
+/**
+ * homa_hrtimer() - This function is invoked by the hrtimer mechanism to
+ * wake up the timer thread. Runs at IRQ level.
+ * @timer:   The timer that triggered; not used.
+ *
+ * Return:   Always HRTIMER_RESTART.
+ */
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer)
+{
+	wake_up_process(timer_kthread);
+	return HRTIMER_NORESTART;
+}
+
+/**
+ * homa_timer_main() - Top-level function for the timer thread.
+ * @transport:  Pointer to struct homa.
+ *
+ * Return:         Always 0.
+ */
+int homa_timer_main(void *transport)
+{
+	struct homa *homa = (struct homa *)transport;
+	struct hrtimer hrtimer;
+	ktime_t tick_interval;
+	u64 nsec;
+
+	hrtimer_init(&hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer.function = &homa_hrtimer;
+	nsec = 1000000;                   /* 1 ms */
+	tick_interval = ns_to_ktime(nsec);
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!exiting) {
+			hrtimer_start(&hrtimer, tick_interval, HRTIMER_MODE_REL);
+			schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+		if (exiting)
+			break;
+		homa_timer(homa);
+	}
+	hrtimer_cancel(&hrtimer);
+	kthread_complete_and_exit(&timer_thread_done, 0);
+	return 0;
+}
diff --git a/net/homa/homa_utils.c b/net/homa/homa_utils.c
new file mode 100644
index 000000000000..16cb1cd80717
--- /dev/null
+++ b/net/homa/homa_utils.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains miscellaneous utility functions for Homa, such
+ * as initializing and destroying homa structs.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+struct completion homa_pacer_kthread_done;
+
+/**
+ * homa_init() - Constructor for homa objects.
+ * @homa:   Object to initialize.
+ *
+ * Return:  0 on success, or a negative errno if there was an error. Even
+ *          if an error occurs, it is safe (and necessary) to call
+ *          homa_destroy at some point.
+ */
+int homa_init(struct homa *homa)
+{
+	int err;
+
+	_Static_assert(HOMA_MAX_PRIORITIES >= 8,
+		       "homa_init assumes at least 8 priority levels");
+
+	homa->pacer_kthread = NULL;
+	init_completion(&homa_pacer_kthread_done);
+	atomic64_set(&homa->next_outgoing_id, 2);
+	atomic64_set(&homa->link_idle_time, get_cycles());
+	spin_lock_init(&homa->pacer_mutex);
+	homa->pacer_fifo_fraction = 50;
+	homa->pacer_fifo_count = 1;
+	homa->pacer_wake_time = 0;
+	spin_lock_init(&homa->throttle_lock);
+	INIT_LIST_HEAD_RCU(&homa->throttled_rpcs);
+	homa->throttle_add = 0;
+	homa->throttle_min_bytes = 200;
+	homa->next_client_port = HOMA_MIN_DEFAULT_PORT;
+	homa->port_map = kmalloc(sizeof(*homa->port_map), GFP_KERNEL);
+	homa_socktab_init(homa->port_map);
+	homa->peers = kmalloc(sizeof(*homa->peers), GFP_KERNEL);
+	err = homa_peertab_init(homa->peers);
+	if (err) {
+		pr_err("Couldn't initialize peer table (errno %d)\n", -err);
+		return err;
+	}
+
+	/* Wild guesses to initialize configuration values... */
+	homa->unsched_bytes = 40000;
+	homa->window_param = 100000;
+	homa->link_mbps = 25000;
+	homa->fifo_grant_increment = 10000;
+	homa->grant_fifo_fraction = 50;
+	homa->max_overcommit = 8;
+	homa->max_incoming = 400000;
+	homa->max_rpcs_per_peer = 1;
+	homa->resend_ticks = 5;
+	homa->resend_interval = 5;
+	homa->timeout_ticks = 100;
+	homa->timeout_resends = 5;
+	homa->request_ack_ticks = 2;
+	homa->reap_limit = 10;
+	homa->dead_buffs_limit = 5000;
+	homa->max_dead_buffs = 0;
+	homa->pacer_kthread = kthread_run(homa_pacer_main, homa,
+					  "homa_pacer");
+	if (IS_ERR(homa->pacer_kthread)) {
+		err = PTR_ERR(homa->pacer_kthread);
+		homa->pacer_kthread = NULL;
+		pr_err("couldn't create homa pacer thread: error %d\n", err);
+		return err;
+	}
+	homa->pacer_exit = false;
+	homa->max_nic_queue_ns = 5000;
+	homa->cycles_per_kbyte = 0;
+	homa->max_gso_size = 10000;
+	homa->gso_force_software = 0;
+	homa->max_gro_skbs = 20;
+	homa->gro_policy = HOMA_GRO_NORMAL;
+	homa->busy_usecs = 100;
+	homa->gro_busy_usecs = 5;
+	homa->timer_ticks = 0;
+	homa->flags = 0;
+	homa->bpage_lease_usecs = 10000;
+	homa->next_id = 0;
+	homa_outgoing_sysctl_changed(homa);
+	homa_incoming_sysctl_changed(homa);
+	return 0;
+}
+
+/**
+ * homa_destroy() -  Destructor for homa objects.
+ * @homa:      Object to destroy.
+ */
+void homa_destroy(struct homa *homa)
+{
+	if (homa->pacer_kthread) {
+		homa_pacer_stop(homa);
+		wait_for_completion(&homa_pacer_kthread_done);
+	}
+
+	/* The order of the following 2 statements matters! */
+	homa_socktab_destroy(homa->port_map);
+	kfree(homa->port_map);
+	homa->port_map = NULL;
+	homa_peertab_destroy(homa->peers);
+	kfree(homa->peers);
+	homa->peers = NULL;
+}
+
+/**
+ * homa_spin() - Delay (without sleeping) for a given time interval.
+ * @ns:   How long to delay (in nanoseconds)
+ */
+void homa_spin(int ns)
+{
+	__u64 end;
+
+	end = get_cycles() + (ns * cpu_khz) / 1000000;
+	while (get_cycles() < end)
+		/* Empty loop body.*/
+		;
+}
+
+/**
+ * homa_throttle_lock_slow() - This function implements the slow path for
+ * acquiring the throttle lock. It is invoked when the lock isn't immediately
+ * available. It waits for the lock, but also records statistics about
+ * the waiting time.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_throttle_lock_slow(struct homa *homa)
+{
+	spin_lock_bh(&homa->throttle_lock);
+}

From patchwork Mon Oct 28 21:35:39 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Ousterhout <ouster@cs.stanford.edu>
X-Patchwork-Id: 13854181
X-Patchwork-Delegate: kuba@kernel.org
Received: from smtp1.cs.Stanford.EDU (smtp1.cs.stanford.edu [171.64.64.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2222C1EBA14
	for <netdev@vger.kernel.org>; Mon, 28 Oct 2024 21:36:25 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=171.64.64.25
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730151388; cv=none;
 b=gViMPZvh8++qeMnqFvyTyZKDE/lKxPnE+6DK9t+3R6nVREyBMRufaic5Z8mYw3AJEaIsYJP1aa/rV0Y3Ux7ZJsvWRmkqwDrt6Iald/121HfycBkPLVJAtvJD1+xk5lfHuEXY7gxuhjtpbQWC1kJd5tj1dDQX2wjYDBtVAepfiQU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730151388; c=relaxed/simple;
	bh=3FoIrL1BYiboyneRffpsHNS19haySGrc5oudXrSMnTk=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=Kbr7Oo7U2fKfbpMLYQJwgmP7piPMR/2A/CTjW8YnfT4i1JKiGjnihUilAk/06DyhoKsy0dEVS2oO4vp4J6FCQA7bktbIUBcSr9YGmev/KWHrHfjh9NbaSnoo8pHoOFder4TYRHUhvr9JuQM4TX1mx52OJhfjdD9Qd/VuE+QB0V0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu;
 spf=pass smtp.mailfrom=cs.stanford.edu;
 dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b=UY6630/y; arc=none smtp.client-ip=171.64.64.25
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=cs.stanford.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cs.stanford.edu header.i=@cs.stanford.edu
 header.b="UY6630/y"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=cs.stanford.edu; s=cs2308; h=Content-Transfer-Encoding:MIME-Version:
	References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:
	Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
	List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=81hCTWFa1xsaHq5MhPO4AD1lFB4/xda8wfjd7lC8EH4=; t=1730151386; x=1731015386;
	b=UY6630/yrd7lUtP3e2jtUMQPjwnIcb47WQWPvXVmKg0VgLaM+CwIkjOQA/2qMZTNbVyC8kMZ16p
	09GXQAlcA4JgcQo8jVuz7oBv1/Zj5Tb+1jQXRJOPfrnq8VKDH9VyjUnol+KinGqorLOF1lBo4ScVH
	QhygpOYbJVO3PYBXdaWj+uJtBkHb1wWvwhG3Om7vMFwZ5Kte61Ygo2RD8OMD/AfGC8tvBToBnkTbk
	h67CEKdWT3/jriKguK+CJq1IKcx4h37zyRWzaTA9w9O+pom1wYFHlttrqzwG+uddfgw052M52z/z/
	WQOK7dhMK5l+APKPV6TBNzV10o5rYF+Do7MA==;
Received: from ouster2016.stanford.edu ([172.24.72.71]:54106
 helo=localhost.localdomain)
	by smtp1.cs.Stanford.EDU with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.94.2)
	(envelope-from <ouster@cs.stanford.edu>)
	id 1t5XPU-0005xj-TN; Mon, 28 Oct 2024 14:36:25 -0700
From: John Ousterhout <ouster@cs.stanford.edu>
To: netdev@vger.kernel.org
Cc: John Ousterhout <ouster@cs.stanford.edu>
Subject: [PATCH net-next 12/12] net: homa: create Makefile and Kconfig
Date: Mon, 28 Oct 2024 14:35:39 -0700
Message-ID: <20241028213541.1529-13-ouster@cs.stanford.edu>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20241028213541.1529-1-ouster@cs.stanford.edu>
References: <20241028213541.1529-1-ouster@cs.stanford.edu>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -101.0
X-Scan-Signature: f3ece08579cbcd970dc3d33c6519ba8c
X-Patchwork-Delegate: kuba@kernel.org

Before this commit the Homa code is "inert": it won't be compiled
in kernel builds. This commit adds Homa's Makefile and Kconfig, and
also links Homa into net/Makefile and net/Kconfig, so that Homa
will be built during kernel builds if enabled (it is disabled by
default).

This commit also adds an entry in the MAINTAINERS file.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 MAINTAINERS       |  7 +++++++
 net/Kconfig       |  1 +
 net/Makefile      |  1 +
 net/homa/Kconfig  | 17 +++++++++++++++++
 net/homa/Makefile | 14 ++++++++++++++
 5 files changed, 40 insertions(+)
 create mode 100644 net/homa/Kconfig
 create mode 100644 net/homa/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index 1389704c7d8d..935d1e995018 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10391,6 +10391,13 @@ F:	lib/test_hmm*
 F:	mm/hmm*
 F:	tools/testing/selftests/mm/*hmm*
 
+HOMA TRANSPORT PROTOCOL
+M:	John Ousterhout <ouster@cs.stanford.edu>
+S:	Maintained
+W:	https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
+F:	include/uapi/linux/homa.h
+F:	net/homa/
+
 HONEYWELL HSC030PA PRESSURE SENSOR SERIES IIO DRIVER
 M:	Petre Rodan <petre.rodan@subdimension.ro>
 L:	linux-iio@vger.kernel.org
diff --git a/net/Kconfig b/net/Kconfig
index a629f92dc86b..ca8551c1a226 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -244,6 +244,7 @@ endif
 
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
+source "net/homa/Kconfig"
 source "net/rds/Kconfig"
 source "net/tipc/Kconfig"
 source "net/atm/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 65bb8c72a35e..18fa3c323187 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -44,6 +44,7 @@ obj-y				+= 8021q/
 endif
 obj-$(CONFIG_IP_DCCP)		+= dccp/
 obj-$(CONFIG_IP_SCTP)		+= sctp/
+obj-$(CONFIG_HOMA)              += homa/
 obj-$(CONFIG_RDS)		+= rds/
 obj-$(CONFIG_WIRELESS)		+= wireless/
 obj-$(CONFIG_MAC80211)		+= mac80211/
diff --git a/net/homa/Kconfig b/net/homa/Kconfig
new file mode 100644
index 000000000000..bb98e96d8f66
--- /dev/null
+++ b/net/homa/Kconfig
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Homa transport protocol
+#
+
+menuconfig HOMA
+	tristate "The Homa transport protocol"
+
+	help
+	Homa is a network transport protocol for communication within
+	a datacenter. It provides significantly lower latency than TCP,
+	particularly for workloads containing a mixture of large and small
+	messages operating at high network utilization. For more information
+	see the homa(7) man page or checkout the Homa Wiki at
+	https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview.
+
+	If unsure, say N.
diff --git a/net/homa/Makefile b/net/homa/Makefile
new file mode 100644
index 000000000000..3eb192a6ffa6
--- /dev/null
+++ b/net/homa/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Makefile for the Linux implementation of the Homa transport protocol.
+
+obj-$(CONFIG_HOMA) := homa.o
+homa-y:=        homa_incoming.o \
+		homa_outgoing.o \
+		homa_peer.o \
+		homa_pool.o \
+		homa_plumbing.o \
+		homa_rpc.o \
+		homa_sock.o \
+		homa_timer.o \
+		homa_utils.o