[net-next,04/12] net: homa: create homa_pool.h and homa_pool.c

Message ID 20241028213541.1529-5-ouster@cs.stanford.edu (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series Begin upstreaming Homa transport protocol

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 4 maintainers not CCed: horms@kernel.org kuba@kernel.org pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 4 this patch: 4
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch warning WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
WARNING: line length of 81 exceeds 80 columns
WARNING: line length of 83 exceeds 80 columns
WARNING: line length of 84 exceeds 80 columns
WARNING: line length of 86 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc fail Errors and warnings before: 0 this patch: 2
netdev/source_inline fail Was 0 now: 1

Commit Message

John Ousterhout Oct. 28, 2024, 9:35 p.m. UTC
These files implement Homa's mechanism for managing application-level
buffer space for incoming messages. This mechanism is needed to allow
Homa to copy data out to user space in parallel with receiving packets;
it was discussed in a talk at NetDev 0x17.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/homa/homa_pool.c | 434 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_pool.h | 152 +++++++++++++++
 2 files changed, 586 insertions(+)
 create mode 100644 net/homa/homa_pool.c
 create mode 100644 net/homa/homa_pool.h

Comments

Andrew Lunn Oct. 30, 2024, 12:09 a.m. UTC | #1
On Mon, Oct 28, 2024 at 02:35:31PM -0700, John Ousterhout wrote:
> These files implement Homa's mechanism for managing application-level
> buffer space for incoming messages. This mechanism is needed to allow
> Homa to copy data out to user space in parallel with receiving packets;
> it was discussed in a talk at NetDev 0x17.
> 
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> ---
>  net/homa/homa_pool.c | 434 +++++++++++++++++++++++++++++++++++++++++++
>  net/homa/homa_pool.h | 152 +++++++++++++++
>  2 files changed, 586 insertions(+)
>  create mode 100644 net/homa/homa_pool.c
>  create mode 100644 net/homa/homa_pool.h
> 
> diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
> new file mode 100644
> index 000000000000..6c59f659fba1
> --- /dev/null
> +++ b/net/homa/homa_pool.c
> @@ -0,0 +1,434 @@
> +// SPDX-License-Identifier: BSD-2-Clause
> +
> +#include "homa_impl.h"
> +#include "homa_pool.h"
> +
> +/* This file contains functions that manage user-space buffer pools. */
> +
> +/* Pools must always have at least this many bpages (no particular
> + * reasoning behind this value).
> + */
> +#define MIN_POOL_SIZE 2
> +
> +/* Used when determining how many bpages to consider for allocation. */
> +#define MIN_EXTRA 4
> +
> +/* When running unit tests, allow HOMA_BPAGE_SIZE and HOMA_BPAGE_SHIFT
> + * to be overridden.
> + */
> +#ifdef __UNIT_TEST__
> +#include "mock.h"
> +#undef HOMA_BPAGE_SIZE
> +#define HOMA_BPAGE_SIZE mock_bpage_size
> +#undef HOMA_BPAGE_SHIFT
> +#define HOMA_BPAGE_SHIFT mock_bpage_shift
> +#endif
> +
> +/**
> + * set_bpages_needed() - Set the bpages_needed field of @pool based
> + * on the length of the first RPC that's waiting for buffer space.
> + * The caller must own the lock for @pool->hsk.
> + * @pool: Pool to update.
> + */
> +static inline void set_bpages_needed(struct homa_pool *pool)

Generally we say no inline functions in .c files, let the compiler
decide. If you have some micro benchmark indicating the compiler is
getting it wrong, we will then consider allowing it.

It would be good if somebody who knows about the page pool took a look
at this code. Could the page pool be used as a basis?

   Andrew
John Ousterhout Oct. 30, 2024, 4:15 a.m. UTC | #2
On Tue, Oct 29, 2024 at 5:09 PM Andrew Lunn <andrew@lunn.ch> wrote:

> > +static inline void set_bpages_needed(struct homa_pool *pool)
>
> Generally we say no inline functions in .c files, let the compiler
> decide. If you have some micro benchmark indicating the compiler is
> getting it wrong, we will then consider allowing it.

I didn't realize that the compiler will inline automatically; I will
remove "inline" from all of the .c files.

> It would be good if somebody who knows about the page pool took a look
> at this code. Could the page pool be used as a basis?

I think this is a different problem from what page pools solve. Rather
than the application providing a buffer each time it calls recvmsg, it
provides a large region of memory in its virtual address space in
advance; Homa allocates buffer space from that region so that it can
begin copying data to user space before recvmsg is invoked. More
precisely, Homa doesn't know which message will be returned as the
response to a recvmsg until a message actually completes (the first
one to complete will be returned). Before this mechanism was
implemented, Homa couldn't start copying data to user space until the
last packet of the message had been received, which serialized the
copy with network transmission and reduced throughput dramatically. If
I understand page pools correctly, I don't think they would be
beneficial here.

-John-
Andrew Lunn Oct. 30, 2024, 12:54 p.m. UTC | #3
On Tue, Oct 29, 2024 at 09:15:09PM -0700, John Ousterhout wrote:
> On Tue, Oct 29, 2024 at 5:09 PM Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > +static inline void set_bpages_needed(struct homa_pool *pool)
> >
> > Generally we say no inline functions in .c files, let the compiler
> > decide. If you have some micro benchmark indicating the compiler is
> > getting it wrong, we will then consider allowing it.
> 
> I didn't realize that the compiler will inline automatically; I will
> remove "inline" from all of the .c files.
> 
> > It would be good if somebody who knows about the page pool took a look
> > at this code. Could the page pool be used as a basis?
> 
> I think this is a different problem from what page pools solve. Rather
> than the application providing a buffer each time it calls recvmsg, it
> provides a large region of memory in its virtual address space in
> advance;

Ah, O.K. Yes, page pool is for kernel memory. However, is the virtual
address space mapped to pages and pinned? Or do you allocate pages
into that VM range as you need them? And then free them once the
application says it has completed? If you are allocating and freeing
pages, the page pool might be useful for these allocations.

Taking a step back here, the kernel already has a number of allocators
and ideally we don't want to add yet another one unless it is really
required. So it would be good to get some reviews from the MM people.

	Andrew
John Ousterhout Oct. 30, 2024, 3:48 p.m. UTC | #4
(resending... forgot to cc netdev in the original response)

On Wed, Oct 30, 2024 at 5:54 AM Andrew Lunn <andrew@lunn.ch> wrote:

> > I think this is a different problem from what page pools solve. Rather
> > than the application providing a buffer each time it calls recvmsg, it
> > provides a large region of memory in its virtual address space in
> > advance;
>
> Ah, O.K. Yes, page pool is for kernel memory. However, is the virtual
> address space mapped to pages and pinned? Or do you allocate pages
> into that VM range as you need them? And then free them once the
> application says it has completed? If you are allocating and freeing
> pages, the page pool might be useful for these allocations.

Homa doesn't allocate or free pages for this: the application mmap's a
region and passes the virtual address range to Homa. Homa doesn't need
to pin the pages. This memory is used in a fashion similar to how a
buffer passed to recvmsg would be used, except that Homa maintains
access to the region for the lifetime of the associated socket. When
the application finishes processing an incoming message, it notifies
Homa so that Homa can reuse the message's buffer space for future
messages; there's no page allocation or freeing in this process.

> Taking a step back here, the kernel already has a number of allocators
> and ideally we don't want to add yet another one unless it is really
> required. So it would be good to get some reviews from the MM people.

I'm happy to do that if you still think it's necessary; how do I do that?

-John-
John Ousterhout Oct. 30, 2024, 8:13 p.m. UTC | #5
Hi Andrew,

Andrew Lunn suggested that I write to you in the hopes that you could
identify someone to review this patch series from the standpoint of
memory management:

https://patchwork.kernel.org/project/netdevbpf/list/?series=903993&state=*

The patch series introduces a new transport protocol, Homa, into the
kernel. Homa uses an unusual approach for managing receive buffers for
incoming messages. The traditional approach, where the application
provides a receive buffer in the recvmsg kernel call, results in poor
performance for Homa because it prevents Homa from overlapping the
copying of data to user space with receiving data over the network.
Instead, a Homa application mmaps a large region of memory and passes
its virtual address range to Homa. Homa associates the memory with a
particular socket, retains it for the life of the socket, and
allocates buffer space for incoming messages out of this region. The
recvmsg kernel call returns the location of the buffer(s), and can
also be used to return buffers back to Homa once the application has
finished processing messages. The code for managing this buffer space
is in the files homa_pool.c and homa_pool.h.
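
To make this concrete, a minimal user-space sketch of the flow is below.
The protocol number, option name, and argument struct are placeholders
invented for illustration (the real definitions appear elsewhere in the
series); only the mmap/register/recvmsg sequence comes from the
description above.

#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Placeholder names, for illustration only. */
#define IPPROTO_HOMA	0xFD
#define SO_HOMA_SET_BUF	10
struct homa_buf_args {
	void *start;	/* first byte of the mmapped region */
	size_t length;	/* total bytes in the region */
};

int setup_recv_buffers(int fd)
{
	/* The application reserves a large region in its own virtual
	 * address space; mmap returns a page-aligned address (as
	 * homa_pool_init requires), and physical pages are faulted in
	 * lazily as Homa copies message data into the region.
	 */
	struct homa_buf_args args;

	args.length = 64UL << 20;	/* e.g. 64 MB of bpages */
	args.start = mmap(NULL, args.length, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (args.start == MAP_FAILED)
		return -1;

	/* Hand the region to Homa once; it is retained for the
	 * lifetime of the socket, and recvmsg() then reports where in
	 * the region each incoming message was placed (and can return
	 * buffers for reuse once the application is done with them).
	 */
	return setsockopt(fd, IPPROTO_HOMA, SO_HOMA_SET_BUF,
			  &args, sizeof(args));
}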

I gave a talk on this mechanism at NetDev last year, which may be
useful to provide more background. Slides and video are available
here:

https://netdevconf.info/0x17/sessions/talk/kernel-managed-user-buffers-in-homa.html

Thanks in advance for any help you can provide.

-John-

On Wed, Oct 30, 2024 at 9:03 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Oct 30, 2024 at 08:46:33AM -0700, John Ousterhout wrote:
> > On Wed, Oct 30, 2024 at 5:54 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > > > I think this is a different problem from what page pools solve. Rather
> > > > than the application providing a buffer each time it calls recvmsg, it
> > > > provides a large region of memory in its virtual address space in
> > > > advance;
> > >
> > > Ah, O.K. Yes, page pool is for kernel memory. However, is the virtual
> > > address space mapped to pages and pinned? Or do you allocate pages
> > > into that VM range as you need them? And then free them once the
> > > application says it has completed? If you are allocating and freeing
> > > pages, the page pool might be useful for these allocations.
> >
> > Homa doesn't allocate or free pages for this: the application mmap's a
> > region and passes the virtual address range to Homa. Homa doesn't need
> > to pin the pages. This memory is used in a fashion similar to how a
> > buffer passed to recvmsg would be used, except that Homa maintains
> > access to the region for the lifetime of the associated socket. When
> > the application finishes processing an incoming message, it notifies
> > Homa so that Homa can reuse the message's buffer space for future
> > messages; there's no page allocation or freeing in this process.
>
> I clearly don't know enough about memory management! I would have
> expected the kernel to do lazy allocation of pages to VM addresses as
> needed. Maybe it is, and when you actually access one of these missing
> pages, you get a page fault and the MM code is kicking in to put an
> actual page there? This could all be hidden inside the copy_to_user()
> call.
>
> > > Taking a step back here, the kernel already has a number of allocators
> > > and ideally we don't want to add yet another one unless it is really
> > > required. So it would be good to get some reviews from the MM people.
> >
> > I'm happy to do that if you still think it's necessary; how do I do that?
>
> Reach out to Andrew Morton <akpm@linux-foundation.org>, the main
> Memory Management Maintainer. Ask who a good person would be to review
> this code.
>
>         Andrew
John Ousterhout Oct. 30, 2024, 8:17 p.m. UTC | #6
On Wed, Oct 30, 2024 at 9:03 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Oct 30, 2024 at 08:46:33AM -0700, John Ousterhout wrote:
> > On Wed, Oct 30, 2024 at 5:54 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > > > I think this is a different problem from what page pools solve. Rather
> > > > than the application providing a buffer each time it calls recvmsg, it
> > > > provides a large region of memory in its virtual address space in
> > > > advance;
> > >
> > > Ah, O.K. Yes, page pool is for kernel memory. However, is the virtual
> > > address space mapped to pages and pinned? Or do you allocate pages
> > > into that VM range as you need them? And then free them once the
> > > application says it has completed? If you are allocating and freeing
> > > pages, the page pool might be useful for these allocations.
> >
> > Homa doesn't allocate or free pages for this: the application mmap's a
> > region and passes the virtual address range to Homa. Homa doesn't need
> > to pin the pages. This memory is used in a fashion similar to how a
> > buffer passed to recvmsg would be used, except that Homa maintains
> > access to the region for the lifetime of the associated socket. When
> > the application finishes processing an incoming message, it notifies
> > Homa so that Homa can reuse the message's buffer space for future
> > messages; there's no page allocation or freeing in this process.
>
> I clearly don't know enough about memory management! I would have
> expected the kernel to do lazy allocation of pages to VM addresses as
> needed. Maybe it is, and when you actually access one of these missing
> pages, you get a page fault and the MM code is kicking in to put an
> actual page there? This could all be hidden inside the copy_to_user()
> call.

Yes, this is all correct.  MM code gets called during copy_to_user to
allocate pages as needed (I should have been more clear: Homa doesn't
allocate or free pages directly). Homa tries to be clever about using
the buffer region to minimize the number of physical pages that
actually need to be allocated (it tries to allocate at the beginning
of the region, only using higher addresses when the lower addresses
are in use).
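
As a concrete illustration of that heuristic (a restatement of the
"limit" computation in homa_pool_get_pages in the patch below, with
variable names changed for clarity): in a pool of 100 bpages of which
60 are free,

	in_use = num_bpages - free_bpages;       /* 100 - 60 = 40 */
	extra  = in_use >> 2;                    /*  40 / 4  = 10 */
	limit  = in_use + max(extra, MIN_EXTRA); /*  40 + 10 = 50 */

so only bpages with indexes below 50 are considered as candidates,
which keeps allocations clustered at the low end of the region and
bounds how many descriptors the search has to touch.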

> > > Taking a step back here, the kernel already has a number of allocators
> > > and ideally we don't want to add yet another one unless it is really
> > > required. So it would be good to get some reviews from the MM people.
> >
> > I'm happy to do that if you still think it's necessary; how do I do that?
>
> Reach out to Andrew Morton <akpm@linux-foundation.org>, the main
> Memory Management Maintainer. Ask who a good person would be to review
> this code.

I have started this process in a separate email.

-John-
Przemek Kitszel Nov. 4, 2024, 1:12 p.m. UTC | #7
On 10/30/24 16:48, John Ousterhout wrote:
> (resending... forgot to cc netdev in the original response)
> 
> On Wed, Oct 30, 2024 at 5:54 AM Andrew Lunn <andrew@lunn.ch> wrote:
> 
>>> I think this is a different problem from what page pools solve. Rather
>>> than the application providing a buffer each time it calls recvmsg, it
>>> provides a large region of memory in its virtual address space in
>>> advance;
>>
>> Ah, O.K. Yes, page pool is for kernel memory. However, is the virtual
>> address space mapped to pages and pinned? Or do you allocate pages
>> into that VM range as you need them? And then free them once the
>> application says it has completed? If you are allocating and freeing
>> pages, the page pool might be useful for these allocations.
> 
> Homa doesn't allocate or free pages for this: the application mmap's a
> region and passes the virtual address range to Homa. Homa doesn't need
> to pin the pages. This memory is used in a fashion similar to how a
> buffer passed to recvmsg would be used, except that Homa maintains
> access to the region for the lifetime of the associated socket. When
> the application finishes processing an incoming message, it notifies
> Homa so that Homa can reuse the message's buffer space for future
> messages; there's no page allocation or freeing in this process.

nice, and I see the obvious performance gains that this approach yields.
Would it be possible, instead of the user calling mmap(), for applications
to ask Homa to preallocate the region? That way it would be harder to
unmap it prematurely, and porting apps between OSes would be easier too.

> 
>> Taking a step back here, the kernel already has a number of allocators
>> and ideally we don't want to add yet another one unless it is really
>> required. So it would be good to get some reviews from the MM people.
> 
> I'm happy to do that if you still think it's necessary; how do I do that?
> 
> -John-
>
John Ousterhout Nov. 4, 2024, 11:57 p.m. UTC | #8
On Mon, Nov 4, 2024 at 5:13 AM Przemek Kitszel
<przemyslaw.kitszel@intel.com> wrote:
> >
> > Homa doesn't allocate or free pages for this: the application mmap's a
> > region and passes the virtual address range to Homa. Homa doesn't need
> > to pin the pages. This memory is used in a fashion similar to how a
> > buffer passed to recvmsg would be used, except that Homa maintains
> > access to the region for the lifetime of the associated socket. When
> > the application finishes processing an incoming message, it notifies
> > Homa so that Homa can reuse the message's buffer space for future
> > messages; there's no page allocation or freeing in this process.
>
> nice, and I see the obvious performance gains that this approach yields,
> would it be possible to instead of mmap() from user, they will ask Homa
> to prealloc? That way it will be harder to unmap prematurely, and easier
> to port apps between OSes too.

Can you say a bit more about how your proposal would work? I'm not
familiar enough with Linux internals to know how Homa could populate a
region of address space on behalf of the user.

Having the application mmap the region is pretty clean and simple and
gives the application total control over the type and placement of the
region (it doesn't necessarily even have to be a separate mmapped
region). If the application messes things up by unmapping prematurely,
it won't cause any problems for Homa: this will be detected when Homa
tries to copy info out to the bogus addresses, just as would happen
with a bogus buffer in a traditional approach to buffer management.
And, the mmapping is likely to be done automatically by a user-space
library, so it shouldn't impose additional complexity on the main
application.

-John-

Patch

diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
new file mode 100644
index 000000000000..6c59f659fba1
--- /dev/null
+++ b/net/homa/homa_pool.c
@@ -0,0 +1,434 @@ 
+// SPDX-License-Identifier: BSD-2-Clause
+
+#include "homa_impl.h"
+#include "homa_pool.h"
+
+/* This file contains functions that manage user-space buffer pools. */
+
+/* Pools must always have at least this many bpages (no particular
+ * reasoning behind this value).
+ */
+#define MIN_POOL_SIZE 2
+
+/* Used when determining how many bpages to consider for allocation. */
+#define MIN_EXTRA 4
+
+/* When running unit tests, allow HOMA_BPAGE_SIZE and HOMA_BPAGE_SHIFT
+ * to be overridden.
+ */
+#ifdef __UNIT_TEST__
+#include "mock.h"
+#undef HOMA_BPAGE_SIZE
+#define HOMA_BPAGE_SIZE mock_bpage_size
+#undef HOMA_BPAGE_SHIFT
+#define HOMA_BPAGE_SHIFT mock_bpage_shift
+#endif
+
+/**
+ * set_bpages_needed() - Set the bpages_needed field of @pool based
+ * on the length of the first RPC that's waiting for buffer space.
+ * The caller must own the lock for @pool->hsk.
+ * @pool: Pool to update.
+ */
+static inline void set_bpages_needed(struct homa_pool *pool)
+{
+	struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+			struct homa_rpc, buf_links);
+	pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
+			>> HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_pool_init() - Initialize a homa_pool; any previous contents of the
+ * objects are overwritten.
+ * @hsk:          Socket containing the pool to initialize.
+ * @region:       First byte of the memory region for the pool, allocated
+ *                by the application; must be page-aligned.
+ * @region_size:  Total number of bytes available at @region.
+ * Return: Either zero (for success) or a negative errno for failure.
+ */
+int homa_pool_init(struct homa_sock *hsk, void *region, __u64 region_size)
+{
+	struct homa_pool *pool = hsk->buffer_pool;
+	int i, result;
+
+	if (((__u64)region) & ~PAGE_MASK)
+		return -EINVAL;
+	pool->hsk = hsk;
+	pool->region = (char *)region;
+	pool->num_bpages = region_size >> HOMA_BPAGE_SHIFT;
+	pool->descriptors = NULL;
+	pool->cores = NULL;
+	if (pool->num_bpages < MIN_POOL_SIZE) {
+		result = -EINVAL;
+		goto error;
+	}
+	pool->descriptors = kmalloc_array(pool->num_bpages,
+					  sizeof(struct homa_bpage), GFP_ATOMIC);
+	if (!pool->descriptors) {
+		result = -ENOMEM;
+		goto error;
+	}
+	for (i = 0; i < pool->num_bpages; i++) {
+		struct homa_bpage *bp = &pool->descriptors[i];
+
+		spin_lock_init(&bp->lock);
+		atomic_set(&bp->refs, 0);
+		bp->owner = -1;
+		bp->expiration = 0;
+	}
+	atomic_set(&pool->free_bpages, pool->num_bpages);
+	pool->bpages_needed = INT_MAX;
+
+	/* Allocate and initialize core-specific data. */
+	pool->cores = kmalloc_array(nr_cpu_ids, sizeof(struct homa_pool_core),
+				    GFP_ATOMIC);
+	if (!pool->cores) {
+		result = -ENOMEM;
+		goto error;
+	}
+	pool->num_cores = nr_cpu_ids;
+	for (i = 0; i < pool->num_cores; i++) {
+		pool->cores[i].page_hint = 0;
+		pool->cores[i].allocated = 0;
+		pool->cores[i].next_candidate = 0;
+	}
+	pool->check_waiting_invoked = 0;
+
+	return 0;
+
+error:
+	kfree(pool->descriptors);
+	kfree(pool->cores);
+	pool->region = NULL;
+	return result;
+}
+
+/**
+ * homa_pool_destroy() - Destructor for homa_pool. After this method
+ * returns, the object should not be used unless it has been reinitialized.
+ * @pool: Pool to destroy.
+ */
+void homa_pool_destroy(struct homa_pool *pool)
+{
+	if (!pool->region)
+		return;
+	kfree(pool->descriptors);
+	kfree(pool->cores);
+	pool->region = NULL;
+}
+
+/**
+ * homa_pool_get_pages() - Allocate one or more full pages from the pool.
+ * @pool:         Pool from which to allocate pages
+ * @num_pages:    Number of pages needed
+ * @pages:        The indices of the allocated pages are stored here; caller
+ *                must ensure this array is big enough. Reference counts have
+ *                been set to 1 on all of these pages (or 2 if set_owner
+ *                was specified).
+ * @set_owner:    If nonzero, the current core is marked as owner of all
+ *                of the allocated pages (and the expiration time is also
+ *                set). Otherwise the pages are left unowned.
+ * Return: 0 for success, -1 if there wasn't enough free space in the pool.
+ */
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
+			int set_owner)
+{
+	int core_num = raw_smp_processor_id();
+	struct homa_pool_core *core;
+	__u64 now = get_cycles();
+	int alloced = 0;
+	int limit = 0;
+
+	core = &pool->cores[core_num];
+	if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
+		atomic_add(num_pages, &pool->free_bpages);
+		return -1;
+	}
+
+	/* Once we get to this point we know we will be able to find
+	 * enough free pages; now we just have to find them.
+	 */
+	while (alloced != num_pages) {
+		struct homa_bpage *bpage;
+		int cur, ref_count;
+
+		/* If we don't need to use all of the bpages in the pool,
+		 * then try to use only the ones with low indexes. This
+		 * will reduce the cache footprint for the pool by reusing
+		 * a few bpages over and over. Specifically this code will
+		 * not consider any candidate page whose index is >= limit.
+		 * Limit is chosen to make sure there are a reasonable
+		 * number of free pages in the range, so we won't have to
+		 * check a huge number of pages.
+		 */
+		if (limit == 0) {
+			int extra;
+
+			limit = pool->num_bpages
+					- atomic_read(&pool->free_bpages);
+			extra = limit >> 2;
+			limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
+			if (limit > pool->num_bpages)
+				limit = pool->num_bpages;
+		}
+
+		cur = core->next_candidate;
+		core->next_candidate++;
+		if (cur >= limit) {
+			core->next_candidate = 0;
+
+			/* Must recompute the limit for each new loop through
+			 * the bpage array: we may need to consider a larger
+			 * range of pages because of concurrent allocations.
+			 */
+			limit = 0;
+			continue;
+		}
+		bpage = &pool->descriptors[cur];
+
+		/* Figure out whether this candidate is free (or can be
+		 * stolen). Do a quick check without locking the page, and
+		 * if the page looks promising, then lock it and check again
+		 * (must check again in case someone else snuck in and
+		 * grabbed the page).
+		 */
+		ref_count = atomic_read(&bpage->refs);
+		if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
+							  bpage->expiration > now)))
+			continue;
+		if (!spin_trylock_bh(&bpage->lock))
+			continue;
+		ref_count = atomic_read(&bpage->refs);
+		if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
+							  bpage->expiration > now))) {
+			spin_unlock_bh(&bpage->lock);
+			continue;
+		}
+		if (bpage->owner >= 0)
+			atomic_inc(&pool->free_bpages);
+		if (set_owner) {
+			atomic_set(&bpage->refs, 2);
+			bpage->owner = core_num;
+			bpage->expiration = now
+					+ pool->hsk->homa->bpage_lease_cycles;
+		} else {
+			atomic_set(&bpage->refs, 1);
+			bpage->owner = -1;
+		}
+		spin_unlock_bh(&bpage->lock);
+		pages[alloced] = cur;
+		alloced++;
+	}
+	return 0;
+}
+
+/**
+ * homa_pool_allocate() - Allocate buffer space for an RPC.
+ * @rpc:  RPC that needs space allocated for its incoming message (space must
+ *        not already have been allocated). The fields @msgin->num_buffers
+ *        and @msgin->buffers are filled in. Must be locked by caller.
+ * Return: The return value is normally 0, which means either buffer space
+ * was allocated or the @rpc was queued on @hsk->waiting. If a fatal error
+ * occurred, such as no buffer pool present, then a negative errno is
+ * returned.
+ */
+int homa_pool_allocate(struct homa_rpc *rpc)
+{
+	struct homa_pool *pool = rpc->hsk->buffer_pool;
+	int full_pages, partial, i, core_id;
+	__u32 pages[HOMA_MAX_BPAGES];
+	struct homa_pool_core *core;
+	struct homa_bpage *bpage;
+	__u64 now = get_cycles();
+	struct homa_rpc *other;
+
+	if (!pool->region)
+		return -ENOMEM;
+
+	/* First allocate any full bpages that are needed. */
+	full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
+	if (unlikely(full_pages)) {
+		if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
+			goto out_of_space;
+		for (i = 0; i < full_pages; i++)
+			rpc->msgin.bpage_offsets[i] = pages[i] << HOMA_BPAGE_SHIFT;
+	}
+	rpc->msgin.num_bpages = full_pages;
+
+	/* The last chunk may be less than a full bpage; for this we use
+	 * the bpage that we own (and reuse it for multiple messages).
+	 */
+	partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
+	if (unlikely(partial == 0))
+		goto success;
+	core_id = raw_smp_processor_id();
+	core = &pool->cores[core_id];
+	bpage = &pool->descriptors[core->page_hint];
+	if (!spin_trylock_bh(&bpage->lock))
+		spin_lock_bh(&bpage->lock);
+	if (bpage->owner != core_id) {
+		spin_unlock_bh(&bpage->lock);
+		goto new_page;
+	}
+	if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
+		if (atomic_read(&bpage->refs) == 1) {
+			/* Bpage is totally free, so we can reuse it. */
+			core->allocated = 0;
+		} else {
+			bpage->owner = -1;
+
+			/* We know the reference count can't reach zero here
+			 * because of check above, so we won't have to decrement
+			 * pool->free_bpages.
+			 */
+			atomic_dec_return(&bpage->refs);
+			spin_unlock_bh(&bpage->lock);
+			goto new_page;
+		}
+	}
+	bpage->expiration = now + pool->hsk->homa->bpage_lease_cycles;
+	atomic_inc(&bpage->refs);
+	spin_unlock_bh(&bpage->lock);
+	goto allocate_partial;
+
+	/* Can't use the current page; get another one. */
+new_page:
+	if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
+		homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
+					  rpc->msgin.bpage_offsets);
+		rpc->msgin.num_bpages = 0;
+		goto out_of_space;
+	}
+	core->page_hint = pages[0];
+	core->allocated = 0;
+
+allocate_partial:
+	rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
+			+ (core->page_hint << HOMA_BPAGE_SHIFT);
+	rpc->msgin.num_bpages++;
+	core->allocated += partial;
+
+success:
+	return 0;
+
+	/* We get here if there wasn't enough buffer space for this
+	 * message; add the RPC to hsk->waiting_for_bufs.
+	 */
+out_of_space:
+	homa_sock_lock(pool->hsk, "homa_pool_allocate");
+	list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
+		if (other->msgin.length > rpc->msgin.length) {
+			list_add_tail(&rpc->buf_links, &other->buf_links);
+			goto queued;
+		}
+	}
+	list_add_tail_rcu(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
+
+queued:
+	set_bpages_needed(pool);
+	homa_sock_unlock(pool->hsk);
+	return 0;
+}
+
+/**
+ * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
+ * message data.
+ * @rpc:        RPC for which incoming message data is being processed; its
+ *              msgin must be properly initialized and buffer space must have
+ *              been allocated for the message.
+ * @offset:     Offset within @rpc's incoming message.
+ * @available:  Will be filled in with the number of bytes of space available
+ *              at the returned address.
+ * Return:      The application's virtual address for buffer space corresponding
+ *              to @offset in the incoming message for @rpc.
+ */
+void *homa_pool_get_buffer(struct homa_rpc *rpc, int offset, int *available)
+{
+	int bpage_index, bpage_offset;
+
+	bpage_index = offset >> HOMA_BPAGE_SHIFT;
+	BUG_ON(bpage_index >= rpc->msgin.num_bpages);
+	bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
+	*available = (bpage_index < (rpc->msgin.num_bpages - 1))
+			? HOMA_BPAGE_SIZE - bpage_offset
+			: rpc->msgin.length - offset;
+	return rpc->hsk->buffer_pool->region + rpc->msgin.bpage_offsets[bpage_index]
+			+ bpage_offset;
+}
+
+/**
+ * homa_pool_release_buffers() - Release buffer space so that it can be
+ * reused.
+ * @pool:         Pool that the buffer space belongs to. Doesn't need to
+ *                be locked.
+ * @num_buffers:  How many buffers to release.
+ * @buffers:      Points to @num_buffers values, each of which is an offset
+ *                from the start of the pool to the buffer to be released.
+ */
+void homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
+			       __u32 *buffers)
+{
+	int i;
+
+	if (!pool->region)
+		return;
+	for (i = 0; i < num_buffers; i++) {
+		__u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
+		struct homa_bpage *bpage = &pool->descriptors[bpage_index];
+
+		if (bpage_index < pool->num_bpages)
+			if (atomic_dec_return(&bpage->refs) == 0)
+				atomic_inc(&pool->free_bpages);
+	}
+}
+
+/**
+ * homa_pool_check_waiting() - Checks to see if there are enough free
+ * bpages to wake up any RPCs that were blocked. Whenever
+ * homa_pool_release_buffers is invoked, this function must be invoked later,
+ * at a point when the caller holds no locks (homa_pool_release_buffers may
+ * be invoked with locks held, so it can't safely invoke this function).
+ * This is regrettably tricky, but I can't think of a better solution.
+ * @pool:         Information about the buffer pool.
+ */
+void homa_pool_check_waiting(struct homa_pool *pool)
+{
+#ifdef __UNIT_TEST__
+	pool->check_waiting_invoked += 1;
+#endif
+	while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
+		struct homa_rpc *rpc;
+
+		homa_sock_lock(pool->hsk, "buffer pool");
+		if (list_empty(&pool->hsk->waiting_for_bufs)) {
+			pool->bpages_needed = INT_MAX;
+			homa_sock_unlock(pool->hsk);
+			break;
+		}
+		rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+				       struct homa_rpc, buf_links);
+		if (!homa_bucket_try_lock(rpc->bucket, rpc->id,
+					  "homa_pool_check_waiting")) {
+			/* Can't just spin on the RPC lock because we're
+			 * holding the socket lock (see sync.txt). Instead,
+			 * release the socket lock and try the entire
+			 * operation again.
+			 */
+			homa_sock_unlock(pool->hsk);
+			UNIT_LOG("; ", "rpc lock unavailable in %s", __func__);
+			continue;
+		}
+		list_del_init(&rpc->buf_links);
+		if (list_empty(&pool->hsk->waiting_for_bufs))
+			pool->bpages_needed = INT_MAX;
+		else
+			set_bpages_needed(pool);
+		homa_sock_unlock(pool->hsk);
+		homa_pool_allocate(rpc);
+		if (rpc->msgin.num_bpages > 0)
+			/* Allocation succeeded; "wake up" the RPC. */
+			rpc->msgin.resend_all = 1;
+		homa_rpc_unlock(rpc);
+	}
+}
diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
new file mode 100644
index 000000000000..e14a71bae9e6
--- /dev/null
+++ b/net/homa/homa_pool.h
@@ -0,0 +1,152 @@ 
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions used to manage user-space buffer pools.
+ */
+
+#ifndef _HOMA_POOL_H
+#define _HOMA_POOL_H
+
+#include "homa_rpc.h"
+
+/**
+ * struct homa_bpage - Contains information about a single page in
+ * a buffer pool.
+ */
+struct homa_bpage {
+	union {
+		/**
+		 * @cache_line: Ensures that each homa_bpage object
+		 * is exactly one cache line long.
+		 */
+		char cache_line[L1_CACHE_BYTES];
+		struct {
+			/** @lock: to synchronize shared access. */
+			spinlock_t lock;
+
+			/**
+			 * @refs: Counts number of distinct uses of this
+			 * bpage (1 tick for each message that is using
+			 * this page, plus an additional tick if the @owner
+			 * field is set).
+			 */
+			atomic_t refs;
+
+			/**
+			 * @owner: kernel core that currently owns this page
+			 * (< 0 if none).
+			 */
+			int owner;
+
+			/**
+			 * @expiration: time (in get_cycles units) after
+			 * which it's OK to steal this page from its current
+			 * owner (if @refs is 1).
+			 */
+			__u64 expiration;
+		};
+	};
+};
+
+/**
+ * struct homa_pool_core - Holds core-specific data for a homa_pool (a bpage
+ * out of which that core is allocating small chunks).
+ */
+struct homa_pool_core {
+	union {
+		/**
+		 * @cache_line: Ensures that each object is exactly one
+		 * cache line long.
+		 */
+		char cache_line[L1_CACHE_BYTES];
+		struct {
+			/**
+			 * @page_hint: Index of bpage in pool->descriptors,
+			 * which may be owned by this core. If so, we'll use it
+			 * for allocating partial pages.
+			 */
+			int page_hint;
+
+			/**
+			 * @allocated: if the page given by @page_hint is
+			 * owned by this core, this variable gives the number of
+			 * (initial) bytes that have already been allocated
+			 * from the page.
+			 */
+			int allocated;
+
+			/**
+			 * @next_candidate: when searching for free bpages,
+			 * check this index next.
+			 */
+			int next_candidate;
+		};
+	};
+};
+
+/**
+ * struct homa_pool - Describes a pool of buffer space for incoming
+ * messages for a particular socket; managed by homa_pool.c. The pool is
+ * divided up into "bpages", which are a multiple of the hardware page size.
+ * A bpage may be owned by a particular core so that it can more efficiently
+ * allocate space for small messages.
+ */
+struct homa_pool {
+	/**
+	 * @hsk: the socket that this pool belongs to.
+	 */
+	struct homa_sock *hsk;
+
+	/**
+	 * @region: beginning of the pool's region (in the app's virtual
+	 * memory). Divided into bpages. NULL means the pool hasn't yet
+	 * been initialized.
+	 */
+	char *region;
+
+	/** @num_bpages: total number of bpages in the pool. */
+	int num_bpages;
+
+	/** @descriptors: kmalloced area containing one entry for each bpage. */
+	struct homa_bpage *descriptors;
+
+	/**
+	 * @free_bpages: the number of pages still available for allocation
+	 * by homa_pool_get_pages. This equals the number of pages with zero
+	 * reference counts, minus the number of pages that have been claimed
+	 * by homa_pool_get_pages but not yet allocated.
+	 */
+	atomic_t free_bpages;
+
+	/**
+	 * @bpages_needed: the number of free bpages required to satisfy the
+	 * needs of the first RPC on @hsk->waiting_for_bufs, or INT_MAX if
+	 * that queue is empty.
+	 */
+	int bpages_needed;
+
+	/** @cores: core-specific info; dynamically allocated. */
+	struct homa_pool_core *cores;
+
+	/** @num_cores: number of elements in @cores. */
+	int num_cores;
+
+	/**
+	 * @check_waiting_invoked: incremented during unit tests when
+	 * homa_pool_check_waiting is invoked.
+	 */
+	int check_waiting_invoked;
+};
+
+int      homa_pool_allocate(struct homa_rpc *rpc);
+void     homa_pool_check_waiting(struct homa_pool *pool);
+void     homa_pool_destroy(struct homa_pool *pool);
+void    *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+			      int *available);
+int      homa_pool_get_pages(struct homa_pool *pool, int num_pages,
+			     __u32 *pages, int set_owner);
+int      homa_pool_init(struct homa_sock *hsk, void *region,
+			__u64 region_size);
+void     homa_pool_release_buffers(struct homa_pool *pool,
+				   int num_buffers, __u32 *buffers);
+
+#endif /* _HOMA_POOL_H */
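
A note on the cache_line unions in this header: they assume that each
struct's members fit within one cache line. A compile-time check along
the following lines (a sketch using static_assert from
<linux/build_bug.h>, not part of the posted patch) would catch
accidental growth, e.g. from debug options that enlarge spinlock_t:

	static_assert(sizeof(struct homa_bpage) == L1_CACHE_BYTES,
		      "homa_bpage must occupy exactly one cache line");
	static_assert(sizeof(struct homa_pool_core) == L1_CACHE_BYTES,
		      "homa_pool_core must occupy exactly one cache line");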