[v14,5/7] btrfs: send: allocate send buffer with alloc_page() and vmap() for v2

Message ID 4353fe7122eb0aae24e3c9ff2399f2b58b74f79e.1647537027.git.osandov@fb.com (mailing list archive)
State New, archived
Series btrfs: add send/receive support for reading/writing compressed data

Commit Message

Omar Sandoval March 17, 2022, 5:25 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

For encoded writes, we need the raw pages for reading compressed data
directly via a bio. So, replace kvmalloc() with alloc_page() and vmap()
so that we have access to the raw pages as well as a contiguous virtual
mapping. 144k is large enough that kvmalloc() would usually fall back to
vmalloc() anyway.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

Comments

Sweet Tea Dorminy March 24, 2022, 5:53 p.m. UTC | #1
On 3/17/22 13:25, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> For encoded writes, we need the raw pages for reading compressed data
> directly via a bio.
Perhaps:
"For encoded writes, the existing btrfs_encoded_read*() functions expect 
a list of raw pages."

I think it would be better to continue just vmalloc'ing a large
contiguous buffer and translating each page in the buffer into its raw
page with something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
virt_to_page(data). Vmalloc can request a higher-order allocation, which
probably doesn't matter but might slightly improve memory locality. And
in terms of readability, I somewhat like the elegance of having a single
identical kvmalloc call to allocate send_buf in both cases, even if
we do need to initialize the page list for some v2 commands.
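
A minimal sketch of the translation being suggested here, assuming kvmalloc()
hands back a page-aligned buffer (the question raised below); the function and
variable names are illustrative, not taken from the patch:

	#include <linux/mm.h>
	#include <linux/vmalloc.h>

	/*
	 * Fill an array with the struct page pointers backing a page-aligned
	 * kvmalloc() buffer: vmalloc memory needs vmalloc_to_page(), while a
	 * physically contiguous kmalloc buffer can use virt_to_page().
	 */
	static void fill_buf_pages(void *buf, struct page **pages,
				   unsigned int nr_pages)
	{
		unsigned int i;

		for (i = 0; i < nr_pages; i++) {
			void *addr = buf + ((size_t)i << PAGE_SHIFT);

			pages[i] = is_vmalloc_addr(addr) ?
				   vmalloc_to_page(addr) : virt_to_page(addr);
		}
	}
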
Omar Sandoval March 30, 2022, 4:03 p.m. UTC | #2
On Thu, Mar 24, 2022 at 01:53:20PM -0400, Sweet Tea Dorminy wrote:
> 
> 
> On 3/17/22 13:25, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > For encoded writes, we need the raw pages for reading compressed data
> > directly via a bio.
> Perhaps:
> "For encoded writes, the existing btrfs_encoded_read*() functions expect a
> list of raw pages."
> 
> I think it would be a better to continue just vmalloc'ing a large continuous
> buffer and translating each page in the buffer into its raw page with
> something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
> virt_to_page(data). Vmalloc can request a higher-order allocation, which
> probably doesn't matter but might slightly improve memory locality. And in
> terms of readability, I somewhat like the elegance of having a single
> identical kvmalloc call to allocate and send_buf in both cases, even if we
> do need to initialize the page list for some v2 commands.

I like this, but are we guaranteed that kvmalloc() will return a
page-aligned buffer? It seems reasonable to me that it would for
allocations of at least one page, but I can't find that written down
anywhere.
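
If the kvmalloc() route were taken, one way to make that page-alignment
assumption explicit would be a guard along these lines (placement and error
handling are only illustrative, reusing the function's existing ret/out
pattern; PAGE_ALIGNED() and WARN_ON_ONCE() are existing kernel helpers):

	/* The page translation relies on send_buf being page-aligned. */
	if (WARN_ON_ONCE(!PAGE_ALIGNED(sctx->send_buf))) {
		ret = -EINVAL;
		goto out;
	}
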
Sweet Tea Dorminy March 30, 2022, 4:33 p.m. UTC | #3
On 3/30/22 12:03, Omar Sandoval wrote:
> On Thu, Mar 24, 2022 at 01:53:20PM -0400, Sweet Tea Dorminy wrote:
>>
>>
>> On 3/17/22 13:25, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> For encoded writes, we need the raw pages for reading compressed data
>>> directly via a bio.
>> Perhaps:
>> "For encoded writes, the existing btrfs_encoded_read*() functions expect a
>> list of raw pages."
>>
>> I think it would be a better to continue just vmalloc'ing a large continuous
>> buffer and translating each page in the buffer into its raw page with
>> something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
>> virt_to_page(data). Vmalloc can request a higher-order allocation, which
>> probably doesn't matter but might slightly improve memory locality. And in
>> terms of readability, I somewhat like the elegance of having a single
>> identical kvmalloc call to allocate and send_buf in both cases, even if we
>> do need to initialize the page list for some v2 commands.
> 
> I like this, but are we guaranteed that kvmalloc() will return a
> page-aligned buffer? It seems reasonable to me that it would for
> allocations of at least one page, but I can't find that written down
> anywhere.

Since vmalloc allocates whole pages, and kmalloc guarantees alignment to 
the allocation size for powers of 2 sizes (and PAGE_SIZE is required to 
be a power of 2), I think that adds up to a guarantee of page alignment 
both ways?

https://elixir.bootlin.com/linux/v5.17.1/source/include/linux/slab.h#L522 : 
kmalloc: "For @size of power of two bytes, the alignment is also 
guaranteed to be at least to the size."
https://elixir.bootlin.com/linux/v5.17.1/source/mm/vmalloc.c#L3180: 
vmalloc: " Allocate enough pages"...
Omar Sandoval March 30, 2022, 5:13 p.m. UTC | #4
On Wed, Mar 30, 2022 at 12:33:48PM -0400, Sweet Tea Dorminy wrote:
> 
> 
> On 3/30/22 12:03, Omar Sandoval wrote:
> > On Thu, Mar 24, 2022 at 01:53:20PM -0400, Sweet Tea Dorminy wrote:
> > > 
> > > 
> > > On 3/17/22 13:25, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > For encoded writes, we need the raw pages for reading compressed data
> > > > directly via a bio.
> > > Perhaps:
> > > "For encoded writes, the existing btrfs_encoded_read*() functions expect a
> > > list of raw pages."
> > > 
> > > I think it would be a better to continue just vmalloc'ing a large continuous
> > > buffer and translating each page in the buffer into its raw page with
> > > something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
> > > virt_to_page(data). Vmalloc can request a higher-order allocation, which
> > > probably doesn't matter but might slightly improve memory locality. And in
> > > terms of readability, I somewhat like the elegance of having a single
> > > identical kvmalloc call to allocate and send_buf in both cases, even if we
> > > do need to initialize the page list for some v2 commands.
> > 
> > I like this, but are we guaranteed that kvmalloc() will return a
> > page-aligned buffer? It seems reasonable to me that it would for
> > allocations of at least one page, but I can't find that written down
> > anywhere.
> 
> Since vmalloc allocates whole pages, and kmalloc guarantees alignment to the
> allocation size for powers of 2 sizes (and PAGE_SIZE is required to be a
> power of 2), I think that adds up to a guarantee of page alignment both
> ways?
> 
> https://elixir.bootlin.com/linux/v5.17.1/source/include/linux/slab.h#L522 :
> kmalloc: "For @size of power of two bytes, the alignment is also guaranteed
> to be at least to the size."

Our allocation size is ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED, PAGE_SIZE),
which is 144K for PAGE_SIZE = 4k. If I interpret the kmalloc() comment
very literally, since this isn't a power of two, it's not guaranteed to
be aligned, right?
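
(For reference: BTRFS_MAX_COMPRESSED is SZ_128K, so this works out to
ALIGN(16K + 128K, 4K) = 144K, which is a multiple of the page size but not a
power of two.)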
Sweet Tea Dorminy March 30, 2022, 6:48 p.m. UTC | #5
On 3/30/22 13:13, Omar Sandoval wrote:
> On Wed, Mar 30, 2022 at 12:33:48PM -0400, Sweet Tea Dorminy wrote:
>>
>>
>> On 3/30/22 12:03, Omar Sandoval wrote:
>>> On Thu, Mar 24, 2022 at 01:53:20PM -0400, Sweet Tea Dorminy wrote:
>>>>
>>>>
>>>> On 3/17/22 13:25, Omar Sandoval wrote:
>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>
>>>>> For encoded writes, we need the raw pages for reading compressed data
>>>>> directly via a bio.
>>>> Perhaps:
>>>> "For encoded writes, the existing btrfs_encoded_read*() functions expect a
>>>> list of raw pages."
>>>>
>>>> I think it would be a better to continue just vmalloc'ing a large continuous
>>>> buffer and translating each page in the buffer into its raw page with
>>>> something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
>>>> virt_to_page(data). Vmalloc can request a higher-order allocation, which
>>>> probably doesn't matter but might slightly improve memory locality. And in
>>>> terms of readability, I somewhat like the elegance of having a single
>>>> identical kvmalloc call to allocate and send_buf in both cases, even if we
>>>> do need to initialize the page list for some v2 commands.
>>>
>>> I like this, but are we guaranteed that kvmalloc() will return a
>>> page-aligned buffer? It seems reasonable to me that it would for
>>> allocations of at least one page, but I can't find that written down
>>> anywhere.
>>
>> Since vmalloc allocates whole pages, and kmalloc guarantees alignment to the
>> allocation size for powers of 2 sizes (and PAGE_SIZE is required to be a
>> power of 2), I think that adds up to a guarantee of page alignment both
>> ways?
>>
>> https://elixir.bootlin.com/linux/v5.17.1/source/include/linux/slab.h#L522 :
>> kmalloc: "For @size of power of two bytes, the alignment is also guaranteed
>> to be at least to the size."
> 
> Our allocation size is ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED, PAGE_SIZE),
> which is 144K for PAGE_SIZE = 4k. If I interpret the kmalloc() comment
> very literally, since this isn't a power of two, it's not guaranteed to
> be aligned, right?

Ah, an excellent point.

Now that I think about it, the kmalloc path picks a slab to allocate 
from based on the log_2 of the size: 
https://elixir.bootlin.com/linux/v5.17.1/source/mm/slab_common.c#L733 so 
we'd end up wasting 128k - 16k = 112k of space using kmalloc, whether it's
aligned or not, I think?

So maybe it should just always use vmalloc and get the page alignment?
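
To put numbers on that, assuming the kmalloc() attempt is rounded up to the
next power-of-two size: a 144k request would come out of a 256k allocation,
wasting 256k - 144k = 112k.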
Omar Sandoval March 30, 2022, 8:42 p.m. UTC | #6
On Wed, Mar 30, 2022 at 02:48:42PM -0400, Sweet Tea Dorminy wrote:
> 
> 
> On 3/30/22 13:13, Omar Sandoval wrote:
> > On Wed, Mar 30, 2022 at 12:33:48PM -0400, Sweet Tea Dorminy wrote:
> > > 
> > > 
> > > On 3/30/22 12:03, Omar Sandoval wrote:
> > > > On Thu, Mar 24, 2022 at 01:53:20PM -0400, Sweet Tea Dorminy wrote:
> > > > > 
> > > > > 
> > > > > On 3/17/22 13:25, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > 
> > > > > > For encoded writes, we need the raw pages for reading compressed data
> > > > > > directly via a bio.
> > > > > Perhaps:
> > > > > "For encoded writes, the existing btrfs_encoded_read*() functions expect a
> > > > > list of raw pages."
> > > > > 
> > > > > I think it would be a better to continue just vmalloc'ing a large continuous
> > > > > buffer and translating each page in the buffer into its raw page with
> > > > > something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
> > > > > virt_to_page(data). Vmalloc can request a higher-order allocation, which
> > > > > probably doesn't matter but might slightly improve memory locality. And in
> > > > > terms of readability, I somewhat like the elegance of having a single
> > > > > identical kvmalloc call to allocate and send_buf in both cases, even if we
> > > > > do need to initialize the page list for some v2 commands.
> > > > 
> > > > I like this, but are we guaranteed that kvmalloc() will return a
> > > > page-aligned buffer? It seems reasonable to me that it would for
> > > > allocations of at least one page, but I can't find that written down
> > > > anywhere.
> > > 
> > > Since vmalloc allocates whole pages, and kmalloc guarantees alignment to the
> > > allocation size for powers of 2 sizes (and PAGE_SIZE is required to be a
> > > power of 2), I think that adds up to a guarantee of page alignment both
> > > ways?
> > > 
> > > https://elixir.bootlin.com/linux/v5.17.1/source/include/linux/slab.h#L522 :
> > > kmalloc: "For @size of power of two bytes, the alignment is also guaranteed
> > > to be at least to the size."
> > 
> > Our allocation size is ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED, PAGE_SIZE),
> > which is 144K for PAGE_SIZE = 4k. If I interpret the kmalloc() comment
> > very literally, since this isn't a power of two, it's not guaranteed to
> > be aligned, right?
> 
> Ah, an excellent point.
> 
> Now that I think about it, the kmalloc path picks a slab to allocate from
> based on the log_2 of the size:
> https://elixir.bootlin.com/linux/v5.17.1/source/mm/slab_common.c#L733 so
> we'd end up wasting 128k-16k space using kmalloc, whether it's aligned or
> not, I think?
> 
> So maybe it should just always use vmalloc and get the page alignment?

Yeah, vmalloc()+vmalloc_to_page() is going to be more or less equivalent
to the vmap thing I'm doing here, but a lot cleaner. Replacing this
patch with the below patch seems to work:

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index c0ca45dae6d6..e574d4f4a167 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -87,6 +87,7 @@ struct send_ctx {
 	 * command (since protocol v2, data must be the last attribute).
 	 */
 	bool put_data;
+	struct page **send_buf_pages;
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -7486,12 +7487,32 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
 	if (sctx->proto >= 2) {
+		u32 send_buf_num_pages;
+
 		sctx->send_max_size = ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED,
 					    PAGE_SIZE);
+		sctx->send_buf = vmalloc(sctx->send_max_size);
+		if (!sctx->send_buf) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT;
+		sctx->send_buf_pages = kcalloc(send_buf_num_pages,
+					       sizeof(*sctx->send_buf_pages),
+					       GFP_KERNEL);
+		if (!sctx->send_buf_pages) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		for (i = 0; i < send_buf_num_pages; i++) {
+			sctx->send_buf_pages[i] =
+				vmalloc_to_page(sctx->send_buf +
+						(i << PAGE_SHIFT));
+		}
 	} else {
 		sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+		sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	}
-	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
 		goto out;
@@ -7684,6 +7705,7 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 			fput(sctx->send_filp);
 
 		kvfree(sctx->clone_roots);
+		kfree(sctx->send_buf_pages);
 		kvfree(sctx->send_buf);
 
 		name_cache_free(sctx);
Sweet Tea Dorminy March 30, 2022, 9:04 p.m. UTC | #7
> Yeah, vmalloc()+vmalloc_to_page() is going to be more or less equivalent
> to the vmap thing I'm doing here, but a lot cleaner. Replacing this
> patch with the below patch seems to work:
Looks great to me - thanks!

Patch

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 02053fff80ca..ac2a1297027a 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -83,6 +83,7 @@  struct send_ctx {
 	u32 send_size;
 	u32 send_max_size;
 	bool put_data;
+	struct page **send_buf_pages;
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -7392,6 +7393,7 @@  long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	struct btrfs_root *clone_root;
 	struct send_ctx *sctx = NULL;
 	u32 i;
+	u32 send_buf_num_pages = 0;
 	u64 *clone_sources_tmp = NULL;
 	int clone_sources_to_rollback = 0;
 	size_t alloc_size;
@@ -7483,10 +7485,28 @@  long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	if (sctx->proto >= 2) {
 		sctx->send_max_size = ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED,
 					    PAGE_SIZE);
+		send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT;
+		sctx->send_buf_pages = kcalloc(send_buf_num_pages,
+					       sizeof(*sctx->send_buf_pages),
+					       GFP_KERNEL);
+		if (!sctx->send_buf_pages) {
+			send_buf_num_pages = 0;
+			ret = -ENOMEM;
+			goto out;
+		}
+		for (i = 0; i < send_buf_num_pages; i++) {
+			sctx->send_buf_pages[i] = alloc_page(GFP_KERNEL);
+			if (!sctx->send_buf_pages[i]) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+		sctx->send_buf = vmap(sctx->send_buf_pages, send_buf_num_pages,
+				      VM_MAP, PAGE_KERNEL);
 	} else {
 		sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+		sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	}
-	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
 		goto out;
@@ -7679,7 +7699,16 @@  long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 			fput(sctx->send_filp);
 
 		kvfree(sctx->clone_roots);
-		kvfree(sctx->send_buf);
+		if (sctx->proto >= 2) {
+			vunmap(sctx->send_buf);
+			for (i = 0; i < send_buf_num_pages; i++) {
+				if (sctx->send_buf_pages[i])
+					__free_page(sctx->send_buf_pages[i]);
+			}
+			kfree(sctx->send_buf_pages);
+		} else {
+			kvfree(sctx->send_buf);
+		}
 
 		name_cache_free(sctx);