| Message ID | 4353fe7122eb0aae24e3c9ff2399f2b58b74f79e.1647537027.git.osandov@fb.com |
| --- | --- |
| State | New, archived |
| Series | btrfs: add send/receive support for reading/writing compressed data |
On 3/17/22 13:25, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> For encoded writes, we need the raw pages for reading compressed data
> directly via a bio.

Perhaps: "For encoded writes, the existing btrfs_encoded_read*() functions
expect a list of raw pages."

I think it would be better to continue just vmalloc'ing a large contiguous
buffer and translating each page in the buffer into its raw page with
something like:

    is_vmalloc_addr(data) ? vmalloc_to_page(data) : virt_to_page(data)

Vmalloc can request a higher-order allocation, which probably doesn't matter
but might slightly improve memory locality. And in terms of readability, I
somewhat like the elegance of having a single identical kvmalloc call to
allocate send_buf in both cases, even if we do need to initialize the page
list for some v2 commands.
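A minimal sketch of the translation being suggested here (the helper name
`send_buf_to_pages` is invented for illustration; it is not in any patch).
kvmalloc() may have satisfied the request from kmalloc (physically
contiguous memory) or from vmalloc (only virtually contiguous), so each
chunk's struct page has to be looked up accordingly:

```c
#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical helper: record the struct page backing each PAGE_SIZE
 * chunk of a kvmalloc()'d buffer.  virt_to_page() is only valid for
 * physically contiguous (kmalloc'd) memory; vmalloc'd memory needs the
 * page-table walk that vmalloc_to_page() performs.  Assumes @size is a
 * multiple of PAGE_SIZE and @buf is page aligned.
 */
static void send_buf_to_pages(void *buf, u32 size, struct page **pages)
{
	u32 i;

	for (i = 0; i < size >> PAGE_SHIFT; i++) {
		void *addr = buf + (i << PAGE_SHIFT);

		pages[i] = is_vmalloc_addr(addr) ? vmalloc_to_page(addr)
						 : virt_to_page(addr);
	}
}
```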
On Thu, Mar 24, 2022 at 01:53:20PM -0400, Sweet Tea Dorminy wrote:
> I think it would be better to continue just vmalloc'ing a large
> contiguous buffer and translating each page in the buffer into its raw
> page with something like is_vmalloc_addr(data) ? vmalloc_to_page(data) :
> virt_to_page(data). [...] And in terms of readability, I somewhat like
> the elegance of having a single identical kvmalloc call to allocate
> send_buf in both cases, even if we do need to initialize the page list
> for some v2 commands.

I like this, but are we guaranteed that kvmalloc() will return a
page-aligned buffer? It seems reasonable to me that it would for
allocations of at least one page, but I can't find that written down
anywhere.
On 3/30/22 12:03, Omar Sandoval wrote:
> I like this, but are we guaranteed that kvmalloc() will return a
> page-aligned buffer? It seems reasonable to me that it would for
> allocations of at least one page, but I can't find that written down
> anywhere.

Since vmalloc allocates whole pages, and kmalloc guarantees alignment to
the allocation size for power-of-two sizes (and PAGE_SIZE is required to
be a power of two), I think that adds up to a guarantee of page alignment
both ways?

https://elixir.bootlin.com/linux/v5.17.1/source/include/linux/slab.h#L522
kmalloc: "For @size of power of two bytes, the alignment is also
guaranteed to be at least to the size."

https://elixir.bootlin.com/linux/v5.17.1/source/mm/vmalloc.c#L3180
vmalloc: "Allocate enough pages"...
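A quick way to sanity-check that reasoning at runtime (a debugging sketch
only, not from any patch; `alloc_checked` is an invented name):

```c
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/slab.h>

/*
 * Debugging sketch: for a power-of-two, page-multiple size such as
 * SZ_64K, both kvmalloc() outcomes should be page aligned -- kmalloc()
 * because power-of-two sizes are aligned to the size itself, vmalloc()
 * because it hands out whole pages.
 */
static void *alloc_checked(void)
{
	void *buf = kvmalloc(SZ_64K, GFP_KERNEL);

	if (buf)
		WARN_ON(!PAGE_ALIGNED(buf));	/* should never fire */
	return buf;
}
```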
On Wed, Mar 30, 2022 at 12:33:48PM -0400, Sweet Tea Dorminy wrote:
> Since vmalloc allocates whole pages, and kmalloc guarantees alignment to
> the allocation size for power-of-two sizes (and PAGE_SIZE is required to
> be a power of two), I think that adds up to a guarantee of page
> alignment both ways?
>
> https://elixir.bootlin.com/linux/v5.17.1/source/include/linux/slab.h#L522
> kmalloc: "For @size of power of two bytes, the alignment is also
> guaranteed to be at least to the size."

Our allocation size is ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED, PAGE_SIZE),
which is 144K for PAGE_SIZE = 4K. If I interpret the kmalloc() comment
very literally, since this isn't a power of two, it's not guaranteed to
be aligned, right?
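Worked numbers for that size, assuming PAGE_SIZE = 4K and
BTRFS_MAX_COMPRESSED = 128K (its value in v5.17); the macro name
`SEND_MAX_SIZE_V2` is invented for the example:

```c
#include <linux/align.h>
#include <linux/build_bug.h>
#include <linux/sizes.h>

/* Assumes PAGE_SIZE = 4K and BTRFS_MAX_COMPRESSED = SZ_128K (v5.17). */
#define SEND_MAX_SIZE_V2	ALIGN(SZ_16K + SZ_128K, SZ_4K)

/* 16K + 128K = 144K, which is already a multiple of 4K... */
static_assert(SEND_MAX_SIZE_V2 == 144 * SZ_1K);

/*
 * ...but 144K is not a power of two (128K < 144K < 256K), so the quoted
 * kmalloc() alignment guarantee does not literally cover it.
 */
static_assert((SEND_MAX_SIZE_V2 & (SEND_MAX_SIZE_V2 - 1)) != 0);
```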
On 3/30/22 13:13, Omar Sandoval wrote:
> Our allocation size is ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED, PAGE_SIZE),
> which is 144K for PAGE_SIZE = 4K. If I interpret the kmalloc() comment
> very literally, since this isn't a power of two, it's not guaranteed to
> be aligned, right?

Ah, an excellent point.

Now that I think about it, the kmalloc path picks a slab to allocate from
based on the log_2 of the size:
https://elixir.bootlin.com/linux/v5.17.1/source/mm/slab_common.c#L733
so we'd end up wasting 128K - 16K = 112K of space using kmalloc, whether
it's aligned or not, I think?

So maybe it should just always use vmalloc and get the page alignment?
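The rounding being described, as arithmetic: the kmalloc path effectively
rounds a request up to the next power-of-two size class, so a 144K request
comes out of a 256K class. A small compile-time illustration (the macro
names are invented here):

```c
#include <linux/build_bug.h>
#include <linux/log2.h>
#include <linux/sizes.h>

#define V2_REQUEST	(SZ_16K + SZ_128K)		/* 144K */
#define V2_SIZE_CLASS	roundup_pow_of_two(V2_REQUEST)	/* 256K */

/* 256K - 144K = 112K, i.e. the 128K - 16K of waste mentioned above. */
static_assert(V2_SIZE_CLASS - V2_REQUEST == SZ_128K - SZ_16K);
```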
On Wed, Mar 30, 2022 at 02:48:42PM -0400, Sweet Tea Dorminy wrote:
> Now that I think about it, the kmalloc path picks a slab to allocate
> from based on the log_2 of the size:
> https://elixir.bootlin.com/linux/v5.17.1/source/mm/slab_common.c#L733
> so we'd end up wasting 128K - 16K = 112K of space using kmalloc, whether
> it's aligned or not, I think?
>
> So maybe it should just always use vmalloc and get the page alignment?

Yeah, vmalloc()+vmalloc_to_page() is going to be more or less equivalent
to the vmap thing I'm doing here, but a lot cleaner. Replacing this patch
with the below patch seems to work:

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index c0ca45dae6d6..e574d4f4a167 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -87,6 +87,7 @@ struct send_ctx {
 	 * command (since protocol v2, data must be the last attribute).
 	 */
 	bool put_data;
+	struct page **send_buf_pages;
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -7486,12 +7487,32 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
 	if (sctx->proto >= 2) {
+		u32 send_buf_num_pages;
+
 		sctx->send_max_size = ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED,
 					    PAGE_SIZE);
+		sctx->send_buf = vmalloc(sctx->send_max_size);
+		if (!sctx->send_buf) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT;
+		sctx->send_buf_pages = kcalloc(send_buf_num_pages,
+					       sizeof(*sctx->send_buf_pages),
+					       GFP_KERNEL);
+		if (!sctx->send_buf_pages) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		for (i = 0; i < send_buf_num_pages; i++) {
+			sctx->send_buf_pages[i] =
+				vmalloc_to_page(sctx->send_buf +
+						(i << PAGE_SHIFT));
+		}
 	} else {
 		sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+		sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	}
-	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
 		goto out;
@@ -7684,6 +7705,7 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 		fput(sctx->send_filp);
 
 	kvfree(sctx->clone_roots);
+	kfree(sctx->send_buf_pages);
 	kvfree(sctx->send_buf);
 
 	name_cache_free(sctx);
> Yeah, vmalloc()+vmalloc_to_page() is going to be more or less equivalent
> to the vmap thing I'm doing here, but a lot cleaner. Replacing this
> patch with the below patch seems to work:

Looks great to me - thanks!
For reference, the patch as originally posted (the "vmap thing" referenced
above, which the replacement supersedes):

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 02053fff80ca..ac2a1297027a 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -83,6 +83,7 @@ struct send_ctx {
 	u32 send_size;
 	u32 send_max_size;
 	bool put_data;
+	struct page **send_buf_pages;
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -7392,6 +7393,7 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	struct btrfs_root *clone_root;
 	struct send_ctx *sctx = NULL;
 	u32 i;
+	u32 send_buf_num_pages = 0;
 	u64 *clone_sources_tmp = NULL;
 	int clone_sources_to_rollback = 0;
 	size_t alloc_size;
@@ -7483,10 +7485,28 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	if (sctx->proto >= 2) {
 		sctx->send_max_size = ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED,
 					    PAGE_SIZE);
+		send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT;
+		sctx->send_buf_pages = kcalloc(send_buf_num_pages,
+					       sizeof(*sctx->send_buf_pages),
+					       GFP_KERNEL);
+		if (!sctx->send_buf_pages) {
+			send_buf_num_pages = 0;
+			ret = -ENOMEM;
+			goto out;
+		}
+		for (i = 0; i < send_buf_num_pages; i++) {
+			sctx->send_buf_pages[i] = alloc_page(GFP_KERNEL);
+			if (!sctx->send_buf_pages[i]) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+		sctx->send_buf = vmap(sctx->send_buf_pages, send_buf_num_pages,
+				      VM_MAP, PAGE_KERNEL);
 	} else {
 		sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+		sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	}
-	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
 		goto out;
@@ -7679,7 +7699,16 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 		fput(sctx->send_filp);
 
 	kvfree(sctx->clone_roots);
-	kvfree(sctx->send_buf);
+	if (sctx->proto >= 2) {
+		vunmap(sctx->send_buf);
+		for (i = 0; i < send_buf_num_pages; i++) {
+			if (sctx->send_buf_pages[i])
+				__free_page(sctx->send_buf_pages[i]);
+		}
+		kfree(sctx->send_buf_pages);
+	} else {
+		kvfree(sctx->send_buf);
+	}
 
 	name_cache_free(sctx);