Message ID | 20231213-zswap-dstmem-v4-1-f228b059dd89@bytedance.com (mailing list archive) |
---|---|
State | New |
Series | mm/zswap: dstmem reuse optimizations and cleanups |
On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> we only need at most one page when compress, and the "dlen" is also
> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> we don't wanna store the output in zswap anyway.
>
> So change it to one page, and delete the stale comment.
>
> There is no any history about the reason why we needed 2 pages, it has
> been 2 * PAGE_SIZE since the time zswap was first merged.

i remember there was an over-compression case, that means the compressed
data can be bigger than the source data. the similar thing is also done in zram
drivers/block/zram/zcomp.c

int zcomp_compress(struct zcomp_strm *zstrm,
		const void *src, unsigned int *dst_len)
{
	/*
	 * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
	 * because sometimes we can endup having a bigger compressed data
	 * due to various reasons: for example compression algorithms tend
	 * to add some padding to the compressed buffer. Speaking of padding,
	 * comp algorithm `842' pads the compressed length to multiple of 8
	 * and returns -ENOSP when the dst memory is not big enough, which
	 * is not something that ZRAM wants to see. We can handle the
	 * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
	 * receive -ERRNO from the compressing backend we can't help it
	 * anymore. To make `842' happy we need to tell the exact size of
	 * the dst buffer, zram_drv will take care of the fact that
	 * compressed buffer is too big.
	 */
	*dst_len = PAGE_SIZE * 2;

	return crypto_comp_compress(zstrm->tfm,
			src, PAGE_SIZE,
			zstrm->buffer, dst_len);
}

>
> According to Yosry and Nhat, one potential reason is that we used to
> store a zswap header containing the swap entry in the compressed page
> for writeback purposes, but we don't do that anymore.
>
> This patch works good in kernel build testing even when the input data
> doesn't compress at all (i.e. dlen == PAGE_SIZE), which we can see
> from the bpftrace tool:
>
> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> @[1]: 2
> @[0]: 12011430
>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7ee54a3d8281..976f278aa507 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
>  	struct mutex *mutex;
>  	u8 *dst;
>
> -	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> +	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
>  	if (!dst)
>  		return -ENOMEM;
>
> @@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
>  	sg_init_table(&input, 1);
>  	sg_set_page(&input, page, PAGE_SIZE, 0);
>
> -	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> -	sg_init_one(&output, dst, PAGE_SIZE * 2);
> +	sg_init_one(&output, dst, PAGE_SIZE);
>  	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
>  	/*
>  	 * it maybe looks a little bit silly that we send an asynchronous request,
>
> --
> b4 0.10.1
>

Thanks
Barry
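For readers following the over-compression point, here is a minimal userspace sketch of the effect the quoted zram comment describes. It is not kernel code and not how any real compressor works: toy_store_compress(), TOY_HDR_LEN and TOY_PAGE_SIZE are hypothetical names, the 4-byte header is an assumption, and the only behaviour borrowed from the comment is "pad the output length to a multiple of 8 and fail when the destination is too small". With incompressible input of one page, the output needs 4104 bytes, so a one-page destination gets an error while a two-page destination lets the caller see the real length and decide for itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
#define TOY_HDR_LEN   4u	/* hypothetical per-block header size */

/* Round len up to the next multiple of 8 (the padding rule the zram comment attributes to 842). */
static size_t toy_pad8(size_t len)
{
	return (len + 7u) & ~(size_t)7u;
}

/*
 * Hypothetical "stored block" encoder: input that does not compress is
 * copied verbatim behind a small header, then padded to a multiple of 8.
 * On entry *dst_len is the destination capacity; on success it is updated
 * to the number of bytes produced. Returns 0 or -ENOSPC.
 */
static int toy_store_compress(const unsigned char *src, size_t src_len,
			      unsigned char *dst, size_t *dst_len)
{
	size_t need = toy_pad8(TOY_HDR_LEN + src_len);

	assert(src && dst && dst_len);

	/* A well-behaved backend checks the capacity before writing anything. */
	if (need > *dst_len)
		return -ENOSPC;

	memset(dst, 0, need);
	dst[0] = (unsigned char)(src_len & 0xff);
	dst[1] = (unsigned char)((src_len >> 8) & 0xff);
	memcpy(dst + TOY_HDR_LEN, src, src_len);
	*dst_len = need;
	return 0;
}
```

Calling it with a full page of input and *dst_len set to TOY_PAGE_SIZE returns -ENOSPC; with *dst_len set to 2 * TOY_PAGE_SIZE it returns 0 and reports 4104 bytes, which the caller can then reject as "bigger than a page" without having to interpret an error code.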
On 2023/12/27 09:07, Barry Song wrote:
> On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>> we only need at most one page when compress, and the "dlen" is also
>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>> we don't wanna store the output in zswap anyway.
>>
>> So change it to one page, and delete the stale comment.
>>
>> There is no any history about the reason why we needed 2 pages, it has
>> been 2 * PAGE_SIZE since the time zswap was first merged.
>
> i remember there was an over-compression case, that means the compressed
> data can be bigger than the source data. the similar thing is also done in zram
> drivers/block/zram/zcomp.c

Right, there is a buffer overflow report[1] that I just +to you.

I think over-compression is all right, but buffer overflow is not acceptable,
so we should fix any buffer overflow problem IMHO. Anyway, 2 pages maybe
overflowed too, just with smaller probability, right?

Thanks.
On Wed, Dec 27, 2023 at 7:11 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/27 09:07, Barry Song wrote:
> > On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >> we only need at most one page when compress, and the "dlen" is also
> >> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >> we don't wanna store the output in zswap anyway.
> >>
> >> So change it to one page, and delete the stale comment.
> >>
> >> There is no any history about the reason why we needed 2 pages, it has
> >> been 2 * PAGE_SIZE since the time zswap was first merged.
> >
> > i remember there was an over-compression case, that means the compressed
> > data can be bigger than the source data. the similar thing is also done in zram
> > drivers/block/zram/zcomp.c
>
> Right, there is a buffer overflow report[1] that I just +to you.
>
> I think over-compression is all right, but buffer overflow is not acceptable,
> so we should fix any buffer overflow problem IMHO. Anyway, 2 pages maybe
> overflowed too, just with smaller probability, right?

Practically, the typical page size is 4KB or above, so we have never seen
2 pages overflow. We may be able to make CPU-based compression code return
early before overflowing, though that is still very tough. But for
accelerator-based compression in drivers/crypto, the only choice is to give
its DMA engine a buffer that is long enough: 2 * PAGE_SIZE. So I don't think
this patch is correct.

Thanks
Barry
On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:

> > i remember there was an over-compression case, that means the compressed
> > data can be bigger than the source data. the similar thing is also done in zram
> > drivers/block/zram/zcomp.c
>
> Right, there is a buffer overflow report[1] that I just +to you.

What does "[1]" refer to? Is there a bug report about this series?
On Wed, Dec 27, 2023 at 12:58 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> > > i remember there was an over-compression case, that means the compressed
> > > data can be bigger than the source data. the similar thing is also done in zram
> > > drivers/block/zram/zcomp.c
> >
> > Right, there is a buffer overflow report[1] that I just +to you.
>
> What does "[1]" refer to? Is there a bug report about this series?

I think Chengming was referring to this:

https://lore.kernel.org/lkml/0000000000000b05cd060d6b5511@google.com/

Syzkaller/syzbot found an edge case where the page's "compressed" form
was larger than one page, which tripped up the compression code (since
we reduced the compression buffer size to 1 page here).
On 2023/12/28 07:21, Nhat Pham wrote:
> On Wed, Dec 27, 2023 at 12:58 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>
>>>> i remember there was an over-compression case, that means the compressed
>>>> data can be bigger than the source data. the similar thing is also done in zram
>>>> drivers/block/zram/zcomp.c
>>>
>>> Right, there is a buffer overflow report[1] that I just +to you.
>>
>> What does "[1]" refer to? Is there a bug report about this series?
>
> I think Chengming was referring to this:
>
> https://lore.kernel.org/lkml/0000000000000b05cd060d6b5511@google.com/
>
> Syzkaller/syzbot found an edge case where the page's "compressed" form
> was larger than one page, which tripped up the compression code (since
> we reduced the compression buffer size to 1 page here).

Right, thanks Nhat!

The reported bug can be fixed by a patch I posted:
https://lore.kernel.org/all/20231227093523.2735484-1-chengming.zhou@linux.dev/

Although this bug is fixed, we still have to revert the first patch and go
back to a 2-page buffer in zswap, since not all compressor drivers respect
the buffer size we pass in, and they may overflow our output buffer.

Barry Song has explained the background in:
https://lore.kernel.org/all/CAGsJ_4xuuaPnQzkkQVaRyZL6ZdwkiQ_B7_c2baNaCKVg_O7ZQA@mail.gmail.com/

I will send an updated series later.

Thanks!
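The concern about drivers that do not respect the output length can be illustrated with a small userspace sketch. Everything here is hypothetical: toy_sloppy_backend() stands in for a backend that ignores the capacity it was given, the 16-byte header is an assumption, and the canary scan is only a demonstration technique, not something zswap does. The scratch buffer is deliberately two pages, so the sketch itself never writes out of bounds; it just counts how many bytes landed beyond the first page, which is exactly what a one-page dstmem would have lost to an overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
#define TOY_HDR_LEN   16u	/* hypothetical header written by the backend */
#define TOY_CANARY    0xA5u	/* fill pattern used to detect writes */

/*
 * Hypothetical backend that ignores the capacity it was told about:
 * it always emits a header followed by a verbatim copy of the input.
 * Returns the number of bytes written.
 */
static size_t toy_sloppy_backend(const unsigned char *src, size_t src_len,
				 unsigned char *dst, size_t dst_cap_ignored)
{
	(void)dst_cap_ignored;
	memset(dst, 0, TOY_HDR_LEN);
	memcpy(dst + TOY_HDR_LEN, src, src_len);
	return TOY_HDR_LEN + src_len;
}

/*
 * Run the backend into a two-page scratch buffer while telling it the
 * buffer is one page, then count the bytes past the first page that no
 * longer hold the canary. The count is a lower bound, since input bytes
 * equal to TOY_CANARY would go unnoticed.
 */
static size_t toy_bytes_past_first_page(const unsigned char *src, size_t src_len)
{
	unsigned char scratch[2 * TOY_PAGE_SIZE];
	size_t i, touched = 0;

	assert(src_len <= TOY_PAGE_SIZE);

	memset(scratch, TOY_CANARY, sizeof(scratch));
	toy_sloppy_backend(src, src_len, scratch, TOY_PAGE_SIZE);

	for (i = TOY_PAGE_SIZE; i < sizeof(scratch); i++)
		if (scratch[i] != TOY_CANARY)
			touched++;

	return touched;
}
```

For a full page of input that contains no 0xA5 bytes, the function reports 16 bytes written past the first page; for a 1024-byte input it reports 0. A two-page buffer absorbs this kind of overrun, which matches the reasoning in the thread for keeping 2 * PAGE_SIZE until every backend is known to honour the length it is given.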
diff --git a/mm/zswap.c b/mm/zswap.c
index 7ee54a3d8281..976f278aa507 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
 	struct mutex *mutex;
 	u8 *dst;
 
-	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
 	if (!dst)
 		return -ENOMEM;
 
@@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
-	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
+	sg_init_one(&output, dst, PAGE_SIZE);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,