Message ID | 20250325034210.3337080-1-senozhatsky@chromium.org (mailing list archive)
---|---
State | New
Series | zram: modernize writeback interface
Ccing richardycc@google.com since he has been working on this:

On Tue, Mar 25, 2025 at 12:42:04PM +0900, Sergey Senozhatsky wrote:
> The writeback interface supports a page_index=N parameter which
> performs writeback of the given page. Since we rarely need to
> writeback just one single page, the typical use case involves
> a number of writeback calls, each performing writeback of one
> page:
>
>   echo page_index=100 > zram0/writeback
>   ...
>   echo page_index=200 > zram0/writeback
>   echo page_index=500 > zram0/writeback
>   ...
>   echo page_index=700 > zram0/writeback
>
> One obvious downside of this is that it increases the number of
> syscalls. Less obvious, but a significantly more important downside,
> is that when given only one page to post-process zram cannot perform
> an optimal target selection. This becomes a critical limitation
> when writeback_limit is enabled, because under writeback_limit we
> want to guarantee the highest memory savings, hence we first need
> to writeback pages that release the highest amount of zsmalloc pool
> memory.
>
> This patch adds a page_index_range=LOW-HIGH parameter to the writeback
> interface:
>
>   echo page_index_range=100-200 \
>        page_index_range=500-700 > zram0/writeback
>
> This gives zram a chance to apply an optimal target selection
> strategy on each iteration of the writeback loop.
>
> Apart from that, the patch also unifies parameter passing and resembles
> other "modern" zram device attributes (e.g. recompression), while the
> old interface used a mixed scheme: value-less parameters for mode and
> a key=value format for page_index. We still support the "old" value-less
> format for compatibility reasons.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  Documentation/admin-guide/blockdev/zram.rst |  11 +
>  drivers/block/zram/zram_drv.c               | 321 +++++++++++++-------
>  2 files changed, 227 insertions(+), 105 deletions(-)
>
> diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
> index 9bdb30901a93..9dca86365a4d 100644
> --- a/Documentation/admin-guide/blockdev/zram.rst
> +++ b/Documentation/admin-guide/blockdev/zram.rst
> @@ -369,6 +369,17 @@ they could write a page index into the interface::
>
>  	echo "page_index=1251" > /sys/block/zramX/writeback
>
> +In Linux 6.16 this interface underwent some rework. First, the interface
> +now supports `key=value` format for all of its parameters (`type=huge_idle`,
> +etc.) Second, support for `page_index_range` was introduced, which
> +specifies a `LOW-HIGH` range (or ranges) of pages to be written back. This
> +reduces the number of syscalls, but more importantly it enables an optimal
> +post-processing target selection strategy. Usage example::
>
> +	echo "type=idle" > /sys/block/zramX/writeback
> +	echo "page_index_range=1-100 page_index_range=200-300" > \
> +		/sys/block/zramX/writeback
> +
>  If there are lots of write IO with flash device, potentially, it has
>  flash wearout problem so that admin needs to design write limitation
>  to guarantee storage health for entire product life.
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fda7d8624889..2c39d12bd2d4 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -734,114 +734,19 @@ static void read_from_bdev_async(struct zram *zram, struct page *page,
>  	submit_bio(bio);
>  }
>
> -#define PAGE_WB_SIG "page_index="
> -
> -#define PAGE_WRITEBACK 0
> -#define HUGE_WRITEBACK (1<<0)
> -#define IDLE_WRITEBACK (1<<1)
> -#define INCOMPRESSIBLE_WRITEBACK (1<<2)
> -
> -static int scan_slots_for_writeback(struct zram *zram, u32 mode,
> -				    unsigned long nr_pages,
> -				    unsigned long index,
> -				    struct zram_pp_ctl *ctl)
> +static int zram_writeback_slots(struct zram *zram, struct zram_pp_ctl *ctl)
>  {
> -	for (; nr_pages != 0; index++, nr_pages--) {
> -		bool ok = true;
> -
> -		zram_slot_lock(zram, index);
> -		if (!zram_allocated(zram, index))
> -			goto next;
> -
> -		if (zram_test_flag(zram, index, ZRAM_WB) ||
> -		    zram_test_flag(zram, index, ZRAM_SAME))
> -			goto next;
> -
> -		if (mode & IDLE_WRITEBACK &&
> -		    !zram_test_flag(zram, index, ZRAM_IDLE))
> -			goto next;
> -		if (mode & HUGE_WRITEBACK &&
> -		    !zram_test_flag(zram, index, ZRAM_HUGE))
> -			goto next;
> -		if (mode & INCOMPRESSIBLE_WRITEBACK &&
> -		    !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
> -			goto next;
> -
> -		ok = place_pp_slot(zram, ctl, index);
> -next:
> -		zram_slot_unlock(zram, index);
> -		if (!ok)
> -			break;
> -	}
> -
> -	return 0;
> -}
> -
> -static ssize_t writeback_store(struct device *dev,
> -		struct device_attribute *attr, const char *buf, size_t len)
> -{
> -	struct zram *zram = dev_to_zram(dev);
> -	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> -	struct zram_pp_ctl *ctl = NULL;
> +	unsigned long blk_idx = 0;
> +	struct page *page = NULL;
>  	struct zram_pp_slot *pps;
> -	unsigned long index = 0;
> -	struct bio bio;
>  	struct bio_vec bio_vec;
> -	struct page *page = NULL;
> -	ssize_t ret = len;
> -	int mode, err;
> -	unsigned long blk_idx = 0;
> -
> -	if (sysfs_streq(buf, "idle"))
> -		mode = IDLE_WRITEBACK;
> -	else if (sysfs_streq(buf, "huge"))
> -		mode = HUGE_WRITEBACK;
> -	else if (sysfs_streq(buf, "huge_idle"))
> -		mode = IDLE_WRITEBACK | HUGE_WRITEBACK;
> -	else if (sysfs_streq(buf, "incompressible"))
> -		mode = INCOMPRESSIBLE_WRITEBACK;
> -	else {
> -		if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
> -			return -EINVAL;
> -
> -		if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) ||
> -		    index >= nr_pages)
> -			return -EINVAL;
> -
> -		nr_pages = 1;
> -		mode = PAGE_WRITEBACK;
> -	}
> -
> -	down_read(&zram->init_lock);
> -	if (!init_done(zram)) {
> -		ret = -EINVAL;
> -		goto release_init_lock;
> -	}
> -
> -	/* Do not permit concurrent post-processing actions. */
> -	if (atomic_xchg(&zram->pp_in_progress, 1)) {
> -		up_read(&zram->init_lock);
> -		return -EAGAIN;
> -	}
> -
> -	if (!zram->backing_dev) {
> -		ret = -ENODEV;
> -		goto release_init_lock;
> -	}
> +	struct bio bio;
> +	int ret, err;
> +	u32 index;
>
>  	page = alloc_page(GFP_KERNEL);
> -	if (!page) {
> -		ret = -ENOMEM;
> -		goto release_init_lock;
> -	}
> -
> -	ctl = init_pp_ctl();
> -	if (!ctl) {
> -		ret = -ENOMEM;
> -		goto release_init_lock;
> -	}
> -
> -	scan_slots_for_writeback(zram, mode, nr_pages, index, ctl);
> +	if (!page)
> +		return -ENOMEM;
>
>  	while ((pps = select_pp_slot(ctl))) {
>  		spin_lock(&zram->wb_limit_lock);
> @@ -929,10 +834,216 @@ static ssize_t writeback_store(struct device *dev,
>
>  	if (blk_idx)
>  		free_block_bdev(zram, blk_idx);
> -
> -release_init_lock:
>  	if (page)
>  		__free_page(page);
> +
> +	return ret;
> +}
> +
> +#define PAGE_WRITEBACK			0
> +#define HUGE_WRITEBACK			(1 << 0)
> +#define IDLE_WRITEBACK			(1 << 1)
> +#define INCOMPRESSIBLE_WRITEBACK	(1 << 2)
> +
> +static int parse_page_index(char *val, unsigned long nr_pages,
> +			    unsigned long *lo, unsigned long *hi)
> +{
> +	int ret;
> +
> +	ret = kstrtoul(val, 10, lo);
> +	if (ret)
> +		return ret;
> +	*hi = *lo + 1;
> +	if (*lo >= nr_pages || *hi > nr_pages)
> +		return -ERANGE;
> +	return 0;
> +}
> +
> +static int parse_page_index_range(char *val, unsigned long nr_pages,
> +				  unsigned long *lo, unsigned long *hi)
> +{
> +	char *delim;
> +	int ret;
> +
> +	delim = strchr(val, '-');
> +	if (!delim)
> +		return -EINVAL;
> +
> +	*delim = 0x00;
> +	ret = kstrtoul(val, 10, lo);
> +	if (ret)
> +		return ret;
> +	if (*lo >= nr_pages)
> +		return -ERANGE;
> +
> +	ret = kstrtoul(delim + 1, 10, hi);
> +	if (ret)
> +		return ret;
> +	if (*hi >= nr_pages || *lo > *hi)
> +		return -ERANGE;
> +	*hi += 1;
> +	return 0;
> +}
> +
> +static int parse_mode(char *val, u32 *mode)
> +{
> +	*mode = 0;
> +
> +	if (!strcmp(val, "idle"))
> +		*mode = IDLE_WRITEBACK;
> +	if (!strcmp(val, "huge"))
> +		*mode = HUGE_WRITEBACK;
> +	if (!strcmp(val, "huge_idle"))
> +		*mode = IDLE_WRITEBACK | HUGE_WRITEBACK;
> +	if (!strcmp(val, "incompressible"))
> +		*mode = INCOMPRESSIBLE_WRITEBACK;
> +
> +	if (*mode == 0)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static int scan_slots_for_writeback(struct zram *zram, u32 mode,
> +				    unsigned long lo, unsigned long hi,
> +				    struct zram_pp_ctl *ctl)
> +{
> +	u32 index = lo;
> +
> +	while (index < hi) {
> +		bool ok = true;
> +
> +		zram_slot_lock(zram, index);
> +		if (!zram_allocated(zram, index))
> +			goto next;
> +
> +		if (zram_test_flag(zram, index, ZRAM_WB) ||
> +		    zram_test_flag(zram, index, ZRAM_SAME))
> +			goto next;
> +
> +		if (mode & IDLE_WRITEBACK &&
> +		    !zram_test_flag(zram, index, ZRAM_IDLE))
> +			goto next;
> +		if (mode & HUGE_WRITEBACK &&
> +		    !zram_test_flag(zram, index, ZRAM_HUGE))
> +			goto next;
> +		if (mode & INCOMPRESSIBLE_WRITEBACK &&
> +		    !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
> +			goto next;
> +
> +		ok = place_pp_slot(zram, ctl, index);
> +next:
> +		zram_slot_unlock(zram, index);
> +		if (!ok)
> +			break;
> +		index++;
> +	}
> +
> +	return 0;
> +}
> +
> +static ssize_t writeback_store(struct device *dev,
> +			       struct device_attribute *attr,
> +			       const char *buf, size_t len)
> +{
> +	struct zram *zram = dev_to_zram(dev);
> +	u64 nr_pages = zram->disksize >> PAGE_SHIFT;
> +	unsigned long lo = 0, hi = nr_pages;
> +	struct zram_pp_ctl *ctl = NULL;
> +	char *args, *param, *val;
> +	ssize_t ret = len;
> +	int err, mode = 0;
> +
> +	down_read(&zram->init_lock);
> +	if (!init_done(zram)) {
> +		up_read(&zram->init_lock);
> +		return -EINVAL;
> +	}
> +
> +	/* Do not permit concurrent post-processing actions. */
> +	if (atomic_xchg(&zram->pp_in_progress, 1)) {
> +		up_read(&zram->init_lock);
> +		return -EAGAIN;
> +	}
> +
> +	if (!zram->backing_dev) {
> +		ret = -ENODEV;
> +		goto release_init_lock;
> +	}
> +
> +	ctl = init_pp_ctl();
> +	if (!ctl) {
> +		ret = -ENOMEM;
> +		goto release_init_lock;
> +	}
> +
> +	args = skip_spaces(buf);
> +	while (*args) {
> +		args = next_arg(args, &param, &val);
> +
> +		/*
> +		 * Workaround to support the old writeback interface.
> +		 *
> +		 * The old writeback interface has a minor inconsistency and
> +		 * requires key=value only for page_index parameter, while the
> +		 * writeback mode is a valueless parameter.
> +		 *
> +		 * This is not the case anymore and now all parameters are
> +		 * required to have values, however, we need to support the
> +		 * legacy writeback interface format so we check if we can
> +		 * recognize a valueless parameter as the (legacy) writeback
> +		 * mode.
> +		 */
> +		if (!val || !*val) {
> +			err = parse_mode(param, &mode);
> +			if (err) {
> +				ret = err;
> +				goto release_init_lock;
> +			}
> +
> +			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
> +			break;
> +		}
> +
> +		if (!strcmp(param, "type")) {
> +			err = parse_mode(val, &mode);
> +			if (err) {
> +				ret = err;
> +				goto release_init_lock;
> +			}
> +
> +			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
> +			break;
> +		}
> +
> +		if (!strcmp(param, "page_index")) {
> +			err = parse_page_index(val, nr_pages, &lo, &hi);
> +			if (err) {
> +				ret = err;
> +				goto release_init_lock;
> +			}
> +
> +			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
> +			break;
> +		}
> +
> +		/* There can be several page index ranges */
> +		if (!strcmp(param, "page_index_range")) {
> +			err = parse_page_index_range(val, nr_pages, &lo, &hi);
> +			if (err) {
> +				ret = err;
> +				goto release_init_lock;
> +			}
> +
> +			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
> +			continue;
> +		}
> +	}
> +
> +	err = zram_writeback_slots(zram, ctl);
> +	if (err)
> +		ret = err;
> +
> +release_init_lock:
>  	release_pp_ctl(zram, ctl);
>  	atomic_set(&zram->pp_in_progress, 0);
>  	up_read(&zram->init_lock);
> --
> 2.49.0.395.g12beb8f557-goog
>
Hi Sergey,

Since the input buffer length is only PAGE_SIZE long, can we reduce
the duplicated "page_index_range=" strings? Eg, turn

  echo page_index_range=100-200 \
       page_index_range=500-700 > zram0/writeback

into:

  echo page_index_range=100-200,500-700 > zram0/writeback

On Wed, Mar 26, 2025 at 5:52 AM Minchan Kim <minchan@kernel.org> wrote:
[..]
Hi,

On (25/03/26 12:03), Richard Chang wrote:
> Hi Sergey,
> Since the input buffer length is only PAGE_SIZE long, can we reduce
> the duplicated "page_index_range=" strings? Eg, turn
>
>   echo page_index_range=100-200 \
>        page_index_range=500-700 > zram0/writeback
>
> into:
>
>   echo page_index_range=100-200,500-700 > zram0/writeback

Do you expect to ever have so many ranges that a PAGE_SIZE buffer
would be too small? I didn't want to put a list parser into the
kernel; I wanted to keep argument parsing as simple as possible.
But if you really need to writeback that many pages then I guess
I can implement it as a list of ranges.

Alternatively: we don't necessarily need to use the page_index_range
key, which is a little long, but can use page_indices=/page_indexes=
or just pages=?
Hi Sergey,

On Wed, Mar 26, 2025 at 12:16 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
[..]
> Alternatively:
> We don't necessarily need to use page_index_range key, which is a
> little long, but can use page_indices=/page_indexes= or just pages=?

I am just counting how many pages we could writeback per syscall.
In the worst case, page_index_range with several two-adjacent indices:
assume PAGE_SIZE is 4k and the index range is around 10000.

page_index_range allows 272 pages per syscall:

  page_index_range=10000-10001 page_index_range=10003-10004...

A comma-separated list of ranges allows 678 pages per syscall:

  page_index_range=10000-10001,10003-10004...

page_indices allows 314 pages per syscall:

  page_indices=10000-10001 page_indices=10003-10004...

It's the worst case, so the real world might not be that bad.

Another alternative thought: how about page_index supporting both
single instances and ranges? The key is shorter and the parser is
relatively simpler. Eg:

  page_index=500 page_index=10000-10001
On (25/03/26 15:07), Richard Chang wrote:
> I am just counting how many pages we could writeback per syscall.
> In a worst case, page_index_range with several two-adjacent indices:
> Assume PAGE_SIZE is 4k and index range is around 10000,
>
> page_index_range allows 272 pages per syscall:
>   page_index_range=10000-10001 page_index_range=10003-10004...
>
> List_range allows 678 pages per syscall:
>   page_index_range=10000-10001,10003-10004...

Page index can be u32, much longer than 10000. With really high
indices you'd probably waste more space on "N-N+1" alone than on
page_index=N. I'd say index-range should be used when there is a
range to begin with.

[..]

> Another alternative thought, how about page_index supporting both
> single instance and ranges?
> The key is shorter and the parser is relatively simpler.
> Eg: page_index=500 page_index=10000-10001

I probably can look into it.
On (25/03/26 17:15), Sergey Senozhatsky wrote:
> On (25/03/26 15:07), Richard Chang wrote:
> [..]
> > Another alternative thought, how about page_index supporting both
> > single instance and ranges?
> > The key is shorter and the parser is relatively simpler.
> > Eg: page_index=500 page_index=10000-10001
>
> I probably can look into it.

Can't say I really like that "index" will mean both index and
a range of indexes. But let me think more.
On (25/03/26 17:45), Sergey Senozhatsky wrote:
> On (25/03/26 17:15), Sergey Senozhatsky wrote:
> > On (25/03/26 15:07), Richard Chang wrote:
> > [..]
> > > Another alternative thought, how about page_index supporting both
> > > single instance and ranges?
> > > The key is shorter and the parser is relatively simpler.
> > > Eg: page_index=500 page_index=10000-10001
> >
> > I probably can look into it.
>
> Can't say I really like that "index" will mean both index and
> a range of indexes. But let me think more.

We can permit multiple page_index= as well:

  page_index=1000 page_index=2000 page_indexes=5000-6000

because a one-element range (N-N+1) can easily waste more space than
page_index. (Historically zram permitted only one page_index per call.)
And the plural form of index comes at two extra bytes, which seems
fine. I'll wait for more opinions.

---
	if (!strcmp(param, "page_index")) {
		err = parse_page_index(val, nr_pages, &lo, &hi);
		if (err) {
			ret = err;
			goto release_init_lock;
		}

		scan_slots_for_writeback(zram, mode, lo, hi, ctl);
		continue;
	}

	if (!strcmp(param, "page_indexes")) {
		err = parse_page_indexes(val, nr_pages, &lo, &hi);
		if (err) {
			ret = err;
			goto release_init_lock;
		}

		scan_slots_for_writeback(zram, mode, lo, hi, ctl);
		continue;
	}
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 9bdb30901a93..9dca86365a4d 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -369,6 +369,17 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_index_range` was introduced, which +specify `LOW-HIGH` range (or ranges) of pages to be written-back. This +reduces the number of syscalls, but more importantly this enables optimal +post-processing target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_index_range=1-100 page_index_range=200-300" > \ + /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. 
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fda7d8624889..2c39d12bd2d4 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -734,114 +734,19 @@ static void read_from_bdev_async(struct zram *zram, struct page *page,
 	submit_bio(bio);
 }

-#define PAGE_WB_SIG		"page_index="
-
-#define PAGE_WRITEBACK		0
-#define HUGE_WRITEBACK		(1<<0)
-#define IDLE_WRITEBACK		(1<<1)
-#define INCOMPRESSIBLE_WRITEBACK	(1<<2)
-
-static int scan_slots_for_writeback(struct zram *zram, u32 mode,
-				    unsigned long nr_pages,
-				    unsigned long index,
-				    struct zram_pp_ctl *ctl)
+static int zram_writeback_slots(struct zram *zram, struct zram_pp_ctl *ctl)
 {
-	for (; nr_pages != 0; index++, nr_pages--) {
-		bool ok = true;
-
-		zram_slot_lock(zram, index);
-		if (!zram_allocated(zram, index))
-			goto next;
-
-		if (zram_test_flag(zram, index, ZRAM_WB) ||
-		    zram_test_flag(zram, index, ZRAM_SAME))
-			goto next;
-
-		if (mode & IDLE_WRITEBACK &&
-		    !zram_test_flag(zram, index, ZRAM_IDLE))
-			goto next;
-		if (mode & HUGE_WRITEBACK &&
-		    !zram_test_flag(zram, index, ZRAM_HUGE))
-			goto next;
-		if (mode & INCOMPRESSIBLE_WRITEBACK &&
-		    !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
-			goto next;
-
-		ok = place_pp_slot(zram, ctl, index);
-next:
-		zram_slot_unlock(zram, index);
-		if (!ok)
-			break;
-	}
-
-	return 0;
-}
-
-static ssize_t writeback_store(struct device *dev,
-		struct device_attribute *attr, const char *buf, size_t len)
-{
-	struct zram *zram = dev_to_zram(dev);
-	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
-	struct zram_pp_ctl *ctl = NULL;
+	unsigned long blk_idx = 0;
+	struct page *page = NULL;
 	struct zram_pp_slot *pps;
-	unsigned long index = 0;
-	struct bio bio;
 	struct bio_vec bio_vec;
-	struct page *page = NULL;
-	ssize_t ret = len;
-	int mode, err;
-	unsigned long blk_idx = 0;
-
-	if (sysfs_streq(buf, "idle"))
-		mode = IDLE_WRITEBACK;
-	else if (sysfs_streq(buf, "huge"))
-		mode = HUGE_WRITEBACK;
-	else if (sysfs_streq(buf, "huge_idle"))
-		mode = IDLE_WRITEBACK | HUGE_WRITEBACK;
-	else if (sysfs_streq(buf, "incompressible"))
-		mode = INCOMPRESSIBLE_WRITEBACK;
-	else {
-		if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
-			return -EINVAL;
-
-		if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) ||
-		    index >= nr_pages)
-			return -EINVAL;
-
-		nr_pages = 1;
-		mode = PAGE_WRITEBACK;
-	}
-
-	down_read(&zram->init_lock);
-	if (!init_done(zram)) {
-		ret = -EINVAL;
-		goto release_init_lock;
-	}
-
-	/* Do not permit concurrent post-processing actions. */
-	if (atomic_xchg(&zram->pp_in_progress, 1)) {
-		up_read(&zram->init_lock);
-		return -EAGAIN;
-	}
-
-	if (!zram->backing_dev) {
-		ret = -ENODEV;
-		goto release_init_lock;
-	}
+	struct bio bio;
+	int ret, err;
+	u32 index;

 	page = alloc_page(GFP_KERNEL);
-	if (!page) {
-		ret = -ENOMEM;
-		goto release_init_lock;
-	}
-
-	ctl = init_pp_ctl();
-	if (!ctl) {
-		ret = -ENOMEM;
-		goto release_init_lock;
-	}
-
-	scan_slots_for_writeback(zram, mode, nr_pages, index, ctl);
+	if (!page)
+		return -ENOMEM;

 	while ((pps = select_pp_slot(ctl))) {
 		spin_lock(&zram->wb_limit_lock);
@@ -929,10 +834,216 @@ static ssize_t writeback_store(struct device *dev,

 	if (blk_idx)
 		free_block_bdev(zram, blk_idx);
-
-release_init_lock:
 	if (page)
 		__free_page(page);
+
+	return ret;
+}
+
+#define PAGE_WRITEBACK			0
+#define HUGE_WRITEBACK			(1 << 0)
+#define IDLE_WRITEBACK			(1 << 1)
+#define INCOMPRESSIBLE_WRITEBACK	(1 << 2)
+
+static int parse_page_index(char *val, unsigned long nr_pages,
+			    unsigned long *lo, unsigned long *hi)
+{
+	int ret;
+
+	ret = kstrtoul(val, 10, lo);
+	if (ret)
+		return ret;
+	*hi = *lo + 1;
+	if (*lo >= nr_pages || *hi > nr_pages)
+		return -ERANGE;
+	return 0;
+}
+
+static int parse_page_index_range(char *val, unsigned long nr_pages,
+				  unsigned long *lo, unsigned long *hi)
+{
+	char *delim;
+	int ret;
+
+	delim = strchr(val, '-');
+	if (!delim)
+		return -EINVAL;
+
+	*delim = 0x00;
+	ret = kstrtoul(val, 10, lo);
+	if (ret)
+		return ret;
+	if (*lo >= nr_pages)
+		return -ERANGE;
+
+	ret = kstrtoul(delim + 1, 10, hi);
+	if (ret)
+		return ret;
+	if (*hi >= nr_pages || *lo > *hi)
+		return -ERANGE;
+	*hi += 1;
+	return 0;
+}
+
+static int parse_mode(char *val, u32 *mode)
+{
+	*mode = 0;
+
+	if (!strcmp(val, "idle"))
+		*mode = IDLE_WRITEBACK;
+	if (!strcmp(val, "huge"))
+		*mode = HUGE_WRITEBACK;
+	if (!strcmp(val, "huge_idle"))
+		*mode = IDLE_WRITEBACK | HUGE_WRITEBACK;
+	if (!strcmp(val, "incompressible"))
+		*mode = INCOMPRESSIBLE_WRITEBACK;
+
+	if (*mode == 0)
+		return -EINVAL;
+	return 0;
+}
+
+static int scan_slots_for_writeback(struct zram *zram, u32 mode,
+				    unsigned long lo, unsigned long hi,
+				    struct zram_pp_ctl *ctl)
+{
+	u32 index = lo;
+
+	while (index < hi) {
+		bool ok = true;
+
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index))
+			goto next;
+
+		if (zram_test_flag(zram, index, ZRAM_WB) ||
+		    zram_test_flag(zram, index, ZRAM_SAME))
+			goto next;
+
+		if (mode & IDLE_WRITEBACK &&
+		    !zram_test_flag(zram, index, ZRAM_IDLE))
+			goto next;
+		if (mode & HUGE_WRITEBACK &&
+		    !zram_test_flag(zram, index, ZRAM_HUGE))
+			goto next;
+		if (mode & INCOMPRESSIBLE_WRITEBACK &&
+		    !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
+			goto next;
+
+		ok = place_pp_slot(zram, ctl, index);
+next:
+		zram_slot_unlock(zram, index);
+		if (!ok)
+			break;
+		index++;
+	}
+
+	return 0;
+}
+
+static ssize_t writeback_store(struct device *dev,
+			       struct device_attribute *attr,
+			       const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	u64 nr_pages = zram->disksize >> PAGE_SHIFT;
+	unsigned long lo = 0, hi = nr_pages;
+	struct zram_pp_ctl *ctl = NULL;
+	char *args, *param, *val;
+	ssize_t ret = len;
+	int err, mode = 0;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		up_read(&zram->init_lock);
+		return -EINVAL;
+	}
+
+	/* Do not permit concurrent post-processing actions. */
+	if (atomic_xchg(&zram->pp_in_progress, 1)) {
+		up_read(&zram->init_lock);
+		return -EAGAIN;
+	}
+
+	if (!zram->backing_dev) {
+		ret = -ENODEV;
+		goto release_init_lock;
+	}
+
+	ctl = init_pp_ctl();
+	if (!ctl) {
+		ret = -ENOMEM;
+		goto release_init_lock;
+	}
+
+	args = skip_spaces(buf);
+	while (*args) {
+		args = next_arg(args, &param, &val);
+
+		/*
+		 * Workaround to support the old writeback interface.
+		 *
+		 * The old writeback interface has a minor inconsistency and
+		 * requires key=value only for page_index parameter, while the
+		 * writeback mode is a valueless parameter.
+		 *
+		 * This is not the case anymore and now all parameters are
+		 * required to have values, however, we need to support the
+		 * legacy writeback interface format so we check if we can
+		 * recognize a valueless parameter as the (legacy) writeback
+		 * mode.
+		 */
+		if (!val || !*val) {
+			err = parse_mode(param, &mode);
+			if (err) {
+				ret = err;
+				goto release_init_lock;
+			}
+
+			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
+			break;
+		}
+
+		if (!strcmp(param, "type")) {
+			err = parse_mode(val, &mode);
+			if (err) {
+				ret = err;
+				goto release_init_lock;
+			}
+
+			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
+			break;
+		}
+
+		if (!strcmp(param, "page_index")) {
+			err = parse_page_index(val, nr_pages, &lo, &hi);
+			if (err) {
+				ret = err;
+				goto release_init_lock;
+			}
+
+			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
+			break;
+		}
+
+		/* There can be several page index ranges */
+		if (!strcmp(param, "page_index_range")) {
+			err = parse_page_index_range(val, nr_pages, &lo, &hi);
+			if (err) {
+				ret = err;
+				goto release_init_lock;
+			}
+
+			scan_slots_for_writeback(zram, mode, lo, hi, ctl);
+			continue;
+		}
+	}
+
+	err = zram_writeback_slots(zram, ctl);
+	if (err)
+		ret = err;
+
+release_init_lock:
 	release_pp_ctl(zram, ctl);
 	atomic_set(&zram->pp_in_progress, 0);
 	up_read(&zram->init_lock);