diff mbox series

[RFC,2/2] mm: introduce budgt control in readahead

Message ID 20240509023937.1090421-3-zhaoyang.huang@unisoc.com (mailing list archive)
State New
Headers show
Series introduce budgt control in readahead | expand

Commit Message

zhaoyang.huang May 9, 2024, 2:39 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Currently, readahead's size is decided mainly by page cache's status
like hit/miss or hole size which could lead to suspension of following
bio which is over the size of blk-throttle allowed size when
BLK_THROTTLING is on. Introduce the budgt value here to have the bio's
size be within the legal size.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 mm/readahead.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

Comments

Matthew Wilcox (Oracle) May 9, 2024, 3:15 a.m. UTC | #1
On Thu, May 09, 2024 at 10:39:37AM +0800, zhaoyang.huang wrote:
> -static unsigned long get_next_ra_size(struct file_ra_state *ra,
> +static unsigned long get_next_ra_size(struct readahead_control *ractl,
>  				      unsigned long max)
>  {
> -	unsigned long cur = ra->size;
> +	unsigned long cur = ractl->ra->size;
> +	struct inode *inode = ractl->mapping->host;
> +	unsigned long budgt = inode->i_sb->s_bdev ?
> +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;

You can't do this.  There's no guarantee that the IO is going to
mapping->host->i_sb->s_bdev.  You'd have to figure out how to ask the
filesystem to get the bdev for the particular range (eg the fs might
implement RAID internally).
Christoph Hellwig May 9, 2024, 12:39 p.m. UTC | #2
> +	unsigned long budgt = inode->i_sb->s_bdev ?
> +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;

The readahead code is used for all file systems, you can't just call
into block layer code here.
Zhaoyang Huang May 10, 2024, 2:43 a.m. UTC | #3
On Thu, May 9, 2024 at 11:15 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 09, 2024 at 10:39:37AM +0800, zhaoyang.huang wrote:
> > -static unsigned long get_next_ra_size(struct file_ra_state *ra,
> > +static unsigned long get_next_ra_size(struct readahead_control *ractl,
> >                                     unsigned long max)
> >  {
> > -     unsigned long cur = ra->size;
> > +     unsigned long cur = ractl->ra->size;
> > +     struct inode *inode = ractl->mapping->host;
> > +     unsigned long budgt = inode->i_sb->s_bdev ?
> > +                     blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
>
> You can't do this.  There's no guarantee that the IO is going to
> mapping->host->i_sb->s_bdev.  You'd have to figure out how to ask the
> filesystem to get the bdev for the particular range (eg the fs might
> implement RAID internally).
>
Thanks for the prompt. I did some basic research on soft RAID and
wonder if applying the bps limit on /dev/md0 like below could make
this work.

mdadm -C -v /dev/md0 -l raid0 -n 2 /dev/sd[b-c]1
mount /dev/md0 /mnt/raid0/
echo "/dev/md0 100000" > blkio.throttle.read_bps_device

I didn't find information about 'RAID internally'. Could we set the
limit on the root device(the one used for mount) to manage the whole
partition without caring about where the bio finally goes? Or ask the
user to decide if to use by making sure the device they apply will not
do RAID?
Zhaoyang Huang May 10, 2024, 3:06 a.m. UTC | #4
On Thu, May 9, 2024 at 8:40 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> > +     unsigned long budgt = inode->i_sb->s_bdev ?
> > +                     blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
>
> The readahead code is used for all file systems, you can't just call
> into block layer code here.
>
ok. I would like to know any suggestions on introducing throttle
budget control into readahead which actually works as a negative
feedback path. IMO, negative feedback is a good methodology which has
been used in scheduler(EAS) and thermal control(IPA) and
memory(MGLRU). I would like to suggest to have a try on have it work
cross the boundary of memory and block layer.

vfs_read / page fault
|
readahead  <---------|
|                               |
aops->readpages    |
|                               |
block_layer------------
Matthew Wilcox (Oracle) May 10, 2024, 3:18 a.m. UTC | #5
On Fri, May 10, 2024 at 10:43:20AM +0800, Zhaoyang Huang wrote:
> Thanks for the prompt. I did some basic research on soft RAID and
> wonder if applying the bps limit on /dev/md0 like below could make
> this work.

No.  Look at btrfs' raid support, for example.  it doesn't use md0.

> I didn't find information about 'RAID internally'. Could we set the
> limit on the root device(the one used for mount) to manage the whole
> partition without caring about where the bio finally goes? Or ask the
> user to decide if to use by making sure the device they apply will not
> do RAID?

No.
Matthew Wilcox (Oracle) May 10, 2024, 4:14 a.m. UTC | #6
On Fri, May 10, 2024 at 11:06:14AM +0800, Zhaoyang Huang wrote:
> On Thu, May 9, 2024 at 8:40 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > +     unsigned long budgt = inode->i_sb->s_bdev ?
> > > +                     blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
> >
> > The readahead code is used for all file systems, you can't just call
> > into block layer code here.
> >
> ok. I would like to know any suggestions on introducing throttle
> budget control into readahead which actually works as a negative
> feedback path. IMO, negative feedback is a good methodology which has
> been used in scheduler(EAS) and thermal control(IPA) and
> memory(MGLRU). I would like to suggest to have a try on have it work
> cross the boundary of memory and block layer.
> 
> vfs_read / page fault
> |
> readahead  <---------|
> |                               |
> aops->readpages    |
> |                               |
> block_layer------------

what you could do is have blk-throttle fail bios that are tagged as
readahead if we've hit the threshold?
Zhaoyang Huang May 10, 2024, 7:08 a.m. UTC | #7
On Fri, May 10, 2024 at 12:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 10, 2024 at 11:06:14AM +0800, Zhaoyang Huang wrote:
> > On Thu, May 9, 2024 at 8:40 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > +     unsigned long budgt = inode->i_sb->s_bdev ?
> > > > +                     blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
> > >
> > > The readahead code is used for all file systems, you can't just call
> > > into block layer code here.
> > >
> > ok. I would like to know any suggestions on introducing throttle
> > budget control into readahead which actually works as a negative
> > feedback path. IMO, negative feedback is a good methodology which has
> > been used in scheduler(EAS) and thermal control(IPA) and
> > memory(MGLRU). I would like to suggest to have a try on have it work
> > cross the boundary of memory and block layer.
> >
> > vfs_read / page fault
> > |
> > readahead  <---------|
> > |                               |
> > aops->readpages    |
> > |                               |
> > block_layer------------
>
> what you could do is have blk-throttle fail bios that are tagged as
> readahead if we've hit the threshold?
Actually, blk throttle will postpone the over-size bio's launch by
adding it to the throttle group's private queue which this idea aims
at. The delay here could be avoidable by some means to have the bio
meet the max ability of the throttle blkcg. Furthermore, we may get a
totally non over-sized readahead mechanism if we do this well.
Zhaoyang Huang May 11, 2024, 7:35 a.m. UTC | #8
On Fri, May 10, 2024 at 11:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 10, 2024 at 10:43:20AM +0800, Zhaoyang Huang wrote:
> > Thanks for the prompt. I did some basic research on soft RAID and
> > wonder if applying the bps limit on /dev/md0 like below could make
> > this work.
>
> No.  Look at btrfs' raid support, for example.  it doesn't use md0.
If I understand the below command correctly, btrfs uses one of the
volumes within RAID as the mount block device, not /dev/md0. However,
I think this is a problem of blkio.throttle rather than this commit
which means this readahead budget control will work accordingly as
long as blkio.throttle's parameter is configured correctly(eg. 50/50
on sdb and sdc)

mkfs.btrfs -m raid0 -d raid0 /dev/sdb /dev/sdc
mount -t btrfs /dev/sdb /mnt/btr



>
> > I didn't find information about 'RAID internally'. Could we set the
> > limit on the root device(the one used for mount) to manage the whole
> > partition without caring about where the bio finally goes? Or ask the
> > user to decide if to use by making sure the device they apply will not
> > do RAID?
>
> No.
Zhaoyang Huang May 14, 2024, 2:37 a.m. UTC | #9
On Sat, May 11, 2024 at 3:35 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, May 10, 2024 at 11:18 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 10, 2024 at 10:43:20AM +0800, Zhaoyang Huang wrote:
> > > Thanks for the prompt. I did some basic research on soft RAID and
> > > wonder if applying the bps limit on /dev/md0 like below could make
> > > this work.
> >
> > No.  Look at btrfs' raid support, for example.  it doesn't use md0.
> If I understand the below command correctly, btrfs uses one of the
> volumes within RAID as the mount block device, not /dev/md0. However,
> I think this is a problem of blkio.throttle rather than this commit
> which means this readahead budget control will work accordingly as
> long as blkio.throttle's parameter is configured correctly(eg. 50/50
> on sdb and sdc)
>
> mkfs.btrfs -m raid0 -d raid0 /dev/sdb /dev/sdc
> mount -t btrfs /dev/sdb /mnt/btr
>
>
>
> >
> > > I didn't find information about 'RAID internally'. Could we set the
> > > limit on the root device(the one used for mount) to manage the whole
> > > partition without caring about where the bio finally goes? Or ask the
> > > user to decide if to use by making sure the device they apply will not
> > > do RAID?
> >
> > No.

@all, Please find below for more test results where we can find this
commit has the result meet the desired value more closely and enhance
it by 3% than mainline.

echo "254:48 20000000" > blkio.throttle.read_bps_device
fio -filename=/data/ylog/ap/000-0101_000015_poweron.ylog -rw=read
-direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest

    before : IOPS=37.9k, BW=148MiB/s (155MB/s)(11.6GiB/80333msec)
    after  : IOPS=39.0k, BW=153MiB/s (160MB/s)(15.6GiB/104914msec)

    before : clat (usec): min=4, max=1056.6k, avg=197.23, stdev=10080.69
    after  : clat (usec): min=4, max=193481, avg=188.83, stdev=4651.29

    before : lat (usec): min=5, max=1056.6k, avg=200.48, stdev=10080.76
    after  : lat (usec): min=5, max=193483, avg=192.68, stdev=4651.87


echo "254:48 30000000" > blkio.throttle.read_bps_device
fio -filename=/data/ylog/ap/000-0101_000015_poweron.ylog -rw=read
-direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest

    before : IOPS=57.2k, BW=224MiB/s (234MB/s)(15.6GiB/71561msec)
    after  : IOPS=58.5k, BW=229MiB/s (240MB/s)(15.6GiB/69996msec)

    before : clat (usec): min=4, max=1105.5k, avg=126.20, stdev=6419.22
    after  : clat (usec): min=4, max=183956, avg=120.60, stdev=2957.28

    before : lat (usec): min=5, max=1105.5k, avg=129.45, stdev=6419.29
    after  : lat (usec): min=5, max=183958, avg=124.40, stdev=2958.18
diff mbox series

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..2b6120ced6f9 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -128,6 +128,7 @@ 
 #include <linux/blk-cgroup.h>
 #include <linux/fadvise.h>
 #include <linux/sched/mm.h>
+#include <linux/minmax.h>
 
 #include "internal.h"
 
@@ -358,16 +359,23 @@  static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
  *  Get the previous window size, ramp it up, and
  *  return it as the new window size.
  */
-static unsigned long get_next_ra_size(struct file_ra_state *ra,
+static unsigned long get_next_ra_size(struct readahead_control *ractl,
 				      unsigned long max)
 {
-	unsigned long cur = ra->size;
+	unsigned long cur = ractl->ra->size;
+	struct inode *inode = ractl->mapping->host;
+	unsigned long budgt = inode->i_sb->s_bdev ?
+			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
+	unsigned long val = max;
 
 	if (cur < max / 16)
-		return 4 * cur;
+		val = 4 * cur;
 	if (cur <= max / 2)
-		return 2 * cur;
-	return max;
+		val = 2 * cur;
+
+	val = budgt ? min(budgt / PAGE_SIZE, val) : val;
+
+	return val;
 }
 
 /*
@@ -437,6 +445,8 @@  static int try_context_readahead(struct address_space *mapping,
 				 unsigned long max)
 {
 	pgoff_t size;
+	unsigned long budgt = mapping->host->i_sb->s_bdev ?
+		blk_throttle_budgt(mapping->host->i_sb->s_bdev) : 0;
 
 	size = count_history_pages(mapping, index, max);
 
@@ -455,7 +465,7 @@  static int try_context_readahead(struct address_space *mapping,
 		size *= 2;
 
 	ra->start = index;
-	ra->size = min(size + req_size, max);
+	ra->size = min3(budgt / PAGE_SIZE, size + req_size, max);
 	ra->async_size = 1;
 
 	return 1;
@@ -552,6 +562,8 @@  static void ondemand_readahead(struct readahead_control *ractl,
 	pgoff_t index = readahead_index(ractl);
 	pgoff_t expected, prev_index;
 	unsigned int order = folio ? folio_order(folio) : 0;
+	unsigned long budgt = ractl->mapping->host->i_sb->s_bdev ?
+		blk_throttle_budgt(ractl->mapping->host->i_sb->s_bdev) : 0;
 
 	/*
 	 * If the request exceeds the readahead window, allow the read to
@@ -574,7 +586,7 @@  static void ondemand_readahead(struct readahead_control *ractl,
 			1UL << order);
 	if (index == expected || index == (ra->start + ra->size)) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = get_next_ra_size(ractl, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -599,7 +611,7 @@  static void ondemand_readahead(struct readahead_control *ractl,
 		ra->start = start;
 		ra->size = start - index;	/* old async_size */
 		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = get_next_ra_size(ractl, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -631,6 +643,9 @@  static void ondemand_readahead(struct readahead_control *ractl,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
+	if (budgt)
+		req_size = min(budgt / PAGE_SIZE, req_size);
+
 	do_page_cache_ra(ractl, req_size, 0);
 	return;
 
@@ -647,7 +662,7 @@  static void ondemand_readahead(struct readahead_control *ractl,
 	 * Take care of maximum IO pages as above.
 	 */
 	if (index == ra->start && ra->size == ra->async_size) {
-		add_pages = get_next_ra_size(ra, max_pages);
+		add_pages = get_next_ra_size(ractl, max_pages);
 		if (ra->size + add_pages <= max_pages) {
 			ra->async_size = add_pages;
 			ra->size += add_pages;