| Message ID | 20240509023937.1090421-3-zhaoyang.huang@unisoc.com |
|---|---|
| State | New |
| Series | introduce budgt control in readahead |
On Thu, May 09, 2024 at 10:39:37AM +0800, zhaoyang.huang wrote:
> -static unsigned long get_next_ra_size(struct file_ra_state *ra,
> +static unsigned long get_next_ra_size(struct readahead_control *ractl,
>  				      unsigned long max)
>  {
> -	unsigned long cur = ra->size;
> +	unsigned long cur = ractl->ra->size;
> +	struct inode *inode = ractl->mapping->host;
> +	unsigned long budgt = inode->i_sb->s_bdev ?
> +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;

You can't do this.  There's no guarantee that the IO is going to
mapping->host->i_sb->s_bdev.  You'd have to figure out how to ask the
filesystem to get the bdev for the particular range (eg the fs might
implement RAID internally).
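For context, a purely hypothetical sketch of the kind of interface Matthew is describing. The `get_range_bdev` address_space operation below does not exist in the kernel; it only illustrates that the backing device would have to be asked of the filesystem per file range, rather than read from `i_sb->s_bdev`:

```c
/* HYPOTHETICAL sketch -- no such address_space operation exists today. */
static unsigned long ra_budget_for_range(struct readahead_control *ractl,
					 loff_t pos, size_t len)
{
	struct address_space *mapping = ractl->mapping;
	struct block_device *bdev;

	/*
	 * Only the filesystem knows which bdev backs [pos, pos + len):
	 * e.g. btrfs RAID spreads ranges across member devices.
	 */
	if (!mapping->a_ops->get_range_bdev)
		return 0;	/* no budget information available */

	bdev = mapping->a_ops->get_range_bdev(mapping->host, pos, len);
	return bdev ? blk_throttle_budgt(bdev) : 0;
}
```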
> +	unsigned long budgt = inode->i_sb->s_bdev ?
> +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;

The readahead code is used for all file systems, you can't just call
into block layer code here.
On Thu, May 9, 2024 at 11:15 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 09, 2024 at 10:39:37AM +0800, zhaoyang.huang wrote:
> > -static unsigned long get_next_ra_size(struct file_ra_state *ra,
> > +static unsigned long get_next_ra_size(struct readahead_control *ractl,
> >  				      unsigned long max)
> >  {
> > -	unsigned long cur = ra->size;
> > +	unsigned long cur = ractl->ra->size;
> > +	struct inode *inode = ractl->mapping->host;
> > +	unsigned long budgt = inode->i_sb->s_bdev ?
> > +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
>
> You can't do this.  There's no guarantee that the IO is going to
> mapping->host->i_sb->s_bdev.  You'd have to figure out how to ask the
> filesystem to get the bdev for the particular range (eg the fs might
> implement RAID internally).
>
Thanks for the pointer. I did some basic research on soft RAID and
wonder if applying the bps limit on /dev/md0 as below could make this
work:

mdadm -C -v /dev/md0 -l raid0 -n 2 /dev/sd[b-c]1
mount /dev/md0 /mnt/raid0/
echo "/dev/md0 100000" > blkio.throttle.read_bps_device

I didn't find information about filesystems that 'implement RAID
internally'. Could we set the limit on the root device (the one used
for mount) to manage the whole partition without caring about where
the bio finally goes? Or could we ask the user to opt in by making
sure the device they apply the limit to will not do RAID?
On Thu, May 9, 2024 at 8:40 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> > +	unsigned long budgt = inode->i_sb->s_bdev ?
> > +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
>
> The readahead code is used for all file systems, you can't just call
> into block layer code here.
>
OK. I would welcome any suggestions on introducing throttle budget
control into readahead, which effectively works as a negative
feedback path. IMO, negative feedback is a sound methodology that is
already used in the scheduler (EAS), thermal control (IPA) and memory
management (MGLRU). I would like to suggest trying to make it work
across the boundary between the memory and block layers:

vfs_read / page fault
        |
   readahead <---------|
        |               |
aops->readpages         |
        |               |
   block_layer----------|
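In control-loop terms, the proposal clamps the next readahead window by whatever budget the throttling layer reports back. The sketch below illustrates that control law only; `budget_pages` is a stand-in for a value derived from the series' `blk_throttle_budgt()`, and this is not the patch itself:

```c
/*
 * Illustrative negative-feedback control law for the readahead window:
 * ramp the window up as usual, but never beyond the pages the block
 * layer's throttle budget would let us dispatch without queueing.
 * budget_pages is a stand-in, not a kernel API.
 */
static unsigned long next_window(unsigned long cur, unsigned long max,
				 unsigned long budget_pages)
{
	unsigned long want;

	if (cur < max / 16)
		want = 4 * cur;		/* small window: ramp up fast */
	else if (cur <= max / 2)
		want = 2 * cur;		/* medium window: ramp up gently */
	else
		want = max;		/* saturate at the device maximum */

	/* feedback term: a zero budget means "no limit configured" */
	return budget_pages ? min(want, budget_pages) : want;
}
```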
On Fri, May 10, 2024 at 10:43:20AM +0800, Zhaoyang Huang wrote:
> Thanks for the pointer. I did some basic research on soft RAID and
> wonder if applying the bps limit on /dev/md0 as below could make this
> work.

No.  Look at btrfs' raid support, for example.  It doesn't use md0.

> I didn't find information about filesystems that 'implement RAID
> internally'. Could we set the limit on the root device (the one used
> for mount) to manage the whole partition without caring about where
> the bio finally goes? Or could we ask the user to opt in by making
> sure the device they apply the limit to will not do RAID?

No.
On Fri, May 10, 2024 at 11:06:14AM +0800, Zhaoyang Huang wrote:
> On Thu, May 9, 2024 at 8:40 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > +	unsigned long budgt = inode->i_sb->s_bdev ?
> > > +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
> >
> > The readahead code is used for all file systems, you can't just call
> > into block layer code here.
> >
> OK. I would welcome any suggestions on introducing throttle budget
> control into readahead, which effectively works as a negative
> feedback path. IMO, negative feedback is a sound methodology that is
> already used in the scheduler (EAS), thermal control (IPA) and memory
> management (MGLRU). I would like to suggest trying to make it work
> across the boundary between the memory and block layers:
>
> vfs_read / page fault
>         |
>    readahead <---------|
>         |               |
> aops->readpages         |
>         |               |
>    block_layer----------|

What you could do is have blk-throttle fail bios that are tagged as
readahead if we've hit the threshold?
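Roughly what that suggestion could look like: readahead bios are already tagged with REQ_RAHEAD, so blk-throttle could reject them instead of queueing them once the group is over its limit. A hedged sketch of the idea, placed in the bio submission path of block/blk-throttle.c; the exact hook point and helper names vary across kernel versions, so treat this as an assumption-laden outline:

```c
/*
 * Sketch of the suggestion: if a bio is pure readahead and the
 * throttle group cannot dispatch it now, fail it rather than delay it.
 * Readahead treats a failed bio as "window not populated", so the data
 * is simply fetched later by an ordinary read.
 */
if ((bio->bi_opf & REQ_RAHEAD) && !tg_may_dispatch(tg, bio, NULL)) {
	bio->bi_status = BLK_STS_AGAIN;
	bio_endio(bio);
	return true;		/* bio consumed; skip dispatch/queueing */
}
```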
On Fri, May 10, 2024 at 12:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 10, 2024 at 11:06:14AM +0800, Zhaoyang Huang wrote:
> > On Thu, May 9, 2024 at 8:40 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > +	unsigned long budgt = inode->i_sb->s_bdev ?
> > > > +			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
> > >
> > > The readahead code is used for all file systems, you can't just call
> > > into block layer code here.
> > >
> > OK. I would welcome any suggestions on introducing throttle budget
> > control into readahead, which effectively works as a negative
> > feedback path. IMO, negative feedback is a sound methodology that is
> > already used in the scheduler (EAS), thermal control (IPA) and memory
> > management (MGLRU). I would like to suggest trying to make it work
> > across the boundary between the memory and block layers:
> >
> > vfs_read / page fault
> >         |
> >    readahead <---------|
> >         |               |
> > aops->readpages         |
> >         |               |
> >    block_layer----------|
>
> What you could do is have blk-throttle fail bios that are tagged as
> readahead if we've hit the threshold?

Actually, blk-throttle postpones an over-sized bio's launch by adding
it to the throttle group's private queue, and that queueing delay is
exactly what this idea aims to avoid. The delay is avoidable if we
size the bio to fit within the throttled blkcg's maximum dispatch
ability. Furthermore, if we do this well, we may end up with a
readahead mechanism that never over-sizes at all.
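To put numbers on that queueing delay (assuming blk-throttle's default 100 ms slice, an illustration rather than something stated in this thread): the 20 MB/s limit used in the tests below allows roughly 2 MB per slice, i.e. about 488 4-KiB pages. A readahead bio larger than that sits in the group's queue until enough budget accrues, while a window clamped to the budget can dispatch immediately.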
On Fri, May 10, 2024 at 11:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 10, 2024 at 10:43:20AM +0800, Zhaoyang Huang wrote:
> > Thanks for the pointer. I did some basic research on soft RAID and
> > wonder if applying the bps limit on /dev/md0 as below could make
> > this work.
>
> No.  Look at btrfs' raid support, for example.  It doesn't use md0.
If I understand the commands below correctly, btrfs uses one of the
member devices of the RAID as the mount block device, not /dev/md0.
However, I think this is a problem for blkio.throttle rather than for
this commit: the readahead budget control will still work as intended
as long as blkio.throttle's parameters are configured correctly (e.g.
split 50/50 between sdb and sdc).

mkfs.btrfs -m raid0 -d raid0 /dev/sdb /dev/sdc
mount -t btrfs /dev/sdb /mnt/btr

> > I didn't find information about filesystems that 'implement RAID
> > internally'. Could we set the limit on the root device (the one used
> > for mount) to manage the whole partition without caring about where
> > the bio finally goes? Or could we ask the user to opt in by making
> > sure the device they apply the limit to will not do RAID?
>
> No.
On Sat, May 11, 2024 at 3:35 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, May 10, 2024 at 11:18 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 10, 2024 at 10:43:20AM +0800, Zhaoyang Huang wrote:
> > > Thanks for the pointer. I did some basic research on soft RAID and
> > > wonder if applying the bps limit on /dev/md0 as below could make
> > > this work.
> >
> > No.  Look at btrfs' raid support, for example.  It doesn't use md0.
> If I understand the commands below correctly, btrfs uses one of the
> member devices of the RAID as the mount block device, not /dev/md0.
> However, I think this is a problem for blkio.throttle rather than for
> this commit: the readahead budget control will still work as intended
> as long as blkio.throttle's parameters are configured correctly (e.g.
> split 50/50 between sdb and sdc).
>
> mkfs.btrfs -m raid0 -d raid0 /dev/sdb /dev/sdc
> mount -t btrfs /dev/sdb /mnt/btr
>
> > > I didn't find information about filesystems that 'implement RAID
> > > internally'. Could we set the limit on the root device (the one used
> > > for mount) to manage the whole partition without caring about where
> > > the bio finally goes? Or could we ask the user to opt in by making
> > > sure the device they apply the limit to will not do RAID?
> >
> > No.

@all, please find more test results below, which show that this commit
keeps the measured bandwidth closer to the configured limit and
improves it by about 3% over mainline.

echo "254:48 20000000" > blkio.throttle.read_bps_device
fio -filename=/data/ylog/ap/000-0101_000015_poweron.ylog -rw=read -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest

before: IOPS=37.9k, BW=148MiB/s (155MB/s)(11.6GiB/80333msec)
after:  IOPS=39.0k, BW=153MiB/s (160MB/s)(15.6GiB/104914msec)

before: clat (usec): min=4, max=1056.6k, avg=197.23, stdev=10080.69
after:  clat (usec): min=4, max=193481, avg=188.83, stdev=4651.29

before: lat (usec): min=5, max=1056.6k, avg=200.48, stdev=10080.76
after:  lat (usec): min=5, max=193483, avg=192.68, stdev=4651.87

echo "254:48 30000000" > blkio.throttle.read_bps_device
fio -filename=/data/ylog/ap/000-0101_000015_poweron.ylog -rw=read -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest

before: IOPS=57.2k, BW=224MiB/s (234MB/s)(15.6GiB/71561msec)
after:  IOPS=58.5k, BW=229MiB/s (240MB/s)(15.6GiB/69996msec)

before: clat (usec): min=4, max=1105.5k, avg=126.20, stdev=6419.22
after:  clat (usec): min=4, max=183956, avg=120.60, stdev=2957.28

before: lat (usec): min=5, max=1105.5k, avg=129.45, stdev=6419.29
after:  lat (usec): min=5, max=183958, avg=124.40, stdev=2958.18
diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..2b6120ced6f9 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -128,6 +128,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/fadvise.h>
 #include <linux/sched/mm.h>
+#include <linux/minmax.h>

 #include "internal.h"

@@ -358,16 +359,23 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
  * Get the previous window size, ramp it up, and
  * return it as the new window size.
  */
-static unsigned long get_next_ra_size(struct file_ra_state *ra,
+static unsigned long get_next_ra_size(struct readahead_control *ractl,
 				      unsigned long max)
 {
-	unsigned long cur = ra->size;
+	unsigned long cur = ractl->ra->size;
+	struct inode *inode = ractl->mapping->host;
+	unsigned long budgt = inode->i_sb->s_bdev ?
+			blk_throttle_budgt(inode->i_sb->s_bdev) : 0;
+	unsigned long val = max;

 	if (cur < max / 16)
-		return 4 * cur;
+		val = 4 * cur;
 	if (cur <= max / 2)
-		return 2 * cur;
-	return max;
+		val = 2 * cur;
+
+	val = budgt ? min(budgt / PAGE_SIZE, val) : val;
+
+	return val;
 }

 /*
@@ -437,6 +445,8 @@ static int try_context_readahead(struct address_space *mapping,
 				 unsigned long max)
 {
 	pgoff_t size;
+	unsigned long budgt = mapping->host->i_sb->s_bdev ?
+		blk_throttle_budgt(mapping->host->i_sb->s_bdev) : 0;

 	size = count_history_pages(mapping, index, max);

@@ -455,7 +465,7 @@ static int try_context_readahead(struct address_space *mapping,
 		size *= 2;

 	ra->start = index;
-	ra->size = min(size + req_size, max);
+	ra->size = min3(budgt / PAGE_SIZE, size + req_size, max);
 	ra->async_size = 1;

 	return 1;
@@ -552,6 +562,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	pgoff_t index = readahead_index(ractl);
 	pgoff_t expected, prev_index;
 	unsigned int order = folio ? folio_order(folio) : 0;
+	unsigned long budgt = ractl->mapping->host->i_sb->s_bdev ?
+		blk_throttle_budgt(ractl->mapping->host->i_sb->s_bdev) : 0;

 	/*
 	 * If the request exceeds the readahead window, allow the read to
@@ -574,7 +586,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 				1UL << order);
 	if (index == expected || index == (ra->start + ra->size)) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = get_next_ra_size(ractl, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -599,7 +611,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 		ra->start = start;
 		ra->size = start - index;	/* old async_size */
 		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = get_next_ra_size(ractl, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -631,6 +643,9 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
+	if (budgt)
+		req_size = min(budgt / PAGE_SIZE, req_size);
+
 	do_page_cache_ra(ractl, req_size, 0);
 	return;

@@ -647,7 +662,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * Take care of maximum IO pages as above.
 	 */
 	if (index == ra->start && ra->size == ra->async_size) {
-		add_pages = get_next_ra_size(ra, max_pages);
+		add_pages = get_next_ra_size(ractl, max_pages);
 		if (ra->size + add_pages <= max_pages) {
 			ra->async_size = add_pages;
 			ra->size += add_pages;
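For readers of the archive: blk_throttle_budgt() is added by an earlier patch in this series, whose body does not appear here. Conceptually it reports how many bytes the throttled group may still dispatch; the standalone model below is an assumption-based illustration of that idea (including the 100 ms slice), not the series' implementation:

```c
#include <stdint.h>

/* Illustrative model only -- not the series' blk_throttle_budgt(). */
#define THROTL_SLICE_MS 100u	/* assumed default blk-throttle slice */

/*
 * Bytes a throttled group could still dispatch in the current slice,
 * given its read-bps limit and what it has dispatched so far.
 */
static uint64_t slice_budget_bytes(uint64_t bps_limit,
				   uint64_t bytes_dispatched)
{
	uint64_t slice_bytes = bps_limit * THROTL_SLICE_MS / 1000u;

	return bytes_dispatched >= slice_bytes ?
		0 : slice_bytes - bytes_dispatched;
}
```

With the 20 MB/s test limit, slice_budget_bytes(20000000, 0) comes to 2,000,000 bytes, matching the ~488-page clamp discussed above.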