Message ID | 20230509005703.155321-1-xiubli@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v4] ceph: fix blindly expanding the readahead windows
On Tue, 2023-05-09 at 08:57 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
>
> Blindly expanding the readahead windows will cause unneccessary
> pagecache thrashing and also will introdue the network workload.
> We should disable expanding the windows if the readahead is disabled
> and also shouldn't expand the windows too much.
>
> Expanding forward firstly instead of expanding backward for possible
> sequential reads.
>
> Bound `rreq->len` to the actual file size to restore the previous page
> cache usage.
>
> Cc: stable@vger.kernel.org
> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>
> V4:
> - two small cleanup from Ilya's comments. Thanks
>
>

(cc'ing Steve French since he was asking me about ceph readahead
yesterday)

FWIW, the original idea here was to try to read whole OSD objects when
we can. I can see that that may have been overzealous though, so
ramping up the size more slowly makes sense.

>  fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index ca4dc6450887..683ba9fbd590 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>  	struct inode *inode = rreq->inode;
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	struct ceph_file_layout *lo = &ci->i_layout;
> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
> +	unsigned long max_len = max_pages << PAGE_SHIFT;
> +	loff_t end = rreq->start + rreq->len, new_end;
>  	u32 blockoff;
> -	u64 blockno;
>
> -	/* Expand the start downward */
> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> -	rreq->start = blockno * lo->stripe_unit;
> -	rreq->len += blockoff;
> +	/* Readahead is disabled */
> +	if (!max_pages)
> +		return;
>
> -	/* Now, round up the length to the next block */
> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
> +	/*
> +	 * Try to expand the length forward by rounding up it to the next
> +	 * block, but do not exceed the file size, unless the original
> +	 * request already exceeds it.
> +	 */

Do you really need to clamp this to the i_size? Is it ever possible for
the client to be out of date as to the real file size? If so then you
might end up with a short read when there is actually more data.

I guess by doing this, you're trying to avoid having to call the OSD
back after a short read and get a zero-length read when the file is
shorter than the requested read?

> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
> +	if (new_end > end && new_end <= rreq->start + max_len)
> +		rreq->len = new_end - rreq->start;
> +
> +	/* Try to expand the start downward */
> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> +	if (rreq->len + blockoff <= max_len) {
> +		rreq->start -= blockoff;
> +		rreq->len += blockoff;
> +	}
>  }
>
>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
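To make the expansion arithmetic in the patch concrete, here is a minimal userspace sketch of the same logic, using hypothetical example values (4 MiB stripe unit, 8 MiB readahead cap, a 64 KiB read at offset 6 MiB in a 100 MiB file). It mirrors the kernel code quoted above but is not the kernel code itself:

```c
#include <stdint.h>
#include <stdio.h>

#define MiB (1024ULL * 1024)

/* Equivalent of the kernel's round_up() for this sketch. */
static uint64_t round_up_u64(uint64_t x, uint64_t unit)
{
    return ((x + unit - 1) / unit) * unit;
}

int main(void)
{
    uint64_t stripe_unit = 4 * MiB;   /* lo->stripe_unit */
    uint64_t max_len = 8 * MiB;       /* max_pages << PAGE_SHIFT */
    uint64_t i_size = 100 * MiB;      /* rreq->i_size */
    uint64_t start = 6 * MiB;         /* rreq->start */
    uint64_t len = 64 * 1024;         /* rreq->len */

    /* Expand forward to the next stripe boundary, capped at i_size. */
    uint64_t end = start + len;
    uint64_t new_end = round_up_u64(end, stripe_unit);
    if (new_end > i_size)
        new_end = i_size;
    if (new_end > end && new_end <= start + max_len)
        len = new_end - start;

    /* Then expand backward, only if the total stays under the cap. */
    uint64_t blockoff = start % stripe_unit; /* div_u64_rem() remainder */
    if (len + blockoff <= max_len) {
        start -= blockoff;
        len += blockoff;
    }

    /* Prints: start=4 MiB len=4 MiB -- exactly one aligned stripe unit. */
    printf("start=%llu MiB len=%llu MiB\n",
           (unsigned long long)(start / MiB),
           (unsigned long long)(len / MiB));
    return 0;
}
```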
On 5/9/23 22:18, Jeff Layton wrote:
> On Tue, 2023-05-09 at 08:57 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> Blindly expanding the readahead windows will cause unneccessary
>> pagecache thrashing and also will introdue the network workload.
>> We should disable expanding the windows if the readahead is disabled
>> and also shouldn't expand the windows too much.
>>
>> Expanding forward firstly instead of expanding backward for possible
>> sequential reads.
>>
>> Bound `rreq->len` to the actual file size to restore the previous page
>> cache usage.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
>> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
>> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
>> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>
>> V4:
>> - two small cleanup from Ilya's comments. Thanks
>>
>>
> (cc'ing Steve French since he was asking me about ceph readahead
> yesterday)
>
> FWIW, the original idea here was to try to read whole OSD objects when
> we can. I can see that that may have been overzealous though, so
> ramping up the size more slowly makes sense.
>
>>  fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>>  1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index ca4dc6450887..683ba9fbd590 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>>  	struct inode *inode = rreq->inode;
>>  	struct ceph_inode_info *ci = ceph_inode(inode);
>>  	struct ceph_file_layout *lo = &ci->i_layout;
>> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
>> +	unsigned long max_len = max_pages << PAGE_SHIFT;
>> +	loff_t end = rreq->start + rreq->len, new_end;
>>  	u32 blockoff;
>> -	u64 blockno;
>>
>> -	/* Expand the start downward */
>> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> -	rreq->start = blockno * lo->stripe_unit;
>> -	rreq->len += blockoff;
>> +	/* Readahead is disabled */
>> +	if (!max_pages)
>> +		return;
>>
>> -	/* Now, round up the length to the next block */
>> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
>> +	/*
>> +	 * Try to expand the length forward by rounding up it to the next
>> +	 * block, but do not exceed the file size, unless the original
>> +	 * request already exceeds it.
>> +	 */

Hi Jeff,

> Do you really need to clamp this to the i_size? Is it ever possible for
> the client to be out of date as to the real file size? If so then you
> might end up with a short read when there is actually more data.
>
> I guess by doing this, you're trying to avoid having to call the OSD
> back after a short read and get a zero-length read when the file is
> shorter than the requested read?

This is folded in from Weiwen's other fix:
https://patchwork.kernel.org/project/ceph-devel/patch/20230504082510.247-1-sehuww@mail.scut.edu.cn/.
For the small-files use case, skipping this clamp really could cause
unnecessary network workload and inefficient usage of the page cache.

Thanks
- Xiubo

>> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
>> +	if (new_end > end && new_end <= rreq->start + max_len)
>> +		rreq->len = new_end - rreq->start;
>> +
>> +	/* Try to expand the start downward */
>> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> +	if (rreq->len + blockoff <= max_len) {
>> +		rreq->start -= blockoff;
>> +		rreq->len += blockoff;
>> +	}
>>  }
>>
>>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
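To illustrate the small-file point with assumed numbers (a 4 KiB file and a 4 MiB stripe unit; example values only, not the kernel code), a minimal sketch comparing the old unconditional roundup with the clamped forward expansion:

```c
#include <stdio.h>

int main(void)
{
    unsigned long stripe_unit = 4UL << 20;  /* 4 MiB stripe unit */
    unsigned long i_size = 4096;            /* 4 KiB file */
    unsigned long start = 0, len = 4096;

    /* Old behaviour: roundup(rreq->len, lo->stripe_unit). */
    unsigned long old_len =
        ((len + stripe_unit - 1) / stripe_unit) * stripe_unit;

    /* New behaviour: round up, but clamp the end to i_size. */
    unsigned long end = start + len;
    unsigned long new_end =
        ((end + stripe_unit - 1) / stripe_unit) * stripe_unit;
    if (new_end > i_size)
        new_end = i_size;                   /* the clamp from Weiwen's fix */
    unsigned long new_len = (new_end > end) ? new_end - start : len;

    /* Prints: old=4194304 new=4096 -- a 4 KiB read no longer
     * inflates into a 4 MiB OSD read. */
    printf("old=%lu new=%lu\n", old_len, new_len);
    return 0;
}
```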
On Tue, May 09, 2023 at 08:57:03AM +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
>
> Blindly expanding the readahead windows will cause unneccessary
> pagecache thrashing and also will introdue the network workload.
                                    ^^^^^^^^
                                    introduce
> We should disable expanding the windows if the readahead is disabled
> and also shouldn't expand the windows too much.
>
> Expanding forward firstly instead of expanding backward for possible
> sequential reads.
>
> Bound `rreq->len` to the actual file size to restore the previous page
> cache usage.
>
> Cc: stable@vger.kernel.org
> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>
> V4:
> - two small cleanup from Ilya's comments. Thanks
>
>
>  fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index ca4dc6450887..683ba9fbd590 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>  	struct inode *inode = rreq->inode;
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	struct ceph_file_layout *lo = &ci->i_layout;
> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;

I think it is better to use `ractl->ra->ra_pages' instead of
`inode->i_sb->s_bdi->ra_pages'. So that we can consider per-request ra
size config, e.g., posix_fadvise(POSIX_FADV_SEQUENTIAL) will double the
ra_pages.

But `ractl' is not passed to this function. Can we just add this
argument? ceph seems to be the only implementation of expand_readahead,
so it should be easy. Or since this patch will be backported, maybe we
should keep it simple, and write another patch for this?

> +	unsigned long max_len = max_pages << PAGE_SHIFT;
> +	loff_t end = rreq->start + rreq->len, new_end;
>  	u32 blockoff;
> -	u64 blockno;
>
> -	/* Expand the start downward */
> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> -	rreq->start = blockno * lo->stripe_unit;
> -	rreq->len += blockoff;
> +	/* Readahead is disabled */
> +	if (!max_pages)
> +		return;

If we have access to ractl here, we can also skip expanding on
`ractl->file->f_mode & FMODE_RANDOM', which is set by
`posix_fadvise(POSIX_FADV_RANDOM)'.

>
> -	/* Now, round up the length to the next block */
> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
> +	/*
> +	 * Try to expand the length forward by rounding up it to the next
                                                  ^^
                                                  an extra space
> +	 * block, but do not exceed the file size, unless the original
> +	 * request already exceeds it.
> +	 */
> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
> +	if (new_end > end && new_end <= rreq->start + max_len)
> +		rreq->len = new_end - rreq->start;
> +
> +	/* Try to expand the start downward */
> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> +	if (rreq->len + blockoff <= max_len) {
> +		rreq->start -= blockoff;
> +		rreq->len += blockoff;
> +	}
>  }
>
>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
> --
> 2.40.0
>
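For context on the per-file hints mentioned above, a small userspace sketch (the path "data.bin" is a placeholder): POSIX_FADV_SEQUENTIAL doubles the file's readahead window relative to the backing-device default, while POSIX_FADV_RANDOM sets FMODE_RANDOM on the struct file, which suppresses heuristic readahead for that descriptor.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_fadvise() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* placeholder path */
    int err;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Hint sequential access: the kernel doubles this file's
     * readahead window (file->f_ra.ra_pages). */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err)
        fprintf(stderr, "fadvise(SEQUENTIAL): %s\n", strerror(err));

    /* ... large sequential reads benefit from the bigger window ... */

    /* Hint random access: sets FMODE_RANDOM, so page_cache_sync_ra()
     * reads only what was asked for instead of ramping up a window. */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (err)
        fprintf(stderr, "fadvise(RANDOM): %s\n", strerror(err));

    close(fd);
    return 0;
}
```

Note that posix_fadvise() returns the error number directly rather than setting errno, hence the strerror(err) calls above.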
On 5/10/23 18:01, 胡玮文 wrote:
> On Tue, May 09, 2023 at 08:57:03AM +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> Blindly expanding the readahead windows will cause unneccessary
>> pagecache thrashing and also will introdue the network workload.
>                                     ^^^^^^^^
>                                     introduce
>
>> We should disable expanding the windows if the readahead is disabled
>> and also shouldn't expand the windows too much.
>>
>> Expanding forward firstly instead of expanding backward for possible
>> sequential reads.
>>
>> Bound `rreq->len` to the actual file size to restore the previous page
>> cache usage.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
>> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
>> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
>> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>
>> V4:
>> - two small cleanup from Ilya's comments. Thanks
>>
>>
>>  fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>>  1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index ca4dc6450887..683ba9fbd590 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>>  	struct inode *inode = rreq->inode;
>>  	struct ceph_inode_info *ci = ceph_inode(inode);
>>  	struct ceph_file_layout *lo = &ci->i_layout;
>> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
> I think it is better to use `ractl->ra->ra_pages' instead of
> `inode->i_sb->s_bdi->ra_pages'. So that we can consider per-request ra
> size config, e.g., posix_fadvise(POSIX_FADV_SEQUENTIAL) will double the
> ra_pages.

Yeah, good catch.

> But `ractl' is not passed to this function. Can we just add this
> argument? ceph seems to be the only implementation of expand_readahead,
> so it should be easy. Or since this patch will be backported, maybe we
> should keep it simple, and write another patch for this?

I think we should fix it together with this. It should be easy.

We can just store the "file->f_ra->ra_pages" in ceph_init_request()
instead, because each rreq will be related to a dedicated file.

>> +	unsigned long max_len = max_pages << PAGE_SHIFT;
>> +	loff_t end = rreq->start + rreq->len, new_end;
>>  	u32 blockoff;
>> -	u64 blockno;
>>
>> -	/* Expand the start downward */
>> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> -	rreq->start = blockno * lo->stripe_unit;
>> -	rreq->len += blockoff;
>> +	/* Readahead is disabled */
>> +	if (!max_pages)
>> +		return;
> If we have access to ractl here, we can also skip expanding on
> `ractl->file->f_mode & FMODE_RANDOM', which is set by
> `posix_fadvise(POSIX_FADV_RANDOM)'.

Correct, because `page_cache_sync_ra()' will do the right thing and we
can skip expanding it here in ceph. A rough sketch of this follow-up
idea appears after this mail.

Thanks
- Xiubo

>>
>> -	/* Now, round up the length to the next block */
>> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
>> +	/*
>> +	 * Try to expand the length forward by rounding up it to the next
>                                                   ^^
>                                                   an extra space
>
>> +	 * block, but do not exceed the file size, unless the original
>> +	 * request already exceeds it.
>> +	 */
>> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
>> +	if (new_end > end && new_end <= rreq->start + max_len)
>> +		rreq->len = new_end - rreq->start;
>> +
>> +	/* Try to expand the start downward */
>> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> +	if (rreq->len + blockoff <= max_len) {
>> +		rreq->start -= blockoff;
>> +		rreq->len += blockoff;
>> +	}
>>  }
>>
>>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
>> --
>> 2.40.0
>>
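A rough sketch of that follow-up idea: capture the per-file readahead state when the request is initialised inside fs/ceph/addr.c, so the expand callback can honour posix_fadvise() hints. The struct and field names here (ceph_netfs_priv, file_ra_pages, file_ra_disabled) are hypothetical; this is not the actual follow-up patch.

```c
/* Hypothetical per-request private data, stored in rreq->netfs_priv.
 * It would need to be freed in the request-cleanup callback. */
struct ceph_netfs_priv {
	unsigned int file_ra_pages;   /* copied from file->f_ra.ra_pages */
	bool file_ra_disabled;        /* file->f_mode & FMODE_RANDOM */
};

static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
{
	struct ceph_netfs_priv *priv;

	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	if (file) {
		/* Per-file window; posix_fadvise() may have changed it. */
		priv->file_ra_pages = file->f_ra.ra_pages;
		priv->file_ra_disabled = file->f_mode & FMODE_RANDOM;
	}
	rreq->netfs_priv = priv;
	return 0;
}

static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
{
	struct ceph_netfs_priv *priv = rreq->netfs_priv;

	/* Readahead is disabled for this file, or globally. */
	if (!priv->file_ra_pages || priv->file_ra_disabled)
		return;

	/* ... expansion logic from the patch, with the cap derived
	 * from priv->file_ra_pages instead of s_bdi->ra_pages ... */
}
```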
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index ca4dc6450887..683ba9fbd590 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
 	struct inode *inode = rreq->inode;
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_file_layout *lo = &ci->i_layout;
+	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
+	unsigned long max_len = max_pages << PAGE_SHIFT;
+	loff_t end = rreq->start + rreq->len, new_end;
 	u32 blockoff;
-	u64 blockno;
 
-	/* Expand the start downward */
-	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
-	rreq->start = blockno * lo->stripe_unit;
-	rreq->len += blockoff;
+	/* Readahead is disabled */
+	if (!max_pages)
+		return;
 
-	/* Now, round up the length to the next block */
-	rreq->len = roundup(rreq->len, lo->stripe_unit);
+	/*
+	 * Try to expand the length forward by rounding up it to the next
+	 * block, but do not exceed the file size, unless the original
+	 * request already exceeds it.
+	 */
+	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
+	if (new_end > end && new_end <= rreq->start + max_len)
+		rreq->len = new_end - rreq->start;
+
+	/* Try to expand the start downward */
+	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
+	if (rreq->len + blockoff <= max_len) {
+		rreq->start -= blockoff;
+		rreq->len += blockoff;
+	}
 }
 
 static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
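One detail of the final diff worth spelling out: after this patch the quotient returned by div_u64_rem() is no longer needed, only the remainder. A short annotated excerpt of the same call:

```c
/*
 * div_u64_rem() divides a 64-bit value by a 32-bit divisor, returning
 * the quotient and storing the remainder through the third argument;
 * it exists because plain 64-bit '/' and '%' are not available on all
 * 32-bit architectures. The old code used the quotient (blockno) to
 * recompute the aligned start; the new code only needs the remainder,
 * i.e. the offset of rreq->start within its stripe unit.
 */
u32 blockoff;

div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
/* blockoff == rreq->start % lo->stripe_unit */
```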