Message ID | CAOg9mSTQ-zNKXQGBK9QEnwJCvwqh=zFLbLJZy-ibGZwLve4o0w@mail.gmail.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] implement orangefs_readahead
On Sun, Jan 31, 2021 at 05:25:02PM -0500, Mike Marshall wrote:
> I wish I knew how to specify _nr_pages in the readahead_control
> structure so that all the extra pages I need could be obtained
> in readahead_page instead of part there and the rest in my
> open-coded stuff in orangefs_readpage. But it looks to me as
> if values in the readahead_control structure are set heuristically
> outside of my control over in ondemand_readahead?

That's right (for now).  I pointed you at some code from Dave Howells
that will allow orangefs to enlarge the readahead window beyond that
determined by the core code's algorithms.

> [root@vm3 linux]# git diff master..readahead
> diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
> index 48f0547d4850..682a968cb82a 100644
> --- a/fs/orangefs/inode.c
> +++ b/fs/orangefs/inode.c
> @@ -244,6 +244,25 @@ static int orangefs_writepages(struct
> address_space *mapping,
>
>  static int orangefs_launder_page(struct page *);
>
> +/*
> + * Prefill the page cache with some pages that we're probably
> + * about to need...
> + */
> +static void orangefs_readahead(struct readahead_control *rac)
> +{
> +	pgoff_t index = readahead_index(rac);
> +	struct page *page;
> +
> +	while ((page = readahead_page(rac))) {
> +		prefetchw(&page->flags);
> +		put_page(page);
> +		unlock_page(page);
> +		index++;
> +	}
> +
> +	return;
> +}

This is not the way to do it.  You need to actually kick off readahead in
this routine so that you get pipelining (ie the app is working on pages
0-15 at the same time the server is getting you pages 16-31).  I don't
see much support in orangefs for doing async operations; everything
seems to be modelled on "submit an I/O and wait for it to complete".

I'm happy to help out with pagecache interactions, but adding async
support to orangefs is a little bigger task than I'm willing to put
significant effort into right now.
>> This is not the way to do it. You need to actually kick
>> off readahead in this routine so that you get pipelining
>> (ie the app is working on pages 0-15 at the same time
>> the server is getting you pages 16-31).

Orangefs isn't very good at reading or writing a few
pages at a time. Its optimal block size is four megabytes.
I'm trying to do IOs big enough to make Orangefs
start flowing like it needs to and then have pages
on hand to fill with the data. Perhaps I can figure
out how to use Dave Howells' code to control the
readahead window and make adjustments to
how many pages Orangefs reads per IO and
end up with something that is closer to how
readahead is intended to be used.

This patch is a big performance improvement over
the code that's upstream even though I'm
not using readahead as intended.

>> I don't see much support in orangefs for doing async
>> operations; everything seems to be modelled on
>> "submit an I/O and wait for it to complete".

Yep... when we were polishing up the kernel module to
attempt to go upstream, the code in there for async was
left behind... I might be able to make sense of it now,
Ida know... You've helped me to see this place where
it is needed.

>> adding async
>> support to orangefs is a little bigger task than I'm willing to put
>> significant effort into right now.

The effort and help that you're providing is much
appreciated and just what I need, thanks!

-Mike
Greetings everyone.

I have made another version of orangefs_readahead, without any
of my hand-rolled page cache manipulations. I read a bunch of
the source in other filesystems and mm and fs and pagemap.h to
try and get an idea of how to implement readahead so that my
implementation is "with the program".

I have described the flawed code I have upstream now in an
earlier message. My flawed code has no readahead implementation,
but it is much faster than with this readahead implementation.

If this readahead implementation is "the right idea", I can
use it as a framework to implement an async orangefs read function
and start the read at the beginning of my readahead function
and collect the results at the end after the readahead pages
have been marshaled. Also, once some mechanism like David Howells'
code to control the readahead window goes upstream, I should be
able to take big enough gulps of readahead to make Orangefs do right.
The heuristically chosen 64 page max that I can get now isn't enough.

I hope some of y'all have the time to review this implementation of
readahead...

Thanks!

-Mike

static void orangefs_readahead(struct readahead_control *rac)
{
	struct page **pages;
	unsigned int npages = readahead_count(rac);
	loff_t offset = readahead_pos(rac);
	struct bio_vec *bvs;
	int i;
	struct iov_iter iter;
	struct file *file = rac->file;
	struct inode *inode = file->f_mapping->host;
	int ret;

	/* allocate an array of page pointers. */
	pages = kzalloc(npages * (sizeof(struct page *)), GFP_KERNEL);

	/* Get a batch of pages to read. */
	npages = __readahead_batch(rac, pages, npages);

	/* allocate an array of bio_vecs. */
	bvs = kzalloc(npages * (sizeof(struct bio_vec)), GFP_KERNEL);

	/* hook the bio_vecs to the pages. */
	for (i = 0; i < npages; i++) {
		bvs[i].bv_page = pages[i];
		bvs[i].bv_len = PAGE_SIZE;
		bvs[i].bv_offset = 0;
	}

	iov_iter_bvec(&iter, READ, bvs, npages, npages * PAGE_SIZE);

	/* read in the pages. */
	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
		npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);

	/* clean up. */
	for (i = 0; i < npages; i++) {
		SetPageUptodate(bvs[i].bv_page);
		unlock_page(bvs[i].bv_page);
		put_page(bvs[i].bv_page);
	}
	kfree(pages);
	kfree(bvs);
}
So I took my orangefs_readahead patch and David Howells' patches
from his fscache-iter branch and put them on top of Linux 5.12-rc3
so as to get rid of all that RIP: 0010:ttm_bo_release+0x2ea/0x340 [ttm]
stuff that was happening to VMs with 5.12-rc2.

Then I added a readahead_expand call at the start of
orangefs_readahead. I played with various expansion values, but
the bottom line is: it works GREAT in my simple tests, speeds
reads WAY up.

-Mike
Mike Marshall <hubcap@omnibond.com> wrote:

>	/* allocate an array of bio_vecs. */
>	bvs = kzalloc(npages * (sizeof(struct bio_vec)), GFP_KERNEL);

Better to use kcalloc() here as it has overflow checking.

>	/* hook the bio_vecs to the pages. */
>	for (i = 0; i < npages; i++) {
>		bvs[i].bv_page = pages[i];
>		bvs[i].bv_len = PAGE_SIZE;
>		bvs[i].bv_offset = 0;
>	}
>
>	iov_iter_bvec(&iter, READ, bvs, npages, npages * PAGE_SIZE);
>
>	/* read in the pages. */
>	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
>		npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);
>
>	/* clean up. */
>	for (i = 0; i < npages; i++) {
>		SetPageUptodate(bvs[i].bv_page);
>		unlock_page(bvs[i].bv_page);
>		put_page(bvs[i].bv_page);
>	}
>	kfree(pages);
>	kfree(bvs);
> }

Could you try ITER_XARRAY instead of ITER_BVEC:

	https://lore.kernel.org/linux-fsdevel/161653786033.2770958.14154191921867463240.stgit@warthog.procyon.org.uk/T/#u

Setting the iterator looks like:

	iov_iter_xarray(&iter, READ, &mapping->i_pages,
			offset, npages * PAGE_SIZE);

The xarray iterator will handle THPs, but I'm not sure if bvecs will.

Cleanup afterwards would look something like:

	static void afs_file_read_done(struct afs_read *req)
	{
		struct afs_vnode *vnode = req->vnode;
		struct page *page;
		pgoff_t index = req->pos >> PAGE_SHIFT;
		pgoff_t last = index + req->nr_pages - 1;

		XA_STATE(xas, &vnode->vfs_inode.i_mapping->i_pages, index);

		if (iov_iter_count(req->iter) > 0) {
			/* The read was short - clear the excess buffer. */
			_debug("afterclear %zx %zx %llx/%llx",
			       req->iter->iov_offset,
			       iov_iter_count(req->iter),
			       req->actual_len, req->len);
			iov_iter_zero(iov_iter_count(req->iter), req->iter);
		}

		rcu_read_lock();
		xas_for_each(&xas, page, last) {
			page_endio(page, false, 0);
			put_page(page);
		}
		rcu_read_unlock();

		task_io_account_read(req->len);
		req->cleanup = NULL;
	}

David
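[For reference, a minimal sketch of the kcalloc() suggestion. The NULL
checks are an addition here, not in the posted code, and both allocations
are moved before __readahead_batch() so the error path is trivial: no
pages have been consumed yet, so the core readahead code still owns them
and will unlock/put them itself.]

	/* kcalloc() zeroes the arrays and checks npages * size for overflow. */
	pages = kcalloc(npages, sizeof(struct page *), GFP_KERNEL);
	bvs = kcalloc(npages, sizeof(struct bio_vec), GFP_KERNEL);
	if (!pages || !bvs) {
		/* Nothing consumed from rac yet; the core code will
		 * unlock and put the pages for us.
		 */
		kfree(pages);
		kfree(bvs);
		return;
	}

	/* Only now consume the pages from the readahead window. */
	npages = __readahead_batch(rac, pages, npages);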
Hi David.

I implemented a version with iov_iter_xarray (below).
It appears to be doing "the right thing" when it
gets called, but then I get a backtrace in the kernel
ring buffer "RIP: 0010:read_pages+0x1a1/0x2c0" which is

page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
------------[ cut here ]------------
kernel BUG at include/linux/pagemap.h:892!

So it seems that in mm/readahead.c/read_pages I end up
entering the "Clean up the remaining pages" part, and
never make it through even one iteration... it happens
whether I use readahead_expand or not.

I've been looking at it a long time :-), I'll look more
tomorrow... do you see anything obvious?

static void orangefs_readahead_cleanup(struct xarray *i_pages,
				       pgoff_t index,
				       unsigned int npages,
				       struct iov_iter *iter)
{
	pgoff_t last;
	struct page *page;
	XA_STATE(xas, i_pages, index);

	last = npages - 1;

	if (iov_iter_count(iter) > 0)
		iov_iter_zero(iov_iter_count(iter), iter);

	rcu_read_lock();
	xas_for_each(&xas, page, last) {
		page_endio(page, false, 0);
		put_page(page);
	}
	rcu_read_unlock();
}

static void orangefs_readahead(struct readahead_control *rac)
{
	unsigned int npages;
	loff_t offset;
	struct iov_iter iter;
	struct file *file = rac->file;
	struct inode *inode = file->f_mapping->host;
	struct xarray *i_pages;
	pgoff_t index;
	int ret;
	loff_t new_start = readahead_index(rac) * PAGE_SIZE;
	size_t new_len = 524288;

	readahead_expand(rac, new_start, new_len);

	npages = readahead_count(rac);
	offset = readahead_pos(rac);
	i_pages = &file->f_mapping->i_pages;

	iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

	/* read in the pages. */
	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
		npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);

	/* clean up. */
	index = offset >> PAGE_SHIFT;
	orangefs_readahead_cleanup(i_pages, index, npages, &iter);
}
On Fri, Mar 26, 2021 at 10:55:44PM -0400, Mike Marshall wrote:
> Hi David.
>
> I implemented a version with iov_iter_xarray (below).
> It appears to be doing "the right thing" when it
> gets called, but then I get a backtrace in the kernel
> ring buffer "RIP: 0010:read_pages+0x1a1/0x2c0" which is
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> ------------[ cut here ]------------
> kernel BUG at include/linux/pagemap.h:892!
>
> So it seems that in mm/readahead.c/read_pages I end up
> entering the "Clean up the remaining pages" part, and
> never make it through even one iteration... it happens
> whether I use readahead_expand or not.
>
> I've been looking at it a long time :-), I'll look more
> tomorrow... do you see anything obvious?

Yes; Dave's sample code doesn't consume the pages from the readahead
iterator, so the core code thinks you didn't consume them and unlocks
/ puts the pages for you.  That goes wrong, because you did actually
consume them.  Glad I added the assertions now!

We should probably add something like:

	static inline void readahead_consume(struct readahead_control *ractl,
					     unsigned int nr)
	{
		ractl->_nr_pages -= nr;
		ractl->_index += nr;
	}

to indicate that you consumed the pages other than by calling
readahead_page() or readahead_page_batch().  Or maybe Dave can
wrap iov_iter_xarray() in a readahead_iter() macro or something
that takes care of adjusting index & nr_pages for you.
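[A hypothetical sketch of how Mike's ->readahead would use the helper
proposed above. Note that readahead_consume() is only a suggestion in
this message, not an upstream API, and orangefs_readahead_cleanup() is
the helper from Mike's previous mail.]

	iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
		npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);

	/* Tell read_pages() we consumed the pages ourselves, so it
	 * doesn't unlock/put them a second time and trip the
	 * VM_BUG_ON_PAGE assertion.
	 */
	readahead_consume(rac, npages);

	orangefs_readahead_cleanup(i_pages, offset >> PAGE_SHIFT, npages, &iter);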
Matthew Wilcox <willy@infradead.org> wrote:

> > I've been looking at it a long time :-), I'll look more
> > tomorrow... do you see anything obvious?
>
> Yes; Dave's sample code doesn't consume the pages from the readahead
> iterator, so the core code thinks you didn't consume them and unlocks
> / puts the pages for you.  That goes wrong, because you did actually
> consume them.  Glad I added the assertions now!

Yeah...  The cleanup function that I posted potentially happens
asynchronously from the ->readahead function, so the ractl consumption
has to be done elsewhere and may already have happened.  In the case of
the code I posted from, it's actually done in the netfs lib that's in
the works:

	void netfs_readahead(...)
	{
		...
		/* Drop the refs on the pages here rather than in the cache or
		 * filesystem.  The locks will be dropped in netfs_rreq_unlock().
		 */
		while ((page = readahead_page(ractl)))
			put_page(page);
		...
	}

> We should probably add something like:
>
>	static inline void readahead_consume(struct readahead_control *ractl,
>					     unsigned int nr)
>	{
>		ractl->_nr_pages -= nr;
>		ractl->_index += nr;
>	}
>
> to indicate that you consumed the pages other than by calling
> readahead_page() or readahead_page_batch().  Or maybe Dave can
> wrap iov_iter_xarray() in a readahead_iter() macro or something
> that takes care of adjusting index & nr_pages for you.

I'm not sure either is useful for my case since iov_iter_xarray() for me
isn't being used anywhere that there's an ractl and I still have to drop
the page refs.

However, in Mike's orangefs_readahead_cleanup(), he could replace:

	rcu_read_lock();
	xas_for_each(&xas, page, last) {
		page_endio(page, false, 0);
		put_page(page);
	}
	rcu_read_unlock();

with:

	while ((page = readahead_page(ractl))) {
		page_endio(page, false, 0);
		put_page(page);
	}

maybe?

David
On Sat, Mar 27, 2021 at 08:31:38AM +0000, David Howells wrote:
> However, in Mike's orangefs_readahead_cleanup(), he could replace:
>
>	rcu_read_lock();
>	xas_for_each(&xas, page, last) {
>		page_endio(page, false, 0);
>		put_page(page);
>	}
>	rcu_read_unlock();
>
> with:
>
>	while ((page = readahead_page(ractl))) {
>		page_endio(page, false, 0);
>		put_page(page);
>	}
>
> maybe?

I'd rather see that than open-coded use of the XArray.  It's mildly
slower, but given that we're talking about doing I/O, probably not
enough to care about.
OK... I used David's suggestion, and also put it right in
orangefs_readahead; orangefs_readahead_cleanup is gone. It seems
to me to work great. I ran it with some printks in it and watched
it do like I think it ought to.

Here's an example of what's upstream in 5.11.8-200.fc33.x86_64:

# dd if=/pvfsmnt/z1 of=/dev/null bs=4194304 count=30
30+0 records in
30+0 records out
125829120 bytes (126 MB, 120 MiB) copied, 5.77943 s, 21.8 MB/s

And here's this version of orangefs_readahead on top of 5.12.0-rc4:

# dd if=/pvfsmnt/z1 of=/dev/null bs=4194304 count=30
30+0 records in
30+0 records out
125829120 bytes (126 MB, 120 MiB) copied, 0.325919 s, 386 MB/s

So now we're getting somewhere :-). I hope readahead_expand will be
upstream soon. I plan to use inode->i_size and offset to decide how
much expansion is needed on each call to orangefs_readahead; I hope
looking at i_size isn't one of those race condition things I'm always
screwing up on.

If y'all think the orangefs_readahead below is an OK starting point,
I'll add in the i_size/offset logic so I can get full-sized orangefs
gulps of readahead all the way up to the last whatever-sized fragment
of the file, and run xfstests on it to see if it still seems like it
is doing right.

One day, when it is possible, I wish I could figure out how to use
huge pages or something; copying 1024 pages at a time out of the
orangefs internal buffer into the page cache is probably slower than
if I could figure out a way to copy 4194304 bytes out of our buffer
into the page cache at once...

Matthew>> but given that we're talking about doing I/O, probably
Matthew>> not enough to care about.

With orangefs that's almost ALL we care about.

Thanks for your help!

-Mike

static void orangefs_readahead(struct readahead_control *rac)
{
	unsigned int npages;
	loff_t offset;
	struct iov_iter iter;
	struct file *file = rac->file;
	struct inode *inode = file->f_mapping->host;
	struct xarray *i_pages;
	pgoff_t index;
	struct page *page;
	int ret;
	loff_t new_start = readahead_index(rac) * PAGE_SIZE;
	size_t new_len = 524288;

	readahead_expand(rac, new_start, new_len);

	npages = readahead_count(rac);
	offset = readahead_pos(rac);
	i_pages = &file->f_mapping->i_pages;

	iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

	/* read in the pages. */
	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
		npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);

	/* clean up. */
	while ((page = readahead_page(rac))) {
		page_endio(page, false, 0);
		put_page(page);
	}
}
On Sat, Mar 27, 2021 at 11:40:08AM -0400, Mike Marshall wrote:
>	int ret;
>
>	loff_t new_start = readahead_index(rac) * PAGE_SIZE;

That looks like readahead_pos() to me.

>	size_t new_len = 524288;
>	readahead_expand(rac, new_start, new_len);
>
>	npages = readahead_count(rac);
>	offset = readahead_pos(rac);
>	i_pages = &file->f_mapping->i_pages;
>
>	iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

readahead_length()?

>	/* read in the pages. */
>	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
>		npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);
>
>	/* clean up. */
>	while ((page = readahead_page(rac))) {
>		page_endio(page, false, 0);
>		put_page(page);
>	}
> }

What if wait_for_direct_io() returns an error?  Shouldn't you be calling

	page_endio(page, false, ret)

?
This seems OK... ?

static void orangefs_readahead(struct readahead_control *rac)
{
	loff_t offset;
	struct iov_iter iter;
	struct file *file = rac->file;
	struct inode *inode = file->f_mapping->host;
	struct xarray *i_pages;
	struct page *page;
	loff_t new_start = readahead_pos(rac);
	int ret;
	size_t new_len = 0;

	loff_t bytes_remaining = inode->i_size - readahead_pos(rac);
	loff_t pages_remaining = bytes_remaining / PAGE_SIZE;

	if (pages_remaining >= 1024)
		new_len = 4194304;
	else if (pages_remaining > readahead_count(rac))
		new_len = bytes_remaining;

	if (new_len)
		readahead_expand(rac, new_start, new_len);

	offset = readahead_pos(rac);
	i_pages = &file->f_mapping->i_pages;

	iov_iter_xarray(&iter, READ, i_pages, offset, readahead_length(rac));

	/* read in the pages. */
	if ((ret = wait_for_direct_io(ORANGEFS_IO_READ, inode,
			&offset, &iter, readahead_length(rac),
			inode->i_size, NULL, NULL, file)) < 0)
		gossip_debug(GOSSIP_FILE_DEBUG,
			"%s: wait_for_direct_io failed.\n", __func__);
	else
		ret = 0;

	/* clean up. */
	while ((page = readahead_page(rac))) {
		page_endio(page, false, ret);
		put_page(page);
	}
}

I need to go remember how to git send-email through the
kernel.org email server; I apologize for the way gmail
unformats my code, even in plain text mode...

-Mike
I have Linux 5.12-rc4. On top of that I have "David's patches",
an 85-patch series I peeled off of David Howells' fscache-iter
branch on git.kernel.org a few days ago. These patches include
what I need to use readahead_expand. On top of that is my
orangefs_readahead patch.

I have run through the xfstests I run, and am failing generic
75, 112, 127 & 263, which I normally pass.

So I got rid of my orangefs_readahead patch. Still failing
75, 112, 127 & 263.

Then I got rid of David's patches, I'm at generic Linux 5.12-rc4,
and am no longer failing those tests.

Just thought I should mention it... other than that, I'm real
happy with the orangefs_readahead patch, it is a giant improvement.
Y'all's help made all the difference...

-Mike
Mike Marshall <hubcap@omnibond.com> wrote:

> Then I got rid of David's patches, I'm at
> generic Linux 5.12-rc4, and am no longer
> failing those tests.

Um - which patches?

David
David>> Um - which patches?

fscache-iter from
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

When the top commit was
cachefiles: Don't shorten if over 1G: ff60d1fc9c7cb93d53d1d081f65490ff4ab2f122
followed by a bunch of commits up to
iov_iter: Add ITER_XARRAY: 153907f0e36425bdefd50182ca9733d8bc8e716a
And after that it was Linus' 5.12-rc2 tree...
Linux 5.12-rc2: a38fd8748464831584a19438cbb3082b5a2dab15

You were posting patches from there on fs-devel, and I was reading
your messages and knew what I needed was included there. When you
asked me to try iov_iter_xarray I looked for it in your git tree
instead of peeling it out of my gmail. Perhaps I got more stuff
than you intended... I did git format-patch a38fd874..ff60d1fc
and added that to my Linux 5.12-rc4 tree to make my
orangefs_readahead patch that uses readahead_expand.

-Mike
Mike Marshall <hubcap@omnibond.com> wrote:

> I did git format-patch a38fd874..ff60d1fc
> and added that to my Linux 5.12-rc4 tree to make my orangefs_readahead
> patch that uses readahead_expand.

You're using the readahead_expand patch and the iov_iter_xarray patches?

Do you have a public branch I can look at?

David
Hi David... I've been gone on a motorcycle adventure,
sorry for the delay... here's my public branch...

https://github.com/hubcapsc/linux/tree/readahead_v3

-Mike
Mike Marshall <hubcap@omnibond.com> wrote:

> Hi David... I've been gone on a motorcycle adventure,
> sorry for the delay... here's my public branch...
>
> https://github.com/hubcapsc/linux/tree/readahead_v3

That seems to have all of my fscache-iter branch in it.  I thought
you'd said you'd dropped them because they were causing problems.

Anyway, I've distilled the basic netfs lib patches, including the
readahead expansion patch and ITER_XARRAY patch here:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib

if that's of use to you?

If you're using any of these patches, would it be possible to get a
Tested-by for them that I can add?

David
Hi David... I got your netfs-lib branch as of Apr 15. I added my
orangefs_readahead on top of it and ran through xfstests. I failed
generic 75, 112, 127, & 263, which I don't usually fail.

I took off my orangefs_readahead patch and ran xfstests with your
untouched netfs-lib branch. No regressions.

I git-reset back to 5.12.0-rc4 (I think your netfs-lib branch is
based on a linus-tree-of-the-day?) and ran xfstests... no regressions.

So... I think all your stuff is working well from my perspective
and that I need to figure out why my orangefs_readahead patch
is causing the regressions I listed. My readahead implementation (via
your readahead_expand) is really fast, but it is bare-bones... I'm
probably leaving out some important stuff... I see other filesystems'
readahead implementations doing stuff like avoiding doing readahead
on pages that have yet to be written back, for example.

The top two commits at https://github.com/hubcapsc/linux/tree/readahead_v3
are the current state of my readahead implementation.

Please do add
Tested-by: Mike Marshall <hubcap@omnibond.com>

-Mike
On Fri, Apr 16, 2021 at 10:36:52AM -0400, Mike Marshall wrote:
> So... I think all your stuff is working well from my perspective
> and that I need to figure out why my orangefs_readahead patch
> is causing the regressions I listed. My readahead implementation (via
> your readahead_expand) is really fast, but it is bare-bones... I'm
> probably leaving out some important stuff... I see other filesystems'
> readahead implementations doing stuff like avoiding doing readahead
> on pages that have yet to be written back, for example.

You do?!  Actual readahead implementations, or people still implementing
the old ->readpages() method?  The ->readahead() method is _only_ called
for pages which are freshly allocated, Locked and !Uptodate.  If you ever
see a page which is Dirty or Writeback, something has gone very wrong.
Could you tell me which filesystem you saw that bogosity in?
In orangefs_readahead():

	loff_t bytes_remaining = inode->i_size - readahead_pos(rac);
	loff_t pages_remaining = bytes_remaining / PAGE_SIZE;

What happens if bytes_remaining < PAGE_SIZE?  Or even what happens if
bytes_remaining % PAGE_SIZE != 0?

	if ((ret = wait_for_direct_io(ORANGEFS_IO_READ, inode,
			&offset, &iter, readahead_length(rac),
			inode->i_size, NULL, NULL, file)) < 0)

I wonder if you should use iov_length(&iter) rather than
readahead_length(rac).  They *should* be the same.

Also, should you cache inode->i_size lest it change under you due to
truncate?

David
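[One way to address both points — an editorial sketch, not code from the
thread: snapshot the size once with i_size_read(), and round the
expansion up to a page boundary so a partial tail page is still covered.
4194304 is the 4 MB optimal block size mentioned earlier in the thread.]

	loff_t isize = i_size_read(inode);	/* stable snapshot vs. truncate */
	loff_t bytes_remaining = isize - readahead_pos(rac);
	size_t new_len = 0;

	/* round_up() handles bytes_remaining % PAGE_SIZE != 0: the
	 * expansion still covers the short tail page instead of
	 * leaving it unread.
	 */
	if (bytes_remaining > 0)
		new_len = min_t(loff_t,
				round_up(bytes_remaining, PAGE_SIZE),
				4194304);

	if (new_len)
		readahead_expand(rac, readahead_pos(rac), new_len);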
>> What happens if bytes_remaining < PAGE_SIZE?

I think on a call where that occurs, new_len won't get set and
readahead_expand won't get called. I don't see how that's not correct,
but I question me more than I question you :-) ...

>> what happens if bytes_remaining % PAGE_SIZE != 0

I think bytes_remaining % PAGE_SIZE worth of bytes won't get read on
that call, but that the readahead callout keeps getting called until
all the bytes are read? After you asked this, I thought about adding 1
to new_len in such cases, and did some tests that way; it seems to me
like it works out as is.

>> I wonder if you should use iov_length(&iter)

iov_length has two arguments. The first one would maybe be iter.iov and
the second one would be... ?

>> should you cache inode->i_size lest it change under you due to truncate

That seems important, but I can't return an error from the void
readahead callout. Would I react by somehow returning the pages back
to their original condition, or ?

Anywho... I see that you've force pushed a new netfs... I think you
have it applied to a linus-tree-of-the-day on top of 5.12-rc4?
I have taken these patches from
git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git (netfs-lib)

0001-iov_iter-Add-ITER_XARRAY.patch
0002-mm-Add-set-end-wait-functions-for-PG_private_2.patch
0003-mm-filemap-Pass-the-file_ra_state-in-the-ractl.patch
0004-fs-Document-file_ra_state.patch
0005-mm-readahead-Handle-ractl-nr_pages-being-modified.patch
0006-mm-Implement-readahead_control-pageset-expansion.patch
0007-netfs-Make-a-netfs-helper-module.patch
0008-netfs-Documentation-for-helper-library.patch
0009-netfs-mm-Move-PG_fscache-helper-funcs-to-linux-netfs.patch
0010-netfs-mm-Add-set-end-wait_on_page_fscache-aliases.patch
0011-netfs-Provide-readahead-and-readpage-netfs-helpers.patch
0012-netfs-Add-tracepoints.patch
0013-netfs-Gather-stats.patch
0014-netfs-Add-write_begin-helper.patch
0015-netfs-Define-an-interface-to-talk-to-a-cache.patch
0016-netfs-Add-a-tracepoint-to-log-failures-that-would-be.patch
0017-fscache-cachefiles-Add-alternate-API-to-use-kiocb-fo.patch

... and added them on top of Linux 5.12-rc8 and added my readahead
patch to that. Now I fail one extra xfstest: I fail generic/075,
generic/112, generic/127, generic/263 and generic/438.

I haven't found an obvious (to me) problem with my patch and I can't
claim to understand everything that is going on in the patches I have
of yours... I'll keep looking...

-Mike
>> You do?! Actual readahead implementations, or
>> people still implementing the old ->readpages() method?

No :-) I grabbed that as an example off the top of my head of the
kind of thing I saw while reading readahead code, but that I didn't
try to handle in my simple implementation of readahead. I'm guessing
that since I have some xfstest regressions maybe my implementation
overlooks one or more important details...

-Mike
Mike Marshall <hubcap@omnibond.com> wrote:

> >> I wonder if you should use iov_length(&iter)
>
> iov_length has two arguments. The first one would maybe be iter.iov and
> the second one would be... ?

Sorry, I meant iov_iter_count(&iter).

I'll look at the other things next week.  Is it easy to set up an
orangefs client and server?

David
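[For illustration, a sketch of what that substitution would look like in
the read call. iov_iter_count() returns the bytes remaining in the
iterator, which immediately after iov_iter_xarray() equals the full
mapped length; taking the length from the iterator that was actually
built means the two values can't drift apart.]

	iov_iter_xarray(&iter, READ, i_pages, offset, readahead_length(rac));

	/* iov_iter_count(&iter) == readahead_length(rac) at this point. */
	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
		iov_iter_count(&iter), inode->i_size, NULL, NULL, file);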
Mike Marshall <hubcap@omnibond.com> wrote:

> Anywho... I see that you've force pushed a new netfs... I think you
> have it applied to a linus-tree-of-the-day on top of 5.12-rc4?
> I have taken these patches from
> git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git (netfs-lib)
>
> 0001-iov_iter-Add-ITER_XARRAY.patch
> ...
> 0017-fscache-cachefiles-Add-alternate-API-to-use-kiocb-fo.patch

Can you add this patch also:

	https://lore.kernel.org/r/3545034.1619392490@warthog.procyon.org.uk/

	[PATCH] iov_iter: Four fixes for ITER_XARRAY

David
>> Is it easy to set up an orangefs client and server?

I think it is easy to set up a test system on a single VM, but then
I do it all the time. I souped up the build details in
Documentation/filesystems/orangefs.rst not too long ago; I hope it
is useful.

Your VM would need to be a "developer" VM, with all the autotools
stuff and such if you build from source. I also worked on the
configure stuff so that you would learn about any packages you lack
at configure time; I hope that is also still good.

I read your message about trying again with the
"Four fixes for ITER_XARRAY" patch, I'll report on how that goes...

-Mike
I added the "Four fixes for ITER_XARRAY" patch and got things
running again. I ran the tests I had regressions on by themselves,
and I still fail generic/075, generic/112, generic/127 and
generic/263. generic/438 passes now.

I was analyzing what test 075 was doing when it was failing, and
I found it to be doing this:

/home/hubcap/xfstests-dev/ltp/fsx -d -N 1000 -S 0 -P /home/hubcap/xfstests-dev /pvfsmnt/whatever

The above used to fail every time... now it works every time.
Progress :-). I'm about to launch the whole suite of tests, I'll
report back on what happens later...

-Mike
Mike Marshall <hubcap@omnibond.com> wrote:
> Progress :-).
Yay! :-)
David
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 48f0547d4850..682a968cb82a 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -244,6 +244,25 @@ static int orangefs_writepages(struct address_space *mapping,
 
 static int orangefs_launder_page(struct page *);
 
+/*
+ * Prefill the page cache with some pages that we're probably
+ * about to need...
+ */
+static void orangefs_readahead(struct readahead_control *rac)
+{
+	pgoff_t index = readahead_index(rac);
+	struct page *page;
+
+	while ((page = readahead_page(rac))) {
+		prefetchw(&page->flags);
+		put_page(page);
+		unlock_page(page);
+		index++;
+	}
+
+	return;
+}
+
 static int orangefs_readpage(struct file *file, struct page *page)
 {
 	struct inode *inode = page->mapping->host;
@@ -260,11 +279,16 @@ static int orangefs_readpage(struct file *file, struct page *page)
 	int remaining;
 
 	/*
-	 * Get up to this many bytes from Orangefs at a time and try
-	 * to fill them into the page cache at once. Tests with dd made
-	 * this seem like a reasonable static number, if there was
-	 * interest perhaps this number could be made setable through
-	 * sysfs...
+	 * Orangefs isn't a good fit for reading files one page at
+	 * a time. Get up to "read_size" bytes from Orangefs at a time and
+	 * try to fill them into the page cache at once. Readahead code in
+	 * mm already got us some extra pages by calling orangefs_readahead,
+	 * but it doesn't know how many we actually wanted, so we'll
+	 * get some more after we use up the extra ones we got from
+	 * orangefs_readahead. Tests with dd made "read_size" seem
+	 * like a reasonable static number of bytes to get from orangefs,
+	 * if there was interest perhaps "read_size" could be made
+	 * setable through sysfs or something...
 	 */
 	read_size = 524288;
@@ -302,31 +326,19 @@ static int orangefs_readpage(struct file *file, struct page *page)
 	slot_index = 0;
 	while ((remaining - PAGE_SIZE) >= PAGE_SIZE) {
 		remaining -= PAGE_SIZE;
-		/*
-		 * It is an optimization to try and fill more than one
-		 * page... by now we've already gotten the single
-		 * page we were after, if stuff doesn't seem to
-		 * be going our way at this point just return
-		 * and hope for the best.
-		 *
-		 * If we look for pages and they're already there is
-		 * one reason to give up, and if they're not there
-		 * and we can't create them is another reason.
-		 */
 		index++;
 		slot_index++;
-		next_page = find_get_page(inode->i_mapping, index);
+		next_page = find_lock_page(inode->i_mapping, index);
 		if (next_page) {
-			gossip_debug(GOSSIP_FILE_DEBUG,
-				"%s: found next page, quitting\n",
-				__func__);
-			put_page(next_page);
-			goto out;
+			printk("%s: found readahead page\n", __func__);
+		} else {
+			next_page =
+				find_or_create_page(inode->i_mapping,
+						index,
+						GFP_KERNEL);
+			printk("%s: alloced my own page\n", __func__);
 		}
-		next_page = find_or_create_page(inode->i_mapping,
-						index,
-						GFP_KERNEL);
 		/*
 		 * I've never hit this, leave it as a printk for
 		 * now so it will be obvious.
@@ -659,6 +671,7 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
 /** ORANGEFS2 implementation of address space operations */
 static const struct address_space_operations orangefs_address_operations = {
 	.writepage = orangefs_writepage,
+	.readahead = orangefs_readahead,
 	.readpage = orangefs_readpage,
 	.writepages = orangefs_writepages,
 	.set_page_dirty = __set_page_dirty_nobuffers,
diff --git a/fs/orangefs/orangefs-mod.c b/fs/orangefs/orangefs-mod.c
index 74a3d6337ef4..cd7297815f91 100644
--- a/fs/orangefs/orangefs-mod.c
+++ b/fs/orangefs/orangefs-mod.c
@@ -31,7 +31,7 @@ static ulong module_parm_debug_mask;
 __u64 orangefs_gossip_debug_mask;
 int op_timeout_secs = ORANGEFS_DEFAULT_OP_TIMEOUT_SECS;
 int slot_timeout_secs = ORANGEFS_DEFAULT_SLOT_TIMEOUT_SECS;
-int orangefs_cache_timeout_msecs = 50;
+int orangefs_cache_timeout_msecs = 500;
 int orangefs_dcache_timeout_msecs = 50;
 int orangefs_getattr_timeout_msecs = 50;