Message ID | 164303051132.2163193.10493291874899600548.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] cifs: Transition from ->readpages() to ->readahead() | expand |
On Mon, Jan 24, 2022 at 01:21:51PM +0000, David Howells wrote: > Transition the cifs filesystem from using the old ->readpages() method to > using the new ->readahead() method. > > For the moment, this removes any invocation of fscache to read data from > the local cache, leaving that to another patch. > > Questions for Willy: > - Can we get a function to return the number of pages in a readahead > batch? > > - Can we get a function to commit a readahead batch? Currently, this is > done when we call __readahead_batch(), but that means ractl->_nr_pages > isn't up to date at the point we need it to be. However, we want to > check to see if the ractl is empty, then get server credits and only > *then* call __readahead_batch() as we don't know how big a batch we're > allowed till we have the credits. If you insist on using the primitives in a way that nobody else uses them, you're going to find they don't work. What's wrong with the way that FUSE uses them in fuse_readahead()?
Matthew Wilcox <willy@infradead.org> wrote: > > Questions for Willy: > > - Can we get a function to return the number of pages in a readahead > > batch? > > > > - Can we get a function to commit a readahead batch? Currently, this is > > done when we call __readahead_batch(), but that means ractl->_nr_pages > > isn't up to date at the point we need it to be. However, we want to > > check to see if the ractl is empty, then get server credits and only > > *then* call __readahead_batch() as we don't know how big a batch we're > > allowed till we have the credits. > > If you insist on using the primitives in a way that nobody else uses > them, you're going to find they don't work. What's wrong with the > way that FUSE uses them in fuse_readahead()? You mean doing this? nr_pages = readahead_count(rac) - nr_pages; that would seem to indicate that the readahead interface is wrong. Why should readahead_count() need correction? I think I can see *why* the batching stuff is hidden, but it seems that the comment for readahead_count() needs to mention this if you aren't going to fix it. Would it be possible to make readahead_count() do: return rac->_nr_pages - rac->_batch_count; maybe? David
On Mon, Jan 24, 2022 at 03:14:41PM +0000, David Howells wrote: > Matthew Wilcox <willy@infradead.org> wrote: > > > > Questions for Willy: > > > - Can we get a function to return the number of pages in a readahead > > > batch? > > > > > > - Can we get a function to commit a readahead batch? Currently, this is > > > done when we call __readahead_batch(), but that means ractl->_nr_pages > > > isn't up to date at the point we need it to be. However, we want to > > > check to see if the ractl is empty, then get server credits and only > > > *then* call __readahead_batch() as we don't know how big a batch we're > > > allowed till we have the credits. > > > > If you insist on using the primitives in a way that nobody else uses > > them, you're going to find they don't work. What's wrong with the > > way that FUSE uses them in fuse_readahead()? > > You mean doing this? > > nr_pages = readahead_count(rac) - nr_pages; > > that would seem to indicate that the readahead interface is wrong. Why should > readahead_count() need correction? I think I can see *why* the batching stuff > is hidden, but it seems that the comment for readahead_count() needs to > mention this if you aren't going to fix it. > > Would it be possible to make readahead_count() do: > > return rac->_nr_pages - rac->_batch_count; > > maybe? Yes, I think that would make sense. Do we also need to change readhead_length()? It seems to me that it's only ever called once at initialisation, so it should be possible to keep the two in sync. Can you write & test such a patch? I'll support it going upstream (either by taking it myself or giving you a R-b to take it through your tree).
Matthew Wilcox <willy@infradead.org> wrote: > > Would it be possible to make readahead_count() do: > > > > return rac->_nr_pages - rac->_batch_count; > > > > maybe? > > Yes, I think that would make sense. Do we also need to change > readhead_length()? It seems to me that it's only ever called once at > initialisation, so it should be possible to keep the two in sync. > Can you write & test such a patch? I'll support it going upstream > (either by taking it myself or giving you a R-b to take it through your > tree). It seems I also have a problem with readahead_index() needing compensation too. I'm guessing that's more of a problem, however, as I think that's expected to refer to the beginning of the current batch. David
On Mon, Jan 24, 2022 at 03:46:27PM +0000, David Howells wrote: > Matthew Wilcox <willy@infradead.org> wrote: > > > > Would it be possible to make readahead_count() do: > > > > > > return rac->_nr_pages - rac->_batch_count; > > > > > > maybe? > > > > Yes, I think that would make sense. Do we also need to change > > readhead_length()? It seems to me that it's only ever called once at > > initialisation, so it should be possible to keep the two in sync. > > Can you write & test such a patch? I'll support it going upstream > > (either by taking it myself or giving you a R-b to take it through your > > tree). > > It seems I also have a problem with readahead_index() needing compensation > too. I'm guessing that's more of a problem, however, as I think that's > expected to refer to the beginning of the current batch. Well, that's the problem isn't it? You're expecting to mutate the state and then refer to its previous state instead of its current state, whereas the other users refer to the current state instead of the previous state. Can't you pull readahead_index() out of the ractl ahead of the mutation?
Matthew Wilcox <willy@infradead.org> wrote: > Well, that's the problem isn't it? You're expecting to mutate the state > and then refer to its previous state instead of its current state, No. I *don't* want to see the previous state. That's where the problem is: The previous state isn't cleared up until the point I ask for a new batch - but I need to query the ractl to find the remaining window before I can ask for a new batch. The first pass through the loop is, in effect, substantially different to the second and subsequent passes. > whereas the other users refer to the current state instead of the > previous state. Fuse has exactly the same issues: nr_pages = readahead_count(rac) - nr_pages; if (nr_pages > max_pages) nr_pages = max_pages; if (nr_pages == 0) break; ia = fuse_io_alloc(NULL, nr_pages); ... nr_pages = __readahead_batch(rac, ap->pages, nr_pages); It needs to find out how many pages it needs so that it can allocate its page array before it can get a new batch, but the ractl is holding the previous state. > Can't you pull readahead_index() out of the ractl ahead of the mutation? I'm not sure what you mean by that. What I'm now doing is: while ((nr_pages = readahead_count(ractl) - ractl->_batch_count)) { ... pgoff_t index = readahead_index(ractl) + ractl->_batch_count; ... rc = cifs_fscache_query_occupancy( ractl->mapping->host, index, nr_pages, ... rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize, &rsize, credits); ... nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl)); nr_pages = min_t(size_t, nr_pages, next_cached - index); ... rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete); ... got = __readahead_batch(ractl, rdata->pages, nr_pages); ... } I need the count to know how many, if any, pages are remaining; I need the count and index of the window to find out if fscache has any data; I need the count to allocate a page array. Only after all that can I crank the next batch for the server (assuming I didn't delegate to the cache along the way, but that calls readahead_page()). (Yes, I would like to remove this code entirely and just call into netfslib. I have patches to do that, but it involves quite an overhaul of the cifs driver, but the above might get cifs caching again more quickly pending that. Maybe I should just take the easy way here and cache the last state so that I can compensate in the way fuse does). David
On Mon, Jan 24, 2022 at 04:25:11PM +0000, David Howells wrote: > Matthew Wilcox <willy@infradead.org> wrote: > > > Well, that's the problem isn't it? You're expecting to mutate the state > > and then refer to its previous state instead of its current state, > > No. I *don't* want to see the previous state. That's where the problem is: > The previous state isn't cleared up until the point I ask for a new batch - > but I need to query the ractl to find the remaining window before I can ask > for a new batch. Oh, right, even worse -- you want to know the _future_ state of the iterator before you mutate it (it's kind of the same thing though if you haven't looked at how the iterator works internally). > > whereas the other users refer to the current state instead of the > > previous state. > > Fuse has exactly the same issues: ... and yet you refuse to solve it the same way as fuse ... > > Can't you pull readahead_index() out of the ractl ahead of the mutation? > > I'm not sure what you mean by that. What I'm now doing is: > > while ((nr_pages = readahead_count(ractl) - ractl->_batch_count)) { > ... > pgoff_t index = readahead_index(ractl) + ractl->_batch_count; > ... > rc = cifs_fscache_query_occupancy( > ractl->mapping->host, index, nr_pages, > ... > rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize, > &rsize, credits); > ... > nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl)); > nr_pages = min_t(size_t, nr_pages, next_cached - index); > ... > rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete); > ... > got = __readahead_batch(ractl, rdata->pages, nr_pages); > ... > } > > I need the count to know how many, if any, pages are remaining; I need the > count and index of the window to find out if fscache has any data; I need the > count to allocate a page array. Only after all that can I crank the next > batch for the server (assuming I didn't delegate to the cache along the way, > but that calls readahead_page()). The problem is that the other users very much want to refer to the pre-mutation state. eg, btrfs: while ((nr = readahead_page_batch(rac, pagepool))) { u64 contig_start = readahead_pos(rac); u64 contig_end = contig_start + readahead_batch_length(rac) - 1; contiguous_readpages(pagepool, nr, contig_start, contig_end, &em_cached, &bio_ctrl, &prev_em_start); } The API isn't designed to work the way you want it to work. So either we redesign it (... and change all the existing users ...) or you use it the way that FUSE does, which is admittedly suboptimal, but you're also saying it's temporary.
diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 59334be9ed3b..618e4851e240 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -4269,8 +4269,6 @@ cifs_readv_complete(struct work_struct *work) for (i = 0; i < rdata->nr_pages; i++) { struct page *page = rdata->pages[i]; - lru_cache_add(page); - if (rdata->result == 0 || (rdata->result == -EAGAIN && got_bytes)) { flush_dcache_page(page); @@ -4340,7 +4338,6 @@ readpages_fill_pages(struct TCP_Server_Info *server, * fill them until the writes are flushed. */ zero_user(page, 0, PAGE_SIZE); - lru_cache_add(page); flush_dcache_page(page); SetPageUptodate(page); unlock_page(page); @@ -4350,7 +4347,6 @@ readpages_fill_pages(struct TCP_Server_Info *server, continue; } else { /* no need to hold page hostage */ - lru_cache_add(page); unlock_page(page); put_page(page); rdata->pages[i] = NULL; @@ -4393,92 +4389,16 @@ cifs_readpages_copy_into_pages(struct TCP_Server_Info *server, return readpages_fill_pages(server, rdata, iter, iter->count); } -static int -readpages_get_pages(struct address_space *mapping, struct list_head *page_list, - unsigned int rsize, struct list_head *tmplist, - unsigned int *nr_pages, loff_t *offset, unsigned int *bytes) -{ - struct page *page, *tpage; - unsigned int expected_index; - int rc; - gfp_t gfp = readahead_gfp_mask(mapping); - - INIT_LIST_HEAD(tmplist); - - page = lru_to_page(page_list); - - /* - * Lock the page and put it in the cache. Since no one else - * should have access to this page, we're safe to simply set - * PG_locked without checking it first. - */ - __SetPageLocked(page); - rc = add_to_page_cache_locked(page, mapping, - page->index, gfp); - - /* give up if we can't stick it in the cache */ - if (rc) { - __ClearPageLocked(page); - return rc; - } - - /* move first page to the tmplist */ - *offset = (loff_t)page->index << PAGE_SHIFT; - *bytes = PAGE_SIZE; - *nr_pages = 1; - list_move_tail(&page->lru, tmplist); - - /* now try and add more pages onto the request */ - expected_index = page->index + 1; - list_for_each_entry_safe_reverse(page, tpage, page_list, lru) { - /* discontinuity ? */ - if (page->index != expected_index) - break; - - /* would this page push the read over the rsize? */ - if (*bytes + PAGE_SIZE > rsize) - break; - - __SetPageLocked(page); - rc = add_to_page_cache_locked(page, mapping, page->index, gfp); - if (rc) { - __ClearPageLocked(page); - break; - } - list_move_tail(&page->lru, tmplist); - (*bytes) += PAGE_SIZE; - expected_index++; - (*nr_pages)++; - } - return rc; -} - -static int cifs_readpages(struct file *file, struct address_space *mapping, - struct list_head *page_list, unsigned num_pages) +static void cifs_readahead(struct readahead_control *ractl) { int rc; - int err = 0; - struct list_head tmplist; - struct cifsFileInfo *open_file = file->private_data; - struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(file); + struct cifsFileInfo *open_file = ractl->file->private_data; + struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(ractl->file); struct TCP_Server_Info *server; pid_t pid; unsigned int xid; xid = get_xid(); - /* - * Reads as many pages as possible from fscache. Returns -ENOBUFS - * immediately if the cookie is negative - * - * After this point, every page in the list might have PG_fscache set, - * so we will need to clean that up off of every page we don't use. - */ - rc = cifs_readpages_from_fscache(mapping->host, mapping, page_list, - &num_pages); - if (rc == 0) { - free_xid(xid); - return rc; - } if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD) pid = open_file->pid; @@ -4489,7 +4409,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping, server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses); cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n", - __func__, file, mapping, num_pages); + __func__, ractl->file, ractl->mapping, readahead_count(ractl)); /* * Start with the page at end of list and move it to private @@ -4502,26 +4422,27 @@ static int cifs_readpages(struct file *file, struct address_space *mapping, * the order of declining indexes. When we put the pages in * the rdata->pages, then we want them in increasing order. */ - while (!list_empty(page_list) && !err) { - unsigned int i, nr_pages, bytes, rsize; - loff_t offset; - struct page *page, *tpage; + while (readahead_count(ractl) - ractl->_batch_count) { + unsigned int i, nr_pages, got, rsize; + struct page *page; struct cifs_readdata *rdata; struct cifs_credits credits_on_stack; struct cifs_credits *credits = &credits_on_stack; if (open_file->invalidHandle) { rc = cifs_reopen_file(open_file, true); - if (rc == -EAGAIN) - continue; - else if (rc) + if (rc) { + if (rc == -EAGAIN) + continue; break; + } } rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize, &rsize, credits); if (rc) break; + nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl)); /* * Give up immediately if rsize is too small to read an entire @@ -4529,16 +4450,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping, * reach this point however since we set ra_pages to 0 when the * rsize is smaller than a cache page. */ - if (unlikely(rsize < PAGE_SIZE)) { - add_credits_and_wake_if(server, credits, 0); - free_xid(xid); - return 0; - } - - nr_pages = 0; - err = readpages_get_pages(mapping, page_list, rsize, &tmplist, - &nr_pages, &offset, &bytes); - if (!nr_pages) { + if (unlikely(!nr_pages)) { add_credits_and_wake_if(server, credits, 0); break; } @@ -4546,36 +4458,31 @@ static int cifs_readpages(struct file *file, struct address_space *mapping, rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete); if (!rdata) { /* best to give up if we're out of mem */ - list_for_each_entry_safe(page, tpage, &tmplist, lru) { - list_del(&page->lru); - lru_cache_add(page); - unlock_page(page); - put_page(page); - } - rc = -ENOMEM; add_credits_and_wake_if(server, credits, 0); break; } - rdata->cfile = cifsFileInfo_get(open_file); - rdata->server = server; - rdata->mapping = mapping; - rdata->offset = offset; - rdata->bytes = bytes; - rdata->pid = pid; - rdata->pagesz = PAGE_SIZE; - rdata->tailsz = PAGE_SIZE; + got = __readahead_batch(ractl, rdata->pages, nr_pages); + if (got != nr_pages) { + pr_warn("__readahead_batch() returned %u/%u\n", + got, nr_pages); + nr_pages = got; + } + + rdata->nr_pages = nr_pages; + rdata->bytes = readahead_batch_length(ractl); + rdata->cfile = cifsFileInfo_get(open_file); + rdata->server = server; + rdata->mapping = ractl->mapping; + rdata->offset = readahead_pos(ractl); + rdata->pid = pid; + rdata->pagesz = PAGE_SIZE; + rdata->tailsz = PAGE_SIZE; rdata->read_into_pages = cifs_readpages_read_into_pages; rdata->copy_into_pages = cifs_readpages_copy_into_pages; - rdata->credits = credits_on_stack; - - list_for_each_entry_safe(page, tpage, &tmplist, lru) { - list_del(&page->lru); - rdata->pages[rdata->nr_pages++] = page; - } + rdata->credits = credits_on_stack; rc = adjust_credits(server, &rdata->credits, rdata->bytes); - if (!rc) { if (rdata->cfile->invalidHandle) rc = -EAGAIN; @@ -4587,7 +4494,6 @@ static int cifs_readpages(struct file *file, struct address_space *mapping, add_credits_and_wake_if(server, &rdata->credits, 0); for (i = 0; i < rdata->nr_pages; i++) { page = rdata->pages[i]; - lru_cache_add(page); unlock_page(page); put_page(page); } @@ -4597,10 +4503,13 @@ static int cifs_readpages(struct file *file, struct address_space *mapping, } kref_put(&rdata->refcount, cifs_readdata_release); + + ractl->_nr_pages -= ractl->_batch_count; + ractl->_index += ractl->_batch_count; + ractl->_batch_count = 0; } free_xid(xid); - return rc; } /* @@ -4924,7 +4833,7 @@ void cifs_oplock_break(struct work_struct *work) * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests * so this method should never be called. * - * Direct IO is not yet supported in the cached mode. + * Direct IO is not yet supported in the cached mode. */ static ssize_t cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter) @@ -5006,7 +4915,7 @@ static int cifs_set_page_dirty(struct page *page) const struct address_space_operations cifs_addr_ops = { .readpage = cifs_readpage, - .readpages = cifs_readpages, + .readahead = cifs_readahead, .writepage = cifs_writepage, .writepages = cifs_writepages, .write_begin = cifs_write_begin,
Transition the cifs filesystem from using the old ->readpages() method to using the new ->readahead() method. For the moment, this removes any invocation of fscache to read data from the local cache, leaving that to another patch. Questions for Willy: - Can we get a function to return the number of pages in a readahead batch? - Can we get a function to commit a readahead batch? Currently, this is done when we call __readahead_batch(), but that means ractl->_nr_pages isn't up to date at the point we need it to be. However, we want to check to see if the ractl is empty, then get server credits and only *then* call __readahead_batch() as we don't know how big a batch we're allowed till we have the credits. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <smfrench@gmail.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: linux-cachefs@redhat.com --- fs/cifs/file.c | 163 ++++++++++++-------------------------------------------- 1 file changed, 36 insertions(+), 127 deletions(-)