Message ID | 2213409.1677249075@warthog.procyon.org.uk (mailing list archive)
---|---
State | New
Series | [RFC] cifs: Fix cifs_writepages_region()
On Fri, Feb 24, 2023 at 6:31 AM David Howells <dhowells@redhat.com> wrote:
>
> Here's the simplest fix for cifs_writepages_region() that gets it to work.

Hmm. The commit message for this is wrong.

> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them.

It never retried them.

The "skip_write" code did that same

        start += folio_size(folio);
        continue;

that your patch does, but it *also* had that

        if (skips >= 5 || need_resched()) {

thing to just stop writing entirely.

> I'm not entirely sure why it fixes it, though.

Yes. Strange. Because it does the exact same thing as the "Oh, the
trylock worked, but it was still under writeback or fscache" case did.
I just merged all the "skip write" cases.

But the code is clearly (a) not working and (b) the whole skip count
and need_resched() logic is a bit strange to begin with.

Can you humor me, and try if just removing that skip count thing
instead? IOW, this attached patch?

Because that whole "let's stop writing if we need to reschedule"
sounds truly odd (we have a cond_resched(), although it's per folio
batch, not per-folio), and the skip count logic doesn't make much
sense to me either.

SteveF?

Linus
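[For context, a minimal sketch of the two behaviours being compared, handling one locked folio of the batch. The helper name and structure are illustrative assumptions, not the actual fs/cifs/file.c code.]

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only.  Returns false if the caller should stop
 * writing the region entirely.
 */
static bool handle_one_batch_folio(struct address_space *mapping,
				   struct folio *folio, loff_t *start,
				   unsigned int *skips)
{
	if (folio_mapping(folio) != mapping || !folio_test_dirty(folio)) {
		/* Already cleaned up (e.g. by cifs_extend_writeback()):
		 * just step past it and carry on with the batch.
		 */
		*start += folio_size(folio);
		folio_unlock(folio);
		return true;
	}

	if (folio_test_writeback(folio)) {
		/* The "skip_write" path steps past the folio in the same
		 * way, but additionally abandons the whole region after
		 * too many skips or when rescheduling is needed.
		 */
		folio_unlock(folio);
		*start += folio_size(folio);
		if (++(*skips) >= 5 || need_resched())
			return false;	/* stop writing entirely */
		return true;
	}

	/* ...otherwise: clear dirty, set writeback, extend and submit. */
	return true;
}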
On Fri, Feb 24, 2023 at 02:31:15PM +0000, David Howells wrote:
> Here's the simplest fix for cifs_writepages_region() that gets it to work.
>
> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them.  I'm not entirely sure why it
> fixes it, though.  It's also not the most efficient as, in the common case,
> this is going to happen a lot because cifs_extend_writeback() is going to
> clean up the contiguous pages in the batch - and then this skip will occur
> for those.

Why are you doing it this way?  What's wrong with using
write_cache_pages() to push all the contiguous dirty folios into a single
I/O object, submitting it when the folios turn out not to be contiguous,
or when we run out of a batch?

You've written an awful lot of code here and it's a different model from
every other filesystem.  Why is it better?
Matthew Wilcox <willy@infradead.org> wrote:

> Why are you doing it this way?  What's wrong with using
> write_cache_pages() to push all the contiguous dirty folios into a single
> I/O object, submitting it when the folios turn out not to be contiguous,
> or when we run out of a batch?
>
> You've written an awful lot of code here and it's a different model from
> every other filesystem.  Why is it better?

Because write_cache_pages():

 (1) Takes no account of fscache.  I can't just build knowledge of PG_fscache
     into it because PG_private_2 may be used differently by other filesystems
     (btrfs, for example).  (I'm also trying to phase out the use of
     PG_private_2 and instead use PG_writeback to cover both, with the
     difference recorded elsewhere - but that's not there yet.)

 (2) Calls ->writepage() individually for each folio - which is excessive.  In
     AFS's implementation, we locate the first folio, then race through the
     following folios without ever waiting until we hit something that's
     locked or a gap, and then stop and submit (a sketch of this follows
     below).

     write_cache_pages(), otoh, calls us with the next folio already undirtied
     and set for writeback when we find out that we don't want it yet.

 (3) Skips over holes, but at some point in the future we're going to need to
     schedule adjacent clean pages (before and after) for writeback too, to
     handle transport compression and fscache updates if the granule size for
     either is larger than the folio size.

It might be better to take what's in cifs, generalise it and replace
write_cache_pages() with it, then have a "->submit_write()" aop that takes an
ITER_XARRAY iterator to write from.

David
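[A minimal sketch of the model described in point (2): absorb the folios that immediately follow an already-locked dirty folio without ever sleeping, then submit the whole contiguous range as one write. The function name is an assumption, fscache/error handling is omitted, and filemap_get_folio() returning NULL for a hole reflects kernels of this vintage; this is not the actual fs/afs or fs/cifs code.]

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Returns the index just past the last folio absorbed into the write. */
static pgoff_t absorb_following_folios(struct address_space *mapping,
				       pgoff_t next)
{
	struct folio *folio;

	for (;;) {
		folio = filemap_get_folio(mapping, next);
		if (!folio)
			break;			/* hit a hole */
		if (!folio_trylock(folio)) {	/* never wait here */
			folio_put(folio);
			break;
		}
		if (!folio_test_dirty(folio) ||
		    folio_test_writeback(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			break;
		}

		/* Take it: it becomes part of the same contiguous write. */
		folio_clear_dirty_for_io(folio);
		folio_start_writeback(folio);
		next = folio_next_index(folio);
		folio_unlock(folio);
		folio_put(folio);
	}

	/* The caller submits one write covering [first, next) here. */
	return next;
}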
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Can you humor me, and try if just removing that skip count thing
> instead? IOW, this attached patch?

That works too.

> Because that whole "let's stop writing if we need to reschedule" sounds
> truly odd (we have a cond_resched(), although it's per folio batch, not
> per-folio), and the skip count logic doesn't make much sense to me either.

The skip thing, in my code, is only used in WB_SYNC_NONE mode.  If we hit 5
things in progress or rescheduling is required, we return to the caller on the
basis that conflicting flushes appear to be happening in other threads.

David
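[A minimal sketch of that WB_SYNC_NONE gating; the helper name is an assumption for illustration, not the actual cifs code.]

#include <linux/sched.h>
#include <linux/writeback.h>

static bool should_give_up_writing(const struct writeback_control *wbc,
				   unsigned int skips)
{
	/* Data-integrity writeback (WB_SYNC_ALL) must keep going. */
	if (wbc->sync_mode != WB_SYNC_NONE)
		return false;

	/* Background writeback: if we keep hitting folios that are already
	 * being written out, or we need to reschedule, assume conflicting
	 * flushes are in progress in other threads and return to the caller.
	 */
	return skips >= 5 || need_resched();
}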
On Fri, Feb 24, 2023 at 05:15:41PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > Why are you doing it this way?  What's wrong with using
> > write_cache_pages() to push all the contiguous dirty folios into a single
> > I/O object, submitting it when the folios turn out not to be contiguous,
> > or when we run out of a batch?
> >
> > You've written an awful lot of code here and it's a different model from
> > every other filesystem.  Why is it better?
>
> Because write_cache_pages():
>
>  (1) Takes no account of fscache.  I can't just build knowledge of PG_fscache
>      into it because PG_private_2 may be used differently by other
>      filesystems (btrfs, for example).  (I'm also trying to phase out the use
>      of PG_private_2 and instead use PG_writeback to cover both, with the
>      difference recorded elsewhere - but that's not there yet.)

I don't understand why waiting for fscache here is even the right thing
to do.  The folio is dirty and needs to be written back to the server.
Why should beginning that write wait for the current write to the
fscache to complete?

>  (2) Calls ->writepage() individually for each folio - which is excessive.
>      In AFS's implementation, we locate the first folio, then race through
>      the following folios without ever waiting until we hit something that's
>      locked or a gap, and then stop and submit.
>
>      write_cache_pages(), otoh, calls us with the next folio already
>      undirtied and set for writeback when we find out that we don't want it
>      yet.

I think you're so convinced that your way is better that you don't see
what write_cache_pages() is actually doing.  Yes, the writepage callback
is called once for each folio, but that doesn't actually submit a write.
Instead, they're accumulated into the equivalent of a wdata, and the
wdata is submitted when there's a discontiguity or when we've accumulated
enough to satisfy the constraints of the wbc.

>  (3) Skips over holes, but at some point in the future we're going to need
>      to schedule adjacent clean pages (before and after) for writeback too,
>      to handle transport compression and fscache updates if the granule size
>      for either is larger than the folio size.

Then we'll need it for other filesystems too, so we should enhance
write_cache_pages().

> It might be better to take what's in cifs, generalise it and replace
> write_cache_pages() with it, then have a "->submit_write()" aop that takes an
> ITER_XARRAY iterator to write from.

->writepages _is_ ->submit_write.  Should write_cache_pages() be better?
Maybe!  But it works a whole lot better than what AFS was doing and what
you've now ladled into cifs.
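[A rough sketch of the accumulate-and-submit pattern described above: the per-folio callback issues no I/O itself, it just grows a pending range and flushes it when the folios stop being contiguous. The pending_write structure and function names are hypothetical stand-ins for a cifs wdata, and the exact writepage_t callback signature differs between kernel versions.]

#include <linux/pagemap.h>
#include <linux/writeback.h>

struct pending_write {
	pgoff_t first;		/* first folio index in the range */
	pgoff_t next;		/* index the next folio must have */
	bool active;
};

static void submit_pending(struct pending_write *pw)
{
	if (pw->active) {
		/* ...build one request covering [first, next) and send it... */
		pw->active = false;
	}
}

/* Shaped like a write_cache_pages() callback; extends or flushes the range. */
static int accumulate_folio(struct folio *folio,
			    struct writeback_control *wbc, void *data)
{
	struct pending_write *pw = data;

	if (pw->active && folio->index != pw->next)
		submit_pending(pw);		/* discontiguity: flush */

	if (!pw->active) {
		pw->first = folio->index;
		pw->active = true;
	}
	pw->next = folio_next_index(folio);
	return 0;
}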
On Fri, Feb 24, 2023 at 9:19 AM David Howells <dhowells@redhat.com> wrote:
>
> The skip thing, in my code, is only used in WB_SYNC_NONE mode.  If we hit 5
> things in progress or rescheduling is required, we return to the caller on
> the basis that conflicting flushes appear to be happening in other threads.

Ahh.  *That* is the difference, and I didn't realize.

I made all the skip-write cases the same, and I really meant for that
case to only trigger for WB_SYNC_NONE, but I stupidly didn't notice that
the whole folio_test_dirty() re-test case was done without that
WB_SYNC_NONE case that all the other cases had.

Mea culpa, mea maxima culpa.  That was just me being stupid.

So that case isn't actually a "skip write" case at all, it's actually a
"no write needed at all" case.

Your original patch is the right fix, and I was just being silly for
having not realized.

I'll apply that minimal fix for now - I think the right thing to do is
your bigger patch, but that needs more thinking (or at least splitting
up).

Sorry for the confusion, and thanks for setting me straight,

Linus
On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).

Minimal fix applied - that way I can drop this for now, and we can
discuss the whole "maybe we can just use write_cache_pages()" instead.

Because that _would_ be lovely, even if it's possible that the generic
helper might need some extra love to work better for cifs/afs.

Linus
Matthew Wilcox <willy@infradead.org> wrote:

> I don't understand why waiting for fscache here is even the right thing
> to do.

Then why do we have to wait for PG_writeback to complete?

David
On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
>
> Then why do we have to wait for PG_writeback to complete?

At least for PG_writeback, it's about "the _previous_ dirty write is
still under way, but - since PG_dirty is set again - the page has been
dirtied since".

So we have to start _another_ writeback, because while the current
writeback *might* have written the updated data, that is not at all
certain or clear.

I'm not sure what the fscache rules are.

Linus
On Fri, Feb 24, 2023 at 12:16:49PM -0800, Linus Torvalds wrote:
> On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.

Also, we only have a writeback bit, not a writeback count, and when the
current writeback completes, it'll clear that bit.  We're also being
kind to our backing store by not writing to the same block twice at the
same time.

> I'm not sure what the fscache rules are.

My understanding is that the fscache bit is set under several
circumstances, but if the folio is dirty _and_ the fscache bit is set,
it means the folio is currently being written to the cache device.  I
don't see a conflict there; we can write to the backing store and the
cache device at the same time.
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.

As I understand it, it's also about serialising writes from the same page to
the same backing store.  We don't want them to end up out of order.  I'm not
sure what guarantees, for instance, the block layer gives if two I/O requests
go to the same place.

> I'm not sure what the fscache rules are.

I'm now using PG_fscache in exactly the same way: it means the previous write
to the cache is still under way.  I don't want to start another DIO write to
the cache for the same pages.  Hence the waits/checks on PG_fscache I've added
anywhere we need to wait/check on PG_writeback.

As I mentioned, I'm looking at the possibility of making PG_dirty and
PG_writeback cover *both* cases and recording the difference elsewhere -
thereby returning PG_private_2 to the VM folks who'd like their bit back.
This means, for instance, that when we read from the server and find we need
to write it to the cache, we set a note in the aforementioned elsewhere, mark
the page dirty and leave it to writepages() to effect the write to the cache.

It could get tricky because we have two different places to write to, with
very different characteristics (e.g. a server ~6000km away vs a local SSD),
each with its own queueing, scheduling, bandwidth, etc. - and the local disk
might have to share with the system.

David
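[A minimal sketch of the resulting rule: wait for both kinds of in-flight write before starting another one against the same folio. The helper name is an assumption; folio_wait_writeback() and folio_wait_fscache() are existing kernel helpers.]

#include <linux/netfs.h>
#include <linux/pagemap.h>

static void wait_for_inflight_writes(struct folio *folio)
{
	/* Previous write-back to the server still in flight? */
	if (folio_test_writeback(folio))
		folio_wait_writeback(folio);

	/* Previous DIO write to the local cache still in flight? */
	if (folio_test_fscache(folio))
		folio_wait_fscache(folio);
}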
I am traveling today but will be testing this tonight or tomorrow.
Matthew Wilcox <willy@infradead.org> wrote:

> Also, we only have a writeback bit, not a writeback count, and when the
> current writeback completes, it'll clear that bit.  We're also being
> kind to our backing store by not writing to the same block twice at the
> same time.

It's not so much being kind to the backing store, I think, as avoiding the
possibility that the writes happen out of order.

> > I'm not sure what the fscache rules are.
>
> My understanding is that the fscache bit is set under several
> circumstances, but if the folio is dirty _and_ the fscache bit is set,
> it means the folio is currently being written to the cache device.  I
> don't see a conflict there; we can write to the backing store and the
> cache device at the same time.

The dirty bit has nothing to do with it.  If the fscache bit is set, then the
page is currently being written to the cache - and we need to wait before
starting another write.  Sometimes we start a write to the cache from a clean
page (e.g. we just read it from the server) and sometimes we start a write to
the cache from writepages (e.g. the data is dirty and we're writing it to the
server as well).

Things will become more 'interesting' should we ever get around to
implementing disconnected operation.  Then we might end up staging dirty data
through the cache.

David
Have been doing additional testing of these - and also verified that David's
most recent patch, now in my for-next branch ("cifs: Fix memory leak in direct
I/O"), fixes the remaining problem (the issue with xfstest 208 that Murphy
pointed out).

Will send the three small cifs fixes from David later this week, along with
some unrelated fixes from Paulo and Shyam.

-----Original Message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
Sent: Friday, February 24, 2023 1:06 PM
To: David Howells <dhowells@redhat.com>
Cc: Steven French <Steven.French@microsoft.com>; Vishal Moola <vishal.moola@gmail.com>; Andrew Morton <akpm@linux-foundation.org>; Jan Kara <jack@suse.cz>; pc <pc@cjr.nz>; Matthew Wilcox <willy@infradead.org>; Huang Ying <ying.huang@intel.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Xin Hao <xhao@linux.alibaba.com>; linux-mm@kvack.org; mm-commits@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [EXTERNAL] Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()

On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).

Minimal fix applied - that way I can drop this for now, and we can
discuss the whole "maybe we can just use write_cache_pages()" instead.

Because that _would_ be lovely, even if it's possible that the generic
helper might need some extra love to work better for cifs/afs.

Linus
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5365a3299088..ebfcaae8c437 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2893,8 +2893,9 @@ static int cifs_writepages_region(struct address_space *mapping,
 
 			if (folio_mapping(folio) != mapping ||
 			    !folio_test_dirty(folio)) {
+				start += folio_size(folio);
 				folio_unlock(folio);
-				goto skip_write;
+				continue;
 			}
 
 			if (folio_test_writeback(folio) ||
Here's the simplest fix for cifs_writepages_region() that gets it to work.

Fix the cifs_writepages_region() to just skip over members of the batch that
have been cleaned up rather than retrying them.  I'm not entirely sure why it
fixes it, though.  It's also not the most efficient as, in the common case,
this is going to happen a lot because cifs_extend_writeback() is going to
clean up the contiguous pages in the batch - and then this skip will occur for
those.

Fixes: 3822a7c40997 ("Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
Signed-off-by: David Howells <dhowells@redhat.com>
---